1 Introduction
This book is a compilation of the readings developed for the Fall 2021
Semester offering of Psychology 207: Advanced Statistics.
Please don’t try to sell this book because there are about a million copyright
violations in it.
Best, DB
2 Categorizing and Summarizing Information
2.1 How to Tell a Story
When you tell a story, you usually have to leave some things out. If you saw a
movie and a friend asked you to describe the plot, it wouldn’t be terribly
helpful if you described literally everything that happened in the movie – that
would take longer than just watching the movie – you would likely instead
discuss the basic conflict, the characters, the most interesting dialogue, and
(if your friend doesn’t mind spoilers) the film’s resolution. If your partner
asks you about your day, you wouldn’t respond with a minute-by-minute
description of each interaction you had with each person complete with an
estimate of the amplitude of their speaking voice, the ambient temperature
throughout the day, and how much your breakfast cost to the penny; you might
instead give a description of your global mood, a basic rundown of what you
did during the day, and talk about anything that stood out as particularly
funny, frustrating, or otherwise noteworthy. And if you wrote a history of an
entire country, or described the evolution of species, or told the story of the
formation of the Planet Earth, you’d have to leave a lot of details out – if you
kept all the details in then you wouldn’t survive all the way to the end.
This page is all about data: how we categorize it and how we summarize it.
We are going to cover some of the different types of data, how we summarize
data, and how what we leave in and what we leave out affect the story we
tell about the data. And, of course, I have made some choices on what to
include and what not to include on the page. There’s only so much we can
cover.
In conversational English, pretty much any number that describes a fact can
be called a “statistic.” In the field of statistics, however, the term has a
more specific meaning: a statistic1 is a number that describes a sample or
samples2. Something like the proportion of 100 people polled at random
who answered “yes” to a survey question is a statistic. The average reaction
time of 30 participants in a psychology study is a statistic. Something like
the number of people who live in Canada? That’s not technically a statistic.
Because the people who live in Canada are an entire population3, the
number of them is an example of a parameter4. Thus, statistics describe
samples, and parameters describe populations.
On that second point: scientists are rarely interested only in the subjects used
in their research. There are exceptions – like case studies or some clinical
trials – but usually scientists want to generalize the findings from a sample to
the population. A cognitive researcher isn’t interested in the memory
performance of a few participants in an experiment so much as what their
performance means for all humans; a social psychologist studying the
behavioral effects of prejudice does not mean to describe the effects of
prejudice for just those who participate in the study but for all of society. In
the vast majority of scientific inquiries, it is unrealistic to measure the entire
population on whatever researchers are interested in – even researchers who do
work with population-level data are working with things like census data and
are thus somewhat limited in the types of questions they can investigate (that
is, they can only work with answers to questions asked in the census they are
working with).
By convention, statistics are symbolized with Latin letters (like x̄ and s), while
parameters use Greek letters, such as μ and σ. So, when you see a Greek letter in
an equation, it refers to a population parameter rather than a sample statistic.
Data are bits of information7, and information can take on many forms. The
ways we analyze data depend on the types of data that we have. Here’s a
relatively basic example: suppose I asked a class of students their ages, and I
wanted to summarize those data. A reasonable way to do so would be to take
the average of the students’ ages. Now suppose I asked each student what
their favorite movie was. In that case, it wouldn’t make any sense to report
the average favorite movie – it would make more sense to report the most
popular movie, or the most popular category of movie.
Thus, knowing something about the type of data we have helps us choose the
proper tools for working with our data. Here, we will talk about an extensive
– but not exhaustive – set of data types that are encountered in scientific
investigations.
Discrete data8 are data regarding categories or ranks. They are discrete in
the sense that there are gaps between possible values: whereas a continuous
measurement like length or weight can take on an infinite number of values
between any two given points (e.g., the distances between 1 and 2 meters
include 1.5 meters, 1.25 meters, 1.125 meters, 1.0625 meters, etc.), a
measurement of category membership can generally only take on as many
values as there are categories; and a measurement of ranks can only take on
as many values as there are things to be ranked.
Figure 2.2: Looks like the red face murdered six people in prison, so maybe
that’s more of an emotional pain.
Celsius. That’s pretty cold! Now imagine that the temperature the next day is
2∘ Celsius. That’s still cold! You likely would not say that the day when it’s
2∘ Celsius is twice as warm as the day before when it was 1∘ out, because it
wouldn’t be! The Celsius scale (like the Fahrenheit scale) is measured in
degrees because it is a measure of temperature relative to an arbitrary 0.
Yes, 0∘ isn’t completely arbitrary because it’s the freezing point of water, but
it’s also not like 0∘ C is the bottom point of possible temperatures (0 Kelvin
is, but we’ll get to that in a bit). In that sense, the intervals between Celsius
measurements are consistently meaningful – 2∘ C is 1 degree warmer than
1∘ C – but the ratios between them are not: 15∘ C is not a third as warm as
45∘ C, and −25∘ C is certainly not negative 3 times as warm as 75∘ C.
just replace “a kid” with “in stats class” and “quicksand” with “the
difference between interval and ratio data.” The important thing is to know
that both are continuous data and that continuous data are very different
from discrete data.
So, our man Smitty has taken some heat for leaving out some categories, and
subsequently people have proposed alternate taxonomies. I’m not going to
pile on poor Smitty – leaving things out is a major theme of this page! – but
keeping in mind what we said at the beginning of this section about the type
of data informing the type of analysis, there are a couple of additional
categories that I would add. Data of these types all have a place in Stevens’s
taxonomy, but have special features that allow and/or require special types of
analyses.
Cardinal (count)14 data are, by definition, ratio data: counts start at zero,
and a count of zero is a meaningful zero. But counts also have features of
ordinal (rank) data: counts are discrete (the title of the show Two and a Half
Men is – like most of the dialogue in the show – an unfunny joke15, based on
the absurdity of a non-integer count) and their values imply an ordering
relative to each other. The larger and more varied counts become, the more
appropriate it is to treat them like other ratio data, but data with small counts
(the most famous example of this is the number of 19th century Prussian
soldiers who died by being kicked by horses or mules – as you may imagine,
the counts were pretty small) are distributed in specific ways and are ideally
analyzed using different tools like Poisson and negative binomial modeling.
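As a quick, hedged illustration of why small counts get their own tools (the mean used here is made up for illustration, not the actual horse-kick data): a Poisson distribution with a small mean puts nearly all of its probability on a handful of values, which looks nothing like the spread of most other ratio data.

dpois(0:4, lambda = 0.7)  # probability of observing counts of 0 through 4 when the mean count is 0.7
# most of the probability mass sits on 0 and 1

Counts like these are commonly modeled with glm(..., family = poisson), or with negative binomial models (e.g., MASS::glm.nb()) when the counts are more spread out than a Poisson distribution allows.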
2.2.2.1.3.2 Proportions
Proportions16, like counts, can rightly be categorized as ratio data but have
special features of their own. For example, the variance and standard
deviation of a proportion can be calculated both in the traditional manner
(follow the hyperlinks or scroll down to find the traditional equations) and
in its own way.17 Because proportions by definition are limited to the
range between zero and one, they tend to arrange themselves differently than
data that are unbounded (that is, data look different when they are squished).
Proportional data are also similar to count data in that they A. are often
treated like ratio data (which is not incorrect) and B. are often analyzed using
special tools. In general, distributions of proportions take the shape of beta
distributions – we will talk about those later.
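As a small sketch of that “own way” (offered as an assumption rather than as the text’s own equation 17): for binary 0/1 data with a proportion p of 1s, the population variance works out to p(1 − p), which we can check against the traditional formula.

y <- c(1, 0, 0, 1, 1, 0, 1, 1, 0, 1)  # made-up binary responses
p <- mean(y)                          # the proportion of 1s (0.6 here)
p * (1 - p)                           # variance computed "proportion-style"
var(y) * (length(y) - 1) / length(y)  # the traditional variance rescaled to its population version: same number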
Binary (dichotomous) data18 are, as the name implies, data that can take
one of two different values. Binary data can be categorical – as in yes or no
or pass or fail – or numeric – as in having 0 children or more than 0
children – but regardless can be given the values 0 or 1. Binary data have a
limited set of possible distributions – all 0, all 1, and some 0/some 1. We
will discuss several treatments of binary data, including uses of binomial
probability and logistic regression.
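As a small sketch (with made-up numbers) of the binomial side of that: if each of 10 independent binary trials has a 0.5 probability of coming up 1, the probability of observing exactly 7 ones is

dbinom(7, size = 10, prob = 0.5)
## [1] 0.1171875

and a logistic regression for a binary outcome might look something like glm(passed ~ study_hours, family = binomial, data = my_data), where passed, study_hours, and my_data are hypothetical names used only for illustration.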
The other terminology we will use to describe inputs and outputs is in terms
of predictor variable 21 and predicted (outcome) variable. Those terms are
used in the context of correlation and regression (although the terms
independent and dependent are used there as well). The predictor/predicted
terms are similar to the independent/dependent terms in that the latter is
considered to change as some kind of function of changes in the former. They
are also similar in that the former are usually assigned to be the x variable
and the latter are usually assigned to be the y variable.
The histogram22 is both one of the simplest and one of the most effective
forms of visualizing sets of data. Histograms will be covered at length on the
page on data visualization, but a brief introduction here will be a helpful tool
for describing the different ways that we summarize data.
values on the x-axis, despite the fact that default values for binwidth and/or
number of bins are built in to statistical software packages that produce
histograms: it is up to the person doing the visualizing to choose the width of
bins that best represents the distribution of values in a data set.
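As a quick sketch of how much that choice matters (the data here are made up purely for illustration), the same values can look quite different under different bin settings:

set.seed(1)
fake.x <- rnorm(500, mean = 100, sd = 15)
hist(fake.x, breaks = 5)   # a few wide bins: a coarse summary
hist(fake.x, breaks = 50)  # many narrow bins: more detail, and more noise

The breaks argument can also be a vector of explicit bin edges if you want full control. The NBA histogram described next uses bins that are 50 points wide.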
This histogram represents the number of points scored by each player in the
2019-2020 NBA season (data from Basketball Reference). Each bar
represents the number of players who scored the number of points
represented on the x-axis. The number of points are sorted into bins of 50, so
the first bar represents the number of players who scored 0 – 50 points, the
second bar represents the number of players who scored 51 – 100 points,
etc. All of the bars in a histogram like this one are adjacent to each other,
which is a standard feature of histograms that shows that each bin is
numerically adjacent to the next. That layout implies that the difference between
one bar and the next is a difference in the grouping of the one variable – gaps
between all of the bars (as in bar charts) imply a categorical difference
between observations, which is not the case with histograms. Apparent gaps
in histograms – as we see in Figure 2.3 between 1900 and 1950 and again
between 2000 and 2300 – are really bars with no height. In the case of our
NBA players, nobody scored between 1900 and 1950 points, and nobody
scored between 2000 and 2300 points.24
Again: histograms will be covered in more detail in the page on data
visualization. For now, it suffices to say that histograms are a good way to
see an entire dataset and to pick up on patterns. Thus, we will use a few of
them to help demonstrate what we leave in and what we leave out when we
summarize data.
2.3.2.1 Mean
When we talk about the mean26 in the context of statistics, we are usually
referring to the arithmetic mean of a distribution: the sum of all of the
numbers in a distribution divided by the number of numbers in a distribution.
If x is a variable, xᵢ represents the ith observation of the variable x, and n is
the number of observations, then the sample mean x̄ is given by:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

For example, if x = {1, 2, 3}:

$$\bar{x} = \frac{1 + 2 + 3}{3} = 2.$$
The calculation for a population mean is the same as for a sample mean. In
the equation, we simply exchange x̄ for μ (the Greek letter most similar to
the Latin m) and the lower-case n for a capital N to indicate that we’re
talking about all possible observations (that distinction is less important and
less-frequently observed than the distinction between Latin letters for
statistics and Greek letters for parameters, but I find it useful):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}.$$
likely value – in one flip of a fair coin, the expected value would be
1/2 heads and 1/2 tails, which is absurd (a coin can’t land half-heads
and half-tails) – but it is the value you could expect, on average,
in repeated runs of gambles.
2. For any given set of data x, we can take a number y and find the
errors29 between x and y: xᵢ − y. The mean of x is the value of y that
minimizes the sum of the squared errors.
3. Related to points (1) and (2), the mean can be considered the balance
point of a dataset: for every number or numbers less than the mean, there
is a number or are numbers greater than the mean to balance out the
distance. Mathematically, we can say that:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
For those reasons, the mean is the best measure of central tendency for taking
all values of x into account in summarizing a set of data. While that is often a
positive thing, there are drawbacks to that quality as well, as we are about to
discuss.
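Both of those properties are easy to check with a few made-up numbers (a sketch of mine, not one of the text’s own examples):

demo.x <- c(2, 3, 5, 8, 12)
sum(demo.x - mean(demo.x))        # deviations from the mean sum to zero
sum((demo.x - mean(demo.x))^2)    # total squared error using the mean as the prediction
sum((demo.x - median(demo.x))^2)  # using any other value (here, the median) gives a larger total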
The exception was Star Wars Episode V: The Empire Strikes Back, which
made $203,359,628 in 1980 (that doesn’t count all the money it made in re-
releases), nearly twice as much as the second-highest grossing film (9 to 5,
which is a really good movie but is not part of a larger cinematic universe).
The mean gross of 1980 movies was $24,370,093, but take out The Empire
Strikes Back and the mean was $21,698,607, a difference of about $2.7
million (which is more than what 13% of 1980-released movies made on their
own). The other measures of central tendency don’t move nearly as much: the
median changes by $63,098 depending on whether you include Empire or
not, and the mode doesn’t change at all.
We’re left with a bit of a paradox: the mean is useful because it can balance
all values in a dataset but can be misleading because the effect of outliers on
it can be outsized relative to other measures of central tendency. So is the
mean’s relationship with extreme values a good thing or a bad thing? The
(probably unsatisfying) answer is: it depends. More precisely, it depends on
the story we want to tell about the data. To illustrate, please review a pair
of histograms. Figure 2.6 is a histogram depicting the daily income in one
month for an imaginary person who works at an imaginary job where they get
paid imaginary money once a month.
Figure 2.6: Daily Expenditures for a Real Month for an Imaginary Person
We can see that for 29 of the 30 days in September, this imaginary person has
negative net expenditures – they spend more money than they earn – and for
one day they have positive net expenditures (the day that they both get paid
and have to pay the rent) – they earn much more than they spend. That day
with positive net expenditures is the day of the month when they get paid.
Payday is a clear outlier – it sits way out from the rest of the distribution of
daily expenditures. But, if we exclude that outlier, the average daily
expenditure for our imaginary person is $-35.19, and if we include the outlier,
the average daily expenditure is $5.99 – the difference between our
imaginary person losing money every month and earning money every
month. Thus, in this case, using the mean with all values of x is a better
representation of the financial experience of our imaginary hero.
Now, let’s look at another histogram, this one with a dataset of 2 people.
Figure 2.7 is a histogram of the distribution of years spent as President of the
United States of America in the dataset
x = {me, Franklin Delano Roosevelt}.
Here is a case where using the mean is obviously misleading. Yes, it is true
that the average number of years spent as President of the United States
between me and Franklin Delano Roosevelt is six years. I didn’t contribute
anything to that number: I’ve never been president and I don’t really care to
ever be president. So, to say that I am part of a group of people that averages
six years in office is true, but truly useless. Thus, some judgment is required
when choosing to use the mean to summarize data.
2. This is also going to be true of the median, and to a lesser extent the
mode, but using the mean to summarize data leaves out information
about the shape of the distribution beyond the impact of outliers. In
Figure 2.8, we see three distributions of data with the same mean but
very different shapes.
Figure 2.8: Histogram of Three Distributions with the Same Mean
2.3.2.3 Median
The median32 is the value that splits a distribution evenly in two parts. If
there are n numbers in a dataset, and n is odd, then the median is the
(n + 1)/2th largest value in the set; if n is even, then the median is the
average of the (n/2)th and the (n/2 + 1)th largest values. That makes it sound
a lot more complicated than it is – here are two examples to make it easier:

if x = {1, 2, 3, 4, 5}, then median(x) = 3

if x = {1, 2, 3, 4}, then median(x) = (2 + 3)/2 = 2.5
The median is used for a lot of skewed distributions in lieu of the mean not
only because it is more resistant to outliers than is the mean, but also
because it minimizes the absolute errors made by predictions. By absolute
errors we mean the absolute value of the errors |xᵢ − y|, where y is the
prediction and xᵢ is one of the predicted scores. Thus, when we use the
median as our prediction, we minimize the total absolute error (a short
demonstration appears at the end of this subsection).
2. Like the mean, the median does not tell us much about the shape of a
distribution.
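Here is the short demonstration promised above – a sketch with made-up numbers showing that the median produces a smaller total absolute error than the mean does:

demo.x <- c(1, 2, 2, 3, 40)
sum(abs(demo.x - median(demo.x)))  # total absolute error using the median (2) as the prediction
sum(abs(demo.x - mean(demo.x)))    # total absolute error using the mean (9.6) is larger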
2.3.2.4 Mode
The mode34 is the most likely value or values of a distribution to
be observed. A distribution is unimodal if it has one clear peak, as in part a
of Figure 2.9. A distribution is bimodal if it has two clear peaks, as in part b.
Distributions with more than one peak are collectively known as multimodal
distributions.
3. Uniquely among the mean, median, and mode, the mode can be used
with all kinds of continuous data and all kinds of discrete data. There is
no way to take the mean or median of categorical data. You can but
probably shouldn’t use the mean of rank data36 (the median is fine to
use with rank data). Because the mode is the most frequent value, it can
be the most frequent continuous value (or range of continuous values,
depending on the precision of the measurement), the most frequent
response on an ordinal scale, or the most frequent observed category.
4. Unlike the mean and the median, the mode can tell us if a distribution
has multiple peaks.
1. Like the other measures of central tendency, the mode doesn’t tell us
anything about the spread, the shape, or the height on each side of the
peak of a distribution. There are some sources out there – and I know
this because I have never said it in class nor written it down in a text but
have frequently encountered it as an answer to the question what is a
drawback of using the mode? – that say that the mode is for some
reason less useful because “it only takes one value of a distribution into
account.” That’s wrong for one reason – there can be more than one
mode, so it doesn’t necessarily take only one value into account – and
misleading for another: the peak of a distribution depends, in part, on
all the other values being less high than the peak. To me, saying that
the peak of a distribution only considers one value is like saying that
identifying the winner of a footrace only takes one runner into account. I
think the germ of a good idea in that statement is that we don’t know
how high the peak of a distribution is, or what the distribution around it
looks like, but that’s a problem with central tendency in general.
2. Although, related to that last part: the mode doesn’t really account for
extreme values, but neither does the median.
2.3.3 Quantiles
Quantiles37 are values that divide distributions into sections of equal size.
We have already discussed one quantile: the median, which divides a
distribution into two sections of equal size. Other commonly-used quantiles
are:
Quantile Name | Divides The Distribution Into
Quintiles | Fifths
Quartiles | Fourths
Deciles | Tenths
Percentiles | Hundredths
At points where quantiles coincide, their names are often interchanged. For
example, the median is also known as the 50th percentile and vice versa, the
first quartile is often known as the 25th percentile and vice versa, etc.
not a number that is easily divisible by a lot of other numbers, it gets a bit
more complicated because there are all kinds of tiebreakers and different
algorithms and stuff. The things that are important to know about finding
quantiles are:
1. We can use software to find them (for example, the R code is below),
2. Occasionally, different software will disagree with each other and/or
what you would get by counting out equal proportions of a distribution
by hand, and
3. Any disagreement between methods will be pretty small, probably
inconsequential, and explainable by checking on the algorithm each
method uses.
2.3.4 Spread
2.3.4.1 Range
The Range38 is expressed either as the minimum and maximum values of a
variable (e.g., x is between a and b), or as the difference between the highest
value in a distribution and the lowest value in a distribution (e.g., the range of
x is b − a). For example, if the smallest value of x is 0 and the largest is 55:
Range(x) = 55 − 0 = 55
The range is highly susceptible to outliers: just one weird min or max value
can risk gross misrepresentation of the dataset. For that reason, researchers
tend to favor our next measure of spread…
The Interquartile range 39 is the width of the middle 50% of the data in a
set. To find the interquartile range, we simply subtract the 25th percentile of
a dataset from the 75th percentile of the data.
2.3.4.3 Variance
Variance 40 is both a general descriptor of the way things are distributed (we
could, for example, talk about variance in opinion without collecting any
tangible data) and a specific summary statistic that can be used to evaluate
data. The variance is, along with the mean, one of the two key statistics in
making inferences.
There are two equations for variance: one is for a population variance
parameter, and the other is for a sample variance statistic. Both equations
represent the average squared error of a distribution. However, applying the
population formula to a sample would consistently underestimate the
variance of the population the sample came from,41 and thus an adjustment is
made in the denominator.

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Aside from the differences in symbols between the population and sample
equations – Greek letters for population parameters are replaced by their
Latin equivalents for the sample statistics – the main difference is in the
denominator. The key reason for the difference is related to the nature of μ
and x̄. For any given population, there is only one population mean – that’s μ
– but there can be infinite values of x̄ (it just depends on which values you
sample). In turn, that means that the error term x − x̄ depends on the mean of
the sample (which, again, itself can vary). That mean is going to vary a lot
more if n is small – it’s a lot easier to sample three values with a mean
wildly different from the population mean than it is to sample a million
values with a mean wildly different from the population mean – so the bias
that comes from using the population mean equation to calculate sample
variance is bigger for small n and smaller for big n.
To correct for that bias, the sample variance equation divides the squared
errors not by the number of observations but by the number of observations
that are free to vary given the sample mean. That sounds like a very weird
concept, but hopefully this example will help:
If the mean of a set of five numbers is 3 and the first four numbers are
{1, 2, 4, 5}, what is the fifth number?
3 = (1 + 2 + 4 + 5 + x)/5
15 = 12 + x
x = 3
This means that if we know the mean and all but one of the n values in a
dataset, then we can always find out what that last one is. In turn, that means
that if you know the mean, then all of the values of a dataset are free to vary
except one: that last one has to be whatever value makes the mean of all of
the values come out to x̄. In general, the term that describes the number of things
that are free to vary is degrees of freedom42, and given a known sample mean,
the degrees of freedom – abbreviated df – is equal to n − 1, so that is
the denominator we use for the sample variance.
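A quick check of that claim in R, reusing the five numbers from the example above:

demo.x <- c(1, 2, 4, 5, 3)
n <- length(demo.x)
sum((demo.x - mean(demo.x))^2) / (n - 1)  # divide by the degrees of freedom...
var(demo.x)                               # ...and you get exactly what var() reports
sum((demo.x - mean(demo.x))^2) / n        # dividing by n instead gives a smaller (biased) value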
As the standard deviation is the square root of the variance, there is both a
population parameter for standard deviation – which is the square root of the
population parameter for variance – and a sample statistic for standard
deviation – which is the square root of the sample statistic for variance.
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
As with the variance, the default for statistical software is to give the sample
version of the standard deviation, so multiply the result by √(n − 1)/√n to
get the population parameter.
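For example (a sketch using the same five numbers as above):

demo.x <- c(1, 2, 4, 5, 3)
n <- length(demo.x)
sd(demo.x)                                # the sample standard deviation (n - 1 in the denominator)
sd(demo.x) * sqrt(n - 1) / sqrt(n)        # rescaled to the population version
sqrt(sum((demo.x - mean(demo.x))^2) / n)  # the same thing, computed directly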
2.3.5 Skew
Like the variance, the skew44 or skewness is both a descriptor of the shape
of a distribution and a summary statistic that can be used to evaluate the way
a variable is distributed. Unlike the variance, the skewness statistic isn’t
used in many statistical tests, so here we will focus more on skewness as a
shape and less on skewness as a quantity.
When we talked about the mode, we talked about the peak (or peaks) of
distributions. Here we will introduce another physical feature of
distributions: tails45. A tail of a distribution is the longish, flattish part of a
distribution furthest away from the peak. For a positively skewed
distribution, there is a long tail on the positive side of the peak and a short
tail or no tail on the negative side of the peak. For a negatively skewed
distribution, there is a long tail on the negative side of the peak and a short
tail or no tail on the positive side of the peak. A symmetric distribution has
symmetric tails. We’ll talk lots more about tails in the section on kurtosis.
The term skewness is really most meaningful when talking about unimodal
distributions – as you can imagine, having multiple peaks would make it
difficult to evaluate the relative size of tails. If, for example, you have one
relatively large peak and one relatively small peak, is the small peak part of
the tail of the large peak? It’s best not to get into those kinds of philosophical
arguments when describing distributions: in a multimodal distribution, the
multimodality is likely a more important feature than the skewness.
The skewness statistic is defined as the standardized third moment of a
distribution (for more on what that barely-in-English phrase means, please see
the bonus content below):

$$\text{Skewness}(x) = E\left[\left(\frac{x - \mu}{\sigma}\right)^3\right],$$

where E is the expected value, μ is the mean, and σ is the standard deviation.
2.3.6 Kurtosis
The kurtosis statistic is used even less frequently than the skewness statistic
(this section is strictly for the curious reader). For a perfectly mesokurtic
distribution (read: a normal distribution), the kurtosis is 3.⁴⁷

$$\text{Kurtosis}(x) = E\left[\left(\frac{x - \mu}{\sigma}\right)^4\right] = \frac{N\sum_{i=1}^{N}(x_i - \mu)^4}{\left(\sum_{i=1}^{N}(x_i - \mu)^2\right)^2}$$
2.3.6.1.2 Sample Kurtosis
The kurtosis of sample data is given by:

$$\text{sample kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{s^4}$$

(The sample excess kurtosis subtracts a correction term from this quantity: it
is the expression above minus $\frac{3(n-1)^2}{(n-2)(n-3)}$.)
2.4 R Commands
2.4.1 mean
mean()48
example:
x<-c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
mean(x)
## [1] 4
2.4.2 median
median(x)
## [1] 4
2.4.3 mode
There is no built-in R function to get the mode (there is a mode() function,
but it means something different). But, we can install the package DescTools
to get the mode we’re looking for.
install.packages("DescTools")49
library(DescTools)
Mode(x)
## [1] 4
## attr(,"freq")
## [1] 4
2.4.4 quantiles
quantile(x, probs)
quantile(x, 0.8)
## 80%
## 5
Example 2: quartiles of x
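One way to get the quartiles of x (a suggestion; this assumes R’s default quantile algorithm):

quantile(x, c(0.25, 0.5, 0.75))  # returns 3, 4, and 5 for the x defined above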
2.4.5 range
The range() command returns the endpoints of the range.
example:
range(x)
## [1] 1 7
To get the size of the range, you can use either range(x)[2]-range(x)[1]
or max(x)-min(x)
range(x)[2]-range(x)[1]
## [1] 6
max(x)-min(x)
## [1] 6
2.4.6 variance
var()
example:
var(x)
## [1] 2.666667
2.4.7 standard deviation
sd()
example:
sd(x)
## [1] 1.632993
2.4.8 skewness and kurtosis
install.packages("e1071")50
library(e1071)
skewness(x)
## [1] 0
kurtosis(x)
## [1] -0.9609375
The mean, variance, skewness, and kurtosis of a distribution are all ways to
describe different aspects of the distribution. They are also mathematically
related to each other: each is derived using the method of moments, a really
old idea (in terms of the history of statistics, which is a relatively young
branch of math) that isn’t actively used much anymore (for reasons we’ll
discuss in a bit). The basic idea of the method of moments is borrowed from
physics: for an object spinning around an axis, the zeroth moment is the
object’s mass, the first moment is the mass times the center of gravity, and the
second moment is the rotational inertia of the object. When applied to a
distribution of data, the rth moment m of a distribution of size n is given by:
$$m_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r$$

The mean is defined as the first moment (taken about zero rather than about
x̄), and the variance is defined as the second moment.
The skewness is the standardized third moment. The third moment by itself is
in terms of the (cubed) units of x – if x is in pounds, the third moment is in
cubed pounds; if x is in volts, the third moment is in cubed volts – and that
doesn’t make a ton of sense when talking about the shape of a histogram (it
would be weird to say that the distribution is skewed by three cubed pounds
to the left). So, instead, the third moment is standardized by dividing by the
cube (to match the cube in the numerator) of the standard deviation (which is
the variance to the power of 3/2). Similarly, the kurtosis is the standardized
fourth moment: the ratio of the fourth moment to the standard deviation to the
fourth power (which is the square of the variance, or, the second moment).
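As a rough sketch of those relationships (using the same x as in the R commands above), we can compute the moments directly and compare them to the summary statistics they produce:

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
m <- function(r) sum((x - mean(x))^r) / length(x)  # the r-th moment about the mean
m(2)               # the second moment: the population-style variance
m(3) / m(2)^(3/2)  # standardized third moment: skewness (0 here, since x is symmetric)
m(4) / m(2)^2      # standardized fourth moment: kurtosis (3 for a normal distribution)

Note that e1071’s kurtosis() reports a small-sample-adjusted excess kurtosis, which is why its output above (-0.96) is not simply this last number minus 3.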
6. *there are probably exceptions, but I can’t really think of any right
now↩
7. The word data is the plural of the word datum and should take plural
verb forms but often doesn’t. It’s a good habit to say things like “the
data are” and “the data show” rather than things like “the data is” and
“the data shows” in scientific communication. It’s a bad habit to correct
people when they use data as a singular noun: try to let that stuff slide
because we live in a society.↩
11. Data that can take on infinite values in a limited range (i.e. data with
values that are infinitely divisible).↩
12. Interval data: Continuous data with meaningful mathematical
differences between values.↩
18. Binary (dichotomous) data: data that can take one of two values; those
values are often assigned the values of either 0 or 1.↩
24. In case you’re interested: the two little bars at 1950 – 2000 points and
at 2300 – 2350 points each represent single players: Damian Lillard
(who scored 1,978 points) and James Harden (who scored 2,335
points). Congratulations, Dame and James!↩
25. Central tendency: a summarization of the overall position of a
distribution of data.↩
27. There are other means than the arithmetic mean, most famously the
geometric mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$↩
29. Error: the difference between a prediction and an observed value, also
known as residual and deviation↩
30. Outlier: a datum that is substantially different from the data to which it
belongs.↩
31. Information loss is also a major theme of this page – it’s what the
introduction was all about!↩
32. Median: The value for which an equal number of data in a set are less
than the value and greater than the value, also known as the 50th
percentile.↩
33. Let’s put income skew this way: if you put Jeff Bezos in a room with
any 30 people, on average, everybody in that room would be making
billions of dollars a year.↩
37. Quantiles: values that divide the distributions into equal parts↩
38. Range: The difference between the largest and smallest values in a
distribution of data.↩
39. Interquartile Range: The difference between the 75th percentile value
and the 25th percentile value of a distribution of data; the range of the
central 50% of values in a dataset.↩
42. Degrees of freedom (df): the number of items in a set that are free to
vary.↩
43. Standard deviation: The typical deviation from the mean of the data set,
equal to the square root of the variance.↩
44. Skew (or skewness): The balance of a distribution about its center↩
45. Tail(s): the area or areas of a distribution furthest from the peak.↩
47. Some prefer a kurtosis statistic that is equal to zero for a mesokurtic
distribution and so sometimes you will see an excess kurtosis statistic
that is recentered at 0↩
49. You only need to install a package once to your computer. After that,
every time you start a new R session, you just have to call
library(insert package name) to turn it on. The best analogy I
have encountered to describe the process is from Nathaniel D.
Phillips’s book YaRrr! The Pirate’s Guide to R: a package is like a
lightbulb – install.packages() puts the lightbulb in the socket and
then library() turns it on.↩
50. You only need to install a package once to your computer. After that,
every time you start a new R session, you just have to call
library(insert package name) to turn it on. The best analogy I
have encountered to describe the process is from Nathaniel D.
Phillips’s book YaRrr! The Pirate’s Guide to R: a package is like a
lightbulb – install.packages() puts the lightbulb in the socket and
then library() turns it on.↩
3 Visual Displays of Data
3.1 About this Page
All of the original visualizations on this page were made using R. Good
visualization goes far beyond the software used to make it! Good visualization
can be done with a pencil and paper, and it can certainly be done with all
kinds of different packages. However, R happens to be excellent software
for data visualization because of all of the packages that have been developed
to work in R, so all of the packages and code used for the original figures are
visible on this page.
## Figure 3
boxplot1.df<-data.frame(rnorm(1000))
colnames(boxplot1.df)<-"data"
## Figure 4
condition1<-rnorm(1000, 4, 4)
condition2<-rnorm(1000, 8, 6)
values<-c(condition1, condition2)
labels<-c(rep("Condition 1", 1000), rep("Condition 2", 1000))
boxplot.df<-data.frame(labels, values)
## Figure 5
barchart.df<-boxplot.df
## Figure 6
## Figure 7
samplehist.df<-data.frame(rnorm(10000))
colnames(samplehist.df)<-"x"
## Figure 8
x1<-rnorm(10000)
x2<-rnorm(10000, 3, 1)
### 8a
comp.hist.df<-data.frame(x1, x2)
### 8b
comp.hist.long<-data.frame(c(rep("Variable 1", 10000), rep("Va
## Figure 12
N <- 200 ### Number of random samples
## Figure 15
clrs <- fpColors(box="royalblue",line="darkblue", summary="roy
labeltext<-c("Variable", "a", "b", "c", "d", "e")
mean<-c(NA, 0.2, 1.3, 0.4, -2.1, -2.0)
lower<-c(NA, -0.1, 0.7, 0, -2.4, -2.5)
upper<-c(NA, 0.5, 2, 0.8, -1.8, -1.3)
## Figure 16 is a reproduction
## Figure 17
## Figure 18
edges = data.frame(N1 = c("A1", "A1", "A1", "B1", "B1", "B1"),
N2 = c("A2", "B2", "C2", "A2", "B2", "C2"),
Value = c(33, 33, 10, 21, 54, 13),
stringsAsFactors = F)
## Figure 19 is a reproduction
## Figure 20
Values<-c(rchisq(10000, df=1),
rchisq(10000, df=2),
rchisq(10000, df=3),
rchisq(10000, df=4),
rchisq(10000, df=5),
rchisq(10000, df=6),
rchisq(10000, df=7),
rchisq(10000, df=8),
rchisq(10000, df=9),
rchisq(10000, df=10),
rchisq(10000, df=11),
rchisq(10000, df=12))
df<-rep(1:12, each=10000)
small.multiple.df<-data.frame(df, Values)
When we share data, we are teachers of the content of our science, and good
data visualization is one of our most powerful teaching tools. It’s also really
easy to mislead and distract with bad data visualization: the responsibility
lies with us to be effective and honest communicators.
3.3 Essentials of Good Visualization
Modern software is making it increasingly easy to create visualizations of all
kinds of data.1 Regardless of the simplicity or complexity of figures, there are
several principles that apply to all good data visualization.
For example, please see the pair of charts in Figure 3.2. Chart A is based on
the house style of the magazine The Economist. It’s not bad! But, there are a
few unnecessary elements. Please compare Chart A to Chart B: the exact
same data are represented, and you lose nothing by removing the background
color, nor by removing the horizontal gridlines, nor by removing the ticks on
the axes, nor by removing the axis lines. In fact, we gain focus on the data in
Chart B by removing all of the unnecessary elements. If possible, remove all
of the elements that you can while maintaining all of the information necessary
to understand the data, and when in doubt on whether to remove an element,
go ahead and try your figure without it: you may find that it wasn’t as
necessary as you thought.
lines<-ggplot(figure1data, aes(x))+geom_histogram(binwidth=0.1
theme_economist()+
ggtitle("A")
nolines<-ggplot(figure1data, aes(x))+geom_histogram(binwidth=0
theme_tufte(ticks=FALSE, base_size=16, base_family="sans")+
ggtitle("B")
keepitclean<-plot_grid(lines, nolines, ncol=2)
keepitclean
Figure 3.2: Keep it Clean!
So, how can we know precisely what the dimensions of our figures are
without gridlines? How do we know exactly how high a bar is, or where
exactly a point lies, without ticks on the axes for reference? Well, here’s the
thing: you don’t need to know any of that stuff, because:
If the reader needs to know precise values, put them in text and/or a table.
The purpose of figures is not to show, for example, the means of two sets of
data but to help people get an idea of the relative magnitudes of those means.
Data visualizations that are meant to elicit careful examination from the reader
– to ask the reader to stare really closely at a figure to discern tiny differences
and distances from ticks and gridlines – are counterproductive because they
make your story harder to understand instead of easier. Precision is important,
but that’s what text and tables are for.
Not only are such figures unnecessary, they’re often wrong: except in the rare
case that a proportion is exactly 9/10, there has to be rounding involved. So,
just use a number! If you want, you can make it really big to get people’s
attention, like this:
90%
People are generally pretty good about understanding single values – there’s
no reason to insult their intelligence (and potentially be inaccurate) with
paper-doll-looking figures.
A close relative of the pie chart – and one with the same fundamental problem
as the pie chart – is the donut chart. Here’s an example that I got from
datavizcatalogue.com:
And here is its even more insidious cousin, the 3D donut chart (from
amcharts.com):
And here is an absolute monstrosity from slidemembers.com, look upon it and
gaze into the face of pure evil:
As mentioned earlier, angles are hard enough for us to process. We gain
nothing from seeing the side of a donut chart, or seeing it from multiple angles,
or torn apart and reassembled. Don’t do any of that stuff. Which brings us to…
3.3.4 Ducks
Duck is another Tufte term – it refers to any kind of ornamentation on a figure
that has no actual relevance to the data.
Figure 3.3: The Big Duck in Flanders, NY
Tufte got the term “duck” from the building pictured above. All of the ducky
elements of the building are functionally useless: they are just for decoration
(it used to be a place that sold ducks and duck eggs, now it’s a tourist
attraction).
3.3.5 Annotations
Not all additions to charts are ducks: some are quite useful. Annotations can
help draw attention to the important parts of a visualization. NBA analyst (and
one-time NBA executive, and geographer by training) Kirk Goldsberry does
really nice work with annotation in his basketball-themed data visualizations,
for example, this chart shows Stephen Curry’s shooting results from the 2012-
2013 NBA season: with annotations on top of the patterns, he provides more
information about the patterns in the data and calls attention to the main points.
The benefit of annotation can be as simple as replacing legends to improve
readability. In Figure 3.4, the chart on the right uses annotations instead of the
legend shown in the chart on the left.
Replacing the legend with annotations reduces the effort the reader has to
make2: their eyes don’t have to leave the lines of data. That may seem like an
extremely small level of effort to be saving, but here’s an important fact about
scientific writing:
Reading scientific writing is exhausting and every little bit of relief helps.
3.3.6 Lying
The classic example of misleading with data visualization is starting a y-axis
at a value other than zero. For example, the following chart uses a y-axis that
starts at 6,000,000 to exaggerate the differences between the two bars:
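A minimal sketch of the honest alternative in ggplot2 (df, group, and count are hypothetical names used only for illustration, not objects from this page):

library(ggplot2)
misleading <- ggplot(df, aes(group, count)) + geom_col() +
  coord_cartesian(ylim = c(6000000, 6500000))  # zooming the axis exaggerates the gap between bars
honest <- ggplot(df, aes(group, count)) + geom_col()  # geom_col() starts the bars at zero by default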
3.3.7 Colors
There is nothing I can write about the use of colors in data visualization that
could possibly improve on this blog post by Lisa Charlotte Rost. Go read it.
3.3.8 Fonts
As noted above, the readers of scientific content are working hard, and any
little bit you can do to give them a break is a good deed done. Sans serif fonts,
generally speaking, are easier to read than serif fonts because they convey the
same amount of information with less ornamentation.3 So, if given the choice,
sans serif fonts are preferable. However, it’s also nicer-looking if the font in
the figures matches the text of a document. I tend to think that mismatched fonts
are jarring to the extent that it outweighs any benefit conferred by sans-serif
fonts, so if you know your text is going to be, say, written in Times New
Roman, I would use that in your figures as well.
3.4 Types of Visualization
Here we’re going to run through some of the more common and useful forms
of data visualization. This is by no means an exhaustive list; in fact, there
really is no exhaustive list of data visualizations because new ones can
always be created. But, using relatively popular forms like the ones below
(when appropriate) has the advantage of leveraging people’s experience with
these forms for understanding your data, which will likely save your reader
some cognitive effort.
3.4.1 Boxplots
The lines in a boxplot are labeled in Figure 3.5. The horizontal line across the
box is usually the median and the lower and upper sides of the box are usually
the 25th and 75th percentiles, respectively. The whiskers – those lines that
extend on either side from the center of each box – represent a definition of
the range of values not considered to be outliers. Following Tukey’s
recommendations, the default in R is that the top whisker extends from the
75th percentile to the largest observed value that is no more than 1.5 times the
interquartile range above the 75th percentile. In other words, the length of the
line is at most 1.5 times the interquartile range statistic: it can be a little less
based on where the observed data lie. Likewise, the bottom whisker extends
from the 25th percentile down to the smallest observed value that is no more
than 1.5 times the interquartile range below the 25th percentile.
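A rough sketch of that rule in code (an illustration of the idea, not the exact internals of the plotting functions):

set.seed(1)
demo.y <- rnorm(1000)
q <- quantile(demo.y, c(0.25, 0.75))
iqr <- q[2] - q[1]
max(demo.y[demo.y <= q[2] + 1.5 * iqr])  # approximately where the upper whisker ends
min(demo.y[demo.y >= q[1] - 1.5 * iqr])  # approximately where the lower whisker ends
boxplot.stats(demo.y)$stats              # R's own five numbers (it uses Tukey's hinges, which can differ very slightly from quantile())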
ggplot(boxplot1.df, aes(y=data))+
geom_boxplot()+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(x="Dataset", y="Values")+
theme(axis.text.x=element_blank(), axis.title.x=element_blan
annotate("text", x=1.75, y=0, label="median", size=6)+
annotate("text", x=c(-1.75, -1.75), y=c(-0.652, 0.667), labe
annotate("text", x=c(0.8, 0.8), y=c(-2.508464,2.560529), lab
annotate("text", x=c(0.8, 0.8), y=c(-2.508464,2.560529), l
annotate("text", x=c(-0.8, -0.8), y=c(2.973052, -3.03701), l
geom_segment(x=1.2, xend=0.5, y=0, yend=0, arrow=arrow())+
geom_segment(x=-1.2, xend=-0.5, y=-0.652, yend=-0.652, arrow
geom_segment(x=-1.2, xend=-0.5, y=0.667, yend=0.6667, arrow=
geom_segment(x=0.76, xend=0.05, y=2.560529, yend=2.560529, a
geom_segment(x=-0.76, xend=-0.05, y=2.973052, yend=2.973052,
geom_segment(x=-0.76, xend=-0.05, y=-3.03701, yend=-3.03701,
xlim(-2, 2)
Boxplots are useful because they present a simplified view of the shape of a
distribution. For example, if the 25th percentile line is much closer to the
median line than the 75th percentile line is, that’s an indicator that the smaller
values are more bunched together and that the distribution has a positive
skew. Boxplots are even more useful when we compare boxes between
different datasets, as in the example in Figure 3.6. When we have boxes for
multiple groups, we can easily compare the median of one group to another,
the percentiles of one group to another, compare the outliers in each group to
each other, etc.
Like boxplots, bar charts are visualizations of summary statistics. Bar charts
differ from boxplots because they tend to represent fewer summary stats than
do boxplots: typically, they show means and sometimes indicate measures of
variance about the means.
Figure 3.7 is a sample bar chart: each bar represents a mean. These are the
same data used in Figure 3.6, so we can compare the bars to the boxplots
above.
Figure 3.8 shows a common variation on the simple bar chart: the grouped bar
chart. In this type of chart, we group bars together for easy comparison: for
example, in two experiments with two conditions in each experiment, it makes
sense to use two groups of two bars (with error bars on each bar, naturally).
3.4.3 Histograms
Histograms, as described in the page on categorizing and summarizing
information, are simple and effective ways of showing the entire distribution
of a single variable in visual form. The bars of a histogram represent on the y-
axis the frequency or proportion (either is fine but make sure you label which
one you’re using on the y-axis!) of values in ranges – known as bins – defined
on the x-axis. The bars on a histogram are adjacent to each other, indicating
that membership in a bin is based only on the way the bins are defined: values
in one bin are greater than the values in the bin on the left and less than the
values in the bin on the right but are not categorically different (as they are in
bar charts). Apparent gaps in the x-axis indicate that there are no values in the
data that fit that bin (but the bin is still there). An example histogram is shown
in Figure 3.9:
ggplot(samplehist.df, aes(x))+
geom_histogram(binwidth=0.1)+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)+
labs(x="Variable Values", y="Frequency")
Visualizing the data in Figure 3.9 shows us that the distribution of the variable
is roughly symmetric, with a peak around 0 and tails reaching to −4 and 4.
There are relatively many observations between −1 and 1 and relatively few
less than −2 or greater than 2.
set.seed(77)
colnames(comp.hist.long)<-c("Variable", "Values")
hist1<-ggplot(comp.hist.df, aes(x1))+
geom_histogram(binwidth=0.1)+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(x="Values of Variable 1", y="Frequency")+
scale_x_continuous(limits = c(-4, 7))+
ggtitle("A: Stacked")
hist2<-ggplot(comp.hist.df, aes(x2))+
geom_histogram(binwidth=0.1)+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(x="Values of Variable 2", y="Frequency")+
scale_x_continuous(limits = c(-4, 7))+
ggtitle(" ")
There are several options that combine the shape information of histograms
with the group-comparison layout of bar charts. One method is known as a
jitterplot, where the individual data points are represented by literal points.
Figure 3.11 is an example using the data used to make the bar chart in Figure 3.7.
The jitterplot gives a visual representation of the dispersion of the data that
may be more salient than error bars. One potential drawback of a jitterplot is
that the horizontal distribution of the data is artificial – if multiple points in a
dataset have the same value, a jitterplot forces them apart horizontally so that
they don’t occupy the same space and all points are visible. Thus, the width
adds another dimension that might overestimate the perception of data
dispersion.
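A minimal sketch of how such a plot can be built with the long-format data frame created earlier on this page (illustrative code, not the code behind Figure 3.11):

library(ggplot2)
ggplot(boxplot.df, aes(x = labels, y = values)) +
  geom_jitter(width = 0.2, alpha = 0.3)  # width controls the artificial horizontal spread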
Another option is the violin plot. Like a jitterplot, a violin plot uses width to
indicate concentration of data within a set. Figure 3.12 is an example violin
plot, again using the same data as in Figure 3.7 and Figure 3.11:
3.4.5 Scatterplots
Scatterplots are visualizations of pairs of measurements. Each point in a
scatterplot represents two measurements related to the same element: one
measurement is represented by the x-coordinate and the other is represented
by the y coordinate. Scatterplots are closely affiliated with the statistical
analyses of correlation and regression: a line of best fit (also known as a
least-squares regression line) is often included in scatterplots to highlight the
predominant trend in the data, and that line is determined on the basis of
correlation and regression analysis statistics. Figure 3.14 presents an example
of a scatterplot with a line of best fit based on a linear regression model.
While line charts are fairly straightforward, I would recommend one bit of
caution: watch out for the post hoc ergo propter hoc4 fallacy: just because
event B comes after event A does not mean that A caused B. While time-
based charts help us put events in chronological context, we must always be
aware that other factors may be at play.
3.4.8 Forestplots
Forestplots are kind of like box-and-whisker plots where the boxes are
simpler and the whiskers get most of the attention. They are visualizations of
interval estimates5. For example, Figure 3.17 shows interval estimates for
five variables (labeled A, B, C , D, and E).
3.4.9 Heatmaps
Heatmaps use different hues to indicate patterns of intensity. They are useful
for visualizing data that vary by region – if you were to make a heatmap of the
places in your home where you spend the most time, and you’re like me, you
might have a high-intensity area on your favorite part of the couch and areas of
slightly lower intensity on either side of your favorite part of the couch.
The combination of spatial data and intensity data makes heatmaps well-suited
to visualizing functional magnetic resonance imaging (fMRI) data. In mapping
the flow of oxygenated blood to parts of the brain that are active during tasks
of interest, fMRI data show which regions are most intensely activated
during a task, which are somewhat less activated, and which are not
activated at all (relative to baseline activity, that is). For example, Figure 3.18,
taken from an article on brain activity in songbirds, indicates the areas of the
brains of finches that respond to auditory stimuli (the article describes how
you get the bird into the brain scanner, if you’re curious). The white areas
indicate more intense activation, and the red areas indicate less intense
responses.
Figure 3.18: Sample Heatmap from fMRI Data
Here is another map, this one of the locations of Hospitals in the USA
according to 2017 data from the American Hospital Association:
Those maps look suspiciously similar! Are they building hospitals near
places with lots of McDonald’s locations? Like, is McDonald’s food so
unhealthy that there is higher demand for health services? IS THERE A
CONSPIRACY BETWEEN BIG FAST FOOD AND BIG HOSPITAL?
No. No, there is not. There is a clear mediating variable, and it’s population
density. Figure 3.19 is a map of population density in the United States:
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pop_dens,
group = group))
p2
From a weather map, we can learn things about local climates and make
inferences about future weather events; for example: the weather in New York
City at any given time is a good predictor of the weather in Boston a few
hours later.
However, there are lots of things that don’t spread like weather. Below is a
map from the US Census indicating the rate of uninsured individuals by state.
This is a good example of how policy differences aren’t necessarily
predictable between neighboring entities. In the map below, we can see that
Arkansas has a low rate of uninsured people relative to neighboring states,
particularly Oklahoma, Texas, and Mississippi – all of which, unlike
Arkansas, have rejected the expansion of Medicaid, the federally-funded
insurance program for low-income and/or disabled individuals. For
somebody who is quite familiar with the locations of US states on a map, a
map like this might serve as an easy-to-read reference, but it is misleading to
think of the uninsured rate as a feature of geography, as things like climate,
latitude, proximity to bodies of water, etc. are far less relevant than local
policy initiatives that alter conditions based on state boundaries.
Figure 3.20: The Mercator Projection, left, and the Gall-Peters Equal Area
Projection, right
Geography can also be misleading in visualizing data about people.
Geographic maps can show us where people live, but if we are interested in,
say, frequencies of person-level data per US state, the relative sizes of states
can obscure the relative populations of states.
In 1869, Charles Minard, a French Civil Engineer and pioneer in the field of
data visualization, published a visual summary of the 1812 French invasion of
and retreat from Russia that combined elements of time-series charts, alluvial
diagrams, and choropleth maps. I’ll let Tufte describe it:
The last category of data visualization tools we will cover on this page is
small multiples. Small multiples are arrangements of small versions of the
same visualization that vary by levels of a variable or variables. The idea is
that the reader can view the entire arrangement of figures at once in order to
easily make comparisons. Figure 3.23 is an example of the use of small
multiples: each of the 12 miniature figures is a histogram of random values
drawn from the same class of frequency distribution (namely, the χ2
distribution), each with a different value of the parameter known as
degrees of freedom, abbreviated df. In this case, we can easily follow how
the distribution migrates to the right and becomes decreasingly skewed as the
df increases thanks to the arrangement of the small multiples of histograms.
ggplot(small.multiple.df, aes(Values))+
geom_histogram()+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(title=bquote(Random~Samples~From~Various~chi^2~Distribu
facet_wrap(~df, labeller = label_bquote(cols=italic(df)==.(d
2. Removing the legend also gives you more space for the data, which is
usually desirable in and of itself.↩
3. Sans serif fonts are the data-ink minimizers of the typeface world↩
4. post hoc ergo propter hoc, roughly translated, means “after the thing,
therefore because of the thing”↩
5. The bars in a forestplot are a little like error bars, but…well, interval
estimates are a whole thing and we’ll talk about them later↩
6. Alluvial refers to a river delta, which such plots can resemble. Sankey is
an engineer who used plots to visualize the movement of fluids. The
plots also look like rivers or ribbons to people.↩
All of those claims are equally correct. The five of hearts – which again, is your card – is either red or black, is red,
is a heart, and is the five of hearts. But as we proceed from claim 1 down to claim 4, the statements probably seem
more convincing to you. The reason we may be convinced by, say, claim 4 but not claim 1 is that each claim differs
not in whether it is correct but in how bold it is. The bolder claims offer more details; with
more details, the probability of being correct by guessing goes down. Let’s evaluate the probability that I could
correctly make each claim if I didn’t have the power to read minds:
All playing cards are either red (hearts or diamonds) or black (clubs or spades). So this claim has to be true. The
probability of getting this right is 100%. This one is dumb. Let’s move on.
One-fourth of the playing cards in a deck are hearts, so the probability of getting this right by guessing randomly is
25%. That might make you say something like, “hey, good guess,” but it’s probably not enough to make you think that
I have mutant powers. Next!
There’s only one five of hearts in the deck. Given that there are 52 cards in a standard deck, the probability of
randomly guessing any individual card and matching the one that was drawn is therefore 1/52: a little less than 2%.
Since there are 51 cards in the deck that are not the five of hearts, the probability that I am wrong is 51/52: a little
more than 98%. In other words, it is 51 times more likely that I would guess incorrectly by naming a single, specific
card than that I would guess correctly. That’s fairly impressive – at this point one might be wondering if it’s a magic
trick or I am otherwise cheating or possibly that I might have an extra sense, but either way one might start ruling out
the possibility that I am guessing purely at random.
Now let’s say that I have made claim 4 – that you have drawn the five of hearts – and that I am correct. You, still the
scientist in this scenario, skeptical though you may be, decide to share your assessment of my unusual (and, to be
clear, fictional) talent with the world. You write up the results of this card-selecting-and-guessing experiment and
have it published in an esteemed scientific journal. In that article, you say that I have the ability to read minds, or, at
least, the part of minds that stores information about recently-drawn playing cards. But, you include the eminently reasonable caveat that you could be wrong about that; it may be a false alarm. In fact, you know that if I were a fraud and were merely guessing, then given the assumptions that the deck was standard and every card was equally likely to be drawn, the likelihood of me being right was 1/52.
On the other hand, there is no way to be 100% certain that I have the ability to read minds. Maybe you’re not
satisfied with the terms of our little experiment and you think they would lead to false alarms too often for your
liking. So, maybe we do the same thing with two decks of cards. The probability of me naming your card from one
deck and repeating with another deck is much smaller than performing the feat with just one deck (we'll learn about this later on, but it's (1/52) × (1/52) = 1/2,704, or about 0.04%). As we add more decks to the experiment, the probability of correctly guessing one card from each one approaches zero – but never equals zero.
There is one other implication to this thought experiment that I would like to point out. Often it is not practical,
feasible, or possible to replicate our scientific studies. Let’s say we only have one shot at our mind-reading
experiment: we only have one deck of cards, and for some reason, we can only use it once (so, no replacing the card,
reshuffling, and trying again). If we would not consider identifying a single card correctly as sufficient evidence of
telepathy – and, to be sure, a 1/52 chance is by no means out of the realm of possibility – there is a way to
incorporate our skepticism into the evaluation of the probability of true telepathy given the observation of a correct card identification. We'll talk about this in the section on Bayesian Inference.
As you (back to real you, no longer the hypothetical you of our fantastical scenario) may have inferred, this thought
experiment is meant to serve as an extended metaphor for probability, statistics, and scientific inquiry. We are nearly
never certain about the results of our studies: there is always some probability involved, be it a probability that our
hypotheses are correct and supported by the data, a probability that our hypotheses are correct but not supported by
the data, a probability that our hypotheses are incorrect and not supported by the data, and/or a probability that our
hypotheses are incorrect but supported by the data anyway.
Statistics are our way to assess those probabilities. Not every investigation is based on knowable and calculable
probabilities like our example of guessing cards – in fact, almost all of them are not. There are many, many statistical
procedures precisely because most scientific studies are not associated with relatively simple probability structures
like there is a one-out-of-fifty-two chance that the response is just a correct guess.
So: first, we are going to learn the rules of probability and we are going to use a lot of examples like decks of cards
and marbles and jars and coin flips (SO MANY COIN FLIPS) to help us. As we proceed on to learning more
statistical procedures, the fundamental goal will be the same: understanding the probability that the thing that we are
observing is real.
p(A) ∈ R; p(A) ≥ 0
In other words, a probability value can’t be negative. I might say that there is a −99 chance that I am going to go to a
voluntary staff meeting at 6 am, but that is just hyperbole and is not physically possible (the probability is actually
0). Imaginary numbers are impossible for probabilities, too, so if at any point in this semester you find yourself
answering a probability problem with p = √−0.87, please check your work.
4.2.2 2. Normalization
So, as long as we account for every possible outcome, and no two of those outcomes can happen at the same time, the sum of the probabilities of all of the events is one.
∑Ω = 1
4.2.3 3. Finite Additivity
The probability of the occurrence of either of two mutually exclusive events3 is equal to the sum of the probabilities of each event:
p(A ∪ B) = p(A) + p(B)
and since the principle extends to more than two mutually exclusive events, we could also extend the above equation to p(A ∪ B ∪ C), and p(A ∪ B ∪ C ∪ D), etc.
This axiom allows us to add the probabilities of mutually exclusive events, because the probability of their union is the sum of their individual probabilities: for example, a coin can't land heads and tails on the same flip, so the probability of heads or tails is 0.5 + 0.5 = 1.
Relative frequency theory bases the probability of something happening on the number of times it has happened in the past divided by the number of times it could have happened. For example, imagine that you had never seen or heard of a fair coin before, let alone thought to flip one, and did not know that there were two equally probable outcomes to a coin flip. Without being able to apply equal-probability (classical) theory, you could come to understand the probabilities of heads and tails by flipping the coin (presumably after being shown how to do so) many times. After enough flips, the relative frequencies of heads and tails would converge to roughly equal proportions.
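If it helps to see relative frequency theory in action, here is a small R sketch (the object names, the seed, and the number of flips are my own illustrative choices, not anything from the text) that flips a simulated fair coin many times and tracks the running proportion of heads:

set.seed(207)                                 # arbitrary seed for reproducibility
flips <- sample(c("Heads", "Tails"), size = 10000, replace = TRUE)
running.prop <- cumsum(flips == "Heads") / seq_along(flips)
running.prop[c(10, 100, 1000, 10000)]         # proportion of heads after 10, 100, 1000, and 10000 flips

The proportions wobble quite a bit early on but settle very close to 0.5 as the number of flips grows, which is exactly the relative-frequency story.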
A subjective probability is a statement of the degree of belief in the probability of events. Any time we put a number
to the chances of an event in our personal lives – the probability that we remembered to lock the door when we left
home, the probability that we will finish a paper by Friday, the probability that we will feel less-than-perfect after a
night of revelry – we are making a statement of subjective probability.
Subjective probability is not based on wild guesses – when we make subjective probability statements about our
own lives they tend to be based on some degree of self-awareness – and in science, subjective probability judgments
tend to incorporate data and some objective measures of probability as well. Meteorologists are a classic example:
a statement like there is a 40% chance of rain is based on a subjective interpretation of weather models that
incorporate terabytes of data on things like climate, atmospheric conditions, prior weather events, etc. A statistician
who uses subjective methods may use their judgment in choosing probabilistic models and starting points, but
ultimately makes their decisions based on objective data.
4.4.1 Intersections
In probability theory (as well as in formal logic and other related fields) the concept of co-occurring events, e.g., event A and event B happening, is known as the intersection4 of those events and is represented by the "cap" symbol ∩. The probability of event A and event B happening is thus equivalently expressed as the intersection probability of A and B, written p(A ∩ B), and is given by:
p(A ∩ B) = p(A)p(B|A)
In that equation, we have introduced a new symbol: |, meaning given (see sidebar for a collection of symbols for
expressing probability-related concepts)5. The full equation can be expressed as the probability of A and B is
equal to the product of the probability of A and the probability of B given A. In the context of probability theory,
given means given that something else has happened. Generally speaking, a probability that depends on a given
event is a conditional probability6. We will discuss conditional probability at length in the section not-
coincidentally named Conditional Probability.
To illustrate the intersectional probability of dependent events: please imagine a jar with two marbles in it – one
blue marble and one orange marble – and to make the imagining easier, here is a picture of one blue marble and one
orange marble both floating mysteriously in a poorly-rendered jar:
Let’s say that we propose some defined experiments7 regarding pulling a marble out of the jar (without peeking into
the jar, of course). The first is what is the probability of drawing a blue marble out of the jar? Since there are two
marbles in the jar, and, assuming that one draws one of them (I suppose one could reach in and miss both marbles
and come out empty-handed, but let’s ignore that possibility for the purposes of this exercise), we may intuit that the
probability of drawing one specific marble is 1/2, or 0.5, or 50%.8 More formally, we can define the sample space
as:
Ω = {Blue, Orange}
since the probability of the sample space is 1 (thanks to the axiom of normalization) and the probability of either of
these mutually exclusive events is the sum of the probabilities of each individual event (thanks to the axiom of finite
additivity):
p(Blue ∪ Orange) = p(Blue) + p(Orange) = 1
Assuming equal probability of each event (which is reasonable if we posit that there is no way to distinguish one marble from the other just by reaching into the jar without looking), then algebra so simple that we will leave out the steps tells us that:
p(Blue) = p(Orange) = 1/2
Thus, in this defined experiment, the probability of drawing a blue marble out of the jar in a single draw is 1/2.
Let’s define another experiment: we reach into the jar twice and pull out a single marble each time. Now, we have a
defined experiment with two trials.9 What is the probability of drawing the blue marble twice? That probability
depends on one important consideration: do we put the marble from the first draw back into the jar before drawing
again? If the answer is yes, then the two trials are the same:
Figure 4.3: Sampling with Replacement
If we put whatever marble we draw on the first trial back into the jar before the second trial, then – in more formal terms – we sample with replacement10 from the jar. That means that whatever event is observed in the first trial – whichever marble was chosen, in this example – has no effect on the probabilities of each event in the second trial – whichever marble is chosen second, in this example. If we put the first marble back in the jar, then the probability of choosing a blue marble on the second trial is the same as it was on the first trial and it does not matter which marble was selected in the first trial. The events drawing a marble in Trial 1 and drawing a marble in Trial 2 are therefore independent events.11 By definition, if events A and B are independent, then the probability of event A happening
given that event B has occurred is exactly the same as the probability of event A happening – since they are
independent, the fact that B may or may not have happened doesn’t matter at all for A and vice versa:
A and B are independent if and only if p(A|B) = p(A) and p(B|A) = p(B)
Now, we can modify the equation for determining the probability of A ∩ B for independent events:12
p(A ∩ B) = p(A)p(B)
And back to our marble example: if we are sampling with replacement, then we can call blue on the first draw event
A and blue on the second draw event B, the probability of each is 1/2, and:
p(Blue trial 1 ∩ Blue trial 2) = p(Blue trial 1)p(Blue trial 2) = (1/2)(1/2) = 1/4.
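If you would like to check that 1/4 figure empirically, here is a quick simulation sketch in R (the jar object and the number of repetitions are just illustrative choices of mine): draw twice with replacement from the two-marble jar many times and count how often both draws are blue.

set.seed(207)
jar <- c("Blue", "Orange")
n.experiments <- 100000
draw1 <- sample(jar, n.experiments, replace = TRUE)   # first draw
draw2 <- sample(jar, n.experiments, replace = TRUE)   # second draw (the first marble went back in)
mean(draw1 == "Blue" & draw2 == "Blue")               # should land very close to 0.25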
Now let’s consider the other possible sampling method: what if we don’t put the first marble back in the jar? In that
case, we have sampled without replacement13, and the probabilities associated with events in the second trial
depend on what happens in the first trial:
Figure 4.4: Sampling without Replacement
In this case, the probabilities of the events in Trial 2 are different given what happened in Trial 1 – they depend on
what happened on Trial 1 – and thus are considered dependent events.14 In the case of dependent events,
p(A|B) ≠ p(A) and p(B|A) ≠ p(B). So, based on the observed event in Trial 1, there are two possible sample
spaces for Trial 2. If the blue marble is drawn on Trial 1 and not replaced, then on Trial 2:
Ω = {Orange}
and if the orange marble is drawn in Trial 1 and not replaced, then on Trial 2:15
Ω = {Blue}
So: what is the probability of drawing the blue marble twice without replacement? We can know – and Figure 4.4
can help illustrate – that it is impossible: if we take out the blue marble in Trial 1 and don’t put it back in, there is no
way to draw a blue marble in Trial 2. The probability is 0. The math backs this up: the second draw depends on the
result of the first draw, and the probability of drawing a blue marble in Trial 2 given that a blue marble was drawn in Trial 1 without replacement is 0. Using the formula for and probabilities16:
p(Blue trial 1 ∩ Blue trial 2) = p(Blue trial 1)p(Blue trial 2|Blue trial 1) = (1/2)(0) = 0.
The logic of intersection probability for two events holds for the intersection probabilities of more than two events, although the equations we use to evaluate them look more complex. The intersection probability of two events is given by the product of the probability of the first event and the probability of the second event given the first event. The intersection probability of three events is given by the product of the probability of the first event, the probability of the second event given the first event, and the probability of the third event given the first event and the second event:
p(A ∩ B ∩ C) = p(A)p(B|A)p(C|A ∩ B)
The intersection probability of four events is given by the product of the probability of the first event, the probability of the second event given the first event, the probability of the third event given the first event and the second event, and the probability of the fourth event given the first event and the second event and the third event:
p(A ∩ B ∩ C ∩ D) = p(A)p(B|A)p(C|A ∩ B)p(D|A ∩ B ∩ C)
To help us understand the intersection of more than two events, first let’s look at a case where all of the events are
independent of each other. In this example there are four events: A, B, C , and D. Because all of the events are
independent of each other, then by the definition of independent events: p(B|A) = p(B), p(C|A ∩ B) = p(C), and
p(D|A ∩ B ∩ C) = p(D). Thus, the equation for the intersection probability of these events simplifies to:17
p(A ∩ B ∩ C ∩ D) = p(A)p(B)p(C)p(D)
This example is a personal favorite of mine, because I think it links a somewhat complex intersection probability
with a more intuitive understanding of probability as x chances in y. It’s about gambling.18 Specifically, it’s about
one of the lottery games run by the Commonwealth of Massachusetts: The Numbers Game. In that game, there are
four spinning wheels (pictured in Figure 4.5), each with 10 slots representing each digit from 0 – 9 one time, and
there is a ball placed in each wheel. The wheels spin for a period of time, and when they stop spinning, the ball in
each wheel comes to rest on one of the digits.
Figure 4.5: The Massachusetts Lottery Numbers Game, known in Boston-area locales as THE NUMBAH
The result is a four-digit number, and to win the jackpot, one has to pick all four digits in the correct order.19 Each
wheel is equally likely to land on each of the 10 digits (0 – 9), and each wheel spins independently so that the
outcome on each wheel is literally independent of the outcomes on any of the other wheels. Thus, the probability of
picking the correct four digits in order is the intersection probability of picking each digit correctly. In other
words: it’s the probability of picking the first digit correctly and picking the second digit correctly and picking the
third digit correctly and picking the fourth digit correctly:
p(jackpot) = p(1st digit ∩ 2nd digit ∩ 3rd digit ∩ 4th digit)
Since we know that each digit is an independent event, we need not concern ourselves with conditional probabilities: the probability of correctly picking any digit is 1/10, and it is exactly the same regardless of any of the other digits (symbolically: p(2nd digit) = p(2nd digit|1st digit), p(3rd digit) = p(3rd digit|1st digit ∩ 2nd digit), and p(4th digit) = p(4th digit|1st digit ∩ 2nd digit ∩ 3rd digit)). So, the probability of the jackpot is:
p(jackpot) = p(1st digit)p(2nd digit)p(3rd digit)p(4th digit) = (1/10)(1/10)(1/10)(1/10) = 1/10,000
Thus, the probability of picking the winning series of numbers is 1/10,000: if you bought a ticket for the numbers 8334, then there's a 1/10,000 probability that that set of numbers comes up; if you bought a ticket for any other four-digit combination, the probability would be exactly the same.
Another fun feature of using The Numbers Game as an example is that since each of the wheels is the same, spinning
all of the wheels at the same time gives the same expected outcomes that spinning one of the wheels four times would
do – it would just take longer to spin one four times – so effectively it’s an example of sampling with replacement.
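If you would rather see that 1/10,000 come out of R than out of algebra, here is a minimal sketch (the variable names are mine):

p.digit <- 1/10                          # probability of matching any single wheel
p.digit^4                                # four independent wheels: 1e-04, i.e., 1/10,000

# equivalently, list every possible four-digit outcome; exactly one matches a given ticket
nrow(expand.grid(0:9, 0:9, 0:9, 0:9))    # 10000 equally likely outcomes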
Now let’s examine how intersections of more than two events work when events are dependent with an example of
sampling without replacement. For this example, we’ll talk about playing cards again. Let’s say that you are dealt
four (and only four) cards from a well-shuffled deck of ordinary playing cards.21 What is the probability that you
are dealt four aces? Please note: since you are being dealt these cards, they are not going back into the deck: this is
sampling without replacement.
There are four aces in a deck of 52 playing cards, so the probability of being dealt an ace on the first draw is 4/52.
If you are dealt an ace on the first draw (if you aren’t dealt an ace on the first draw, the probability of getting four
aces in four cards is zero so that doesn’t matter), then there will be three aces left in a deck of 51 cards, so the
probability of being dealt an ace on the second draw will be 3/51. If you are dealt aces on each of the first two
draws, then there will be two aces left in a deck of 50 cards, so the probability of being dealt a third ace will be
2/50. Finally, if you are lucky enough to be dealt aces on each of the first three draws, then there will be one ace
left in a deck of 49 cards, so the probability of being dealt a fourth ace will be 1/49. We can express that whole last
paragraph in math-symbol terms like:
p(Ace first) = 4/52
p(Ace second|Ace first) = 3/51
p(Ace third|Ace first ∩ Ace second) = 2/50
p(Ace fourth|Ace first ∩ Ace second ∩ Ace third) = 1/49
Being dealt four aces out of four cards is equivalent to saying being dealt an ace on the first draw and being dealt
an ace on the second draw and being dealt an ace on the third draw and being dealt an ace on the fourth draw:
it’s the intersection probability of those four related events. Using the equation
p(A ∩ B ∩ C ∩ D) = p(A)p(B|A)p(C|A ∩ B)p(D|A ∩ B ∩ C) and substituting the probabilities of each Ace-drawing event outlined above, the probability of four consecutive aces from a well-shuffled 52-card deck is:
p(4 Aces) = (4/52)(3/51)(2/50)(1/49) = 24/6,497,400 ≈ 0.00000369
which is a very small number. Drawing four consecutive aces is not very likely to happen. And yet: it’s exactly as
likely as drawing four consecutive jacks or four consecutive 9's, and 24 times more likely than drawing any four specific cards in a specific order (like the queen of spades, then the five of hearts, then the 10 of clubs, then the 2 of diamonds).22
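Here is a small R sketch of the four-aces calculation, plus a brute-force simulation (the deck representation and the number of simulated deals are my own choices, and with a probability this small even a million deals will only produce a handful of successes):

(4/52) * (3/51) * (2/50) * (1/49)                      # about 3.69e-06, i.e., 24/6,497,400

set.seed(207)
deck <- rep(c("Ace", "Other"), times = c(4, 48))       # 4 aces, 48 other cards
hits <- replicate(1e6, all(sample(deck, 4) == "Ace"))  # deal 4 cards without replacement, a million times
mean(hits)                                             # a tiny proportion, in the neighborhood of 3.69e-06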
I noted above that keeping track of which events may depend on others is important (far more important than which
event you call A and which event you call B). The examples of the lottery game (independent events) and of dealing
aces (dependent events) are relatively simple ones. It can get extremely complicated to keep track of not only what
depends on what but also the ways in which those dependencies change sample spaces – that is: what the probability
of one event is given other events. Probability trees are visualizations of relationships between events that I find
very helpful for both keeping track of things and for calculating complex probabilities, and we will get to those, but
first, we need to talk about unions.
4.4.2 Unions
The occurrence of at least one of a set of events, e.g., event A or event B happening, is known as the union23 of those events. The probability of event A or event B happening is thus equivalently expressed as the union probability of A and B, as p(A) ∪ p(B),
and as p(A ∪ B).
We know already from the finite additivity axiom of probability that the total probability of mutually exclusive
events is the sum of the probabilities of each of the events. Now that we are more familiar with terminology and symbology, we can write that axiom as:
p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
where the term p(A ∩ B) = 0 represents the condition of A and B being mutually exclusive (and thus the
probability of A and B both happening being 0).
You may be wondering, "what happens if the events are not mutually exclusive?" and even if you are not wondering that, I will tell you anyway. Let's say we are interested in the combined probability of two events A and B that can co-occur – for example, the probability that, in two flips of a fair coin, the coin will turn up heads at least once. In that example, the coin could turn up heads on the first flip, turn up heads on the second flip, or turn up heads on both flips. If we incorrectly treated the events Heads flip 1 and Heads flip 2 as mutually exclusive events, we would incorrectly conclude that the probability of observing heads on either of two flips is p(Heads flip 1) + p(Heads flip 2) = 0.5 + 0.5 = 1. That conclusion, in addition to being wrong, doesn't make sense: it obviously is possible to observe 0 heads in two flips of a fair coin. If we want to get even more absurd about it, we can imagine three flips of a fair coin: surely the probability of getting heads on at least one of three flips is not p(Heads flip 1) + p(Heads flip 2) + p(Heads flip 3), because that would be 0.5 + 0.5 + 0.5 = 1.5 and thus violate the axioms of probability (no probability can be greater than 1).
What is causing that madness of seemingly overinflated union probabilities? The union probability of A and B can be thought of as the probability of A or B, but it can also be considered the total probability of A and B, which means that it's the probability of A happening without B, plus the probability of B happening without A, plus the probability of A and B happening together. When we add the probabilities of events that are not mutually exclusive, we are effectively double-counting the probability of A ∩ B. Consider the following problem: two people – Alice and Bambadjan (conveniently for the math notation we have been using, we can call them A and B) – have consistent and different sleep schedules. Alice goes to sleep every night at 10 pm and wakes up every morning at 6 am. Bambadjan goes to sleep every night at 4 am and wakes up every day at 12 pm. During any given 24-hour period, what is the probability that Alice is asleep or Bambadjan is asleep?
The probability of the occurrence of either of two events – whether the events are mutually exclusive or not – is given by the equation:
p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
Let's visualize what this equation means by representing the sleep schedules of Alice (A) and Bambadjan (B) as
timelines:
Figure 4.6: The Sleep Schedules of Alice and Bambadjan
Hopefully, the figure makes it easy to see that if we are interested in the times that either Alice or Bambadjan is
asleep, then the time when their sleep overlaps should only be counted once. More than that: that overlap will mess up our
calculations. If we add up the times that A. Alice is asleep, B. Bambadjan is asleep, and C. both Alice and
Bambadjan are asleep, we get
8 hours + 8 hours + 2 hours = 18 hours.
If you add that result to the 10 hours per day when neither Alice nor Bambadjan is asleep (between 12 pm and 10
pm), you get a 28-hour day, which is not possible on Planet Earth.
Figure 4.7: only 24 hours per day here
So, we have to account for the double-counting of the time when both are asleep – the intersection of event A (Alice being asleep) and event B (Bambadjan being asleep). Using the above equation for union probabilities of non-mutually-exclusive events:
p(A ∪ B) = p(A) + p(B) − p(A ∩ B) = 8/24 + 8/24 − 2/24 = 14/24
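Because the sleep schedules carve the day into whole hours, we can also verify that union calculation directly in R. This is just a sketch (the hour-by-hour representation and names are mine):

hours <- 0:23                                  # label each hour of the day by its starting hour
alice.asleep <- hours >= 22 | hours < 6        # 10 pm to 6 am
bamba.asleep <- hours >= 4 & hours < 12        # 4 am to 12 pm

p.A  <- mean(alice.asleep)                     # 8/24
p.B  <- mean(bamba.asleep)                     # 8/24
p.AB <- mean(alice.asleep & bamba.asleep)      # 2/24

p.A + p.B - p.AB                               # 14/24, about 0.583
mean(alice.asleep | bamba.asleep)              # the union counted directly: same answer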
When two events are mutually exclusive, calculating their union probability is simple: since there is no intersection
probability of the two events, the intersection term of the equation drops out and we are left with
p(A ∪ B) = p(A) + p(B). That simplicity holds when calculating the union probability of more than two mutually
exclusive events:
p(A ∪ B ∪ C) = p(A) + p(B) + p(C) (and so on for four or more mutually exclusive events).
For example, let’s consider the probability of a person’s birthday falling on one day or another. Assuming that the
probability of having a birthday on any one given day of the week is 1/7,24 what is the probability that a person’s
birthday falls on a Monday or on a Wednesday or on a Friday? These three events (1. birthday on Monday, 2.
birthday on Wednesday, 3. birthday on Friday) are mutually exclusive, so, the calculation for that probability is
straightforward:
p(Monday ∪ Wednesday ∪ Friday) = 1/7 + 1/7 + 1/7 = 3/7
And, of course, the probability that a person’s birthday falls on any day of the week is
1/7 + 1/7 + 1/7 + 1/7 + 1/7 + 1/7 + 1/7 = 1.
As you might imagine, the union probabilities of more than two events when some or all are not mutually exclusive is
a bit more difficult. Here is a diagram representing the union probability of two intersecting events; it is similar to
the one we used to examine the union probability of Alice and Bambadjan being asleep, but a little more generic:
In Figure 4.9, we add a third intersecting event C to intersecting events A and B. The ranges of outcomes for the three events separately are represented in part A, and the four intersections between the events – A ∩ B, A ∩ C, B ∩ C, and A ∩ B ∩ C – are shown in Figure 4.9, part B.
In Figure 4.9, part C, we see what happens when we add up the events and subtract the pairwise intersections A ∩ B
, B ∩ C , and A ∩ C : we end up under-counting the three-way intersection A ∩ B ∩ C . That happens because the
sum of A and B and C has two areas of overlap in the A ∩ B ∩ C section (that is, in that area, you only need one of
the three but you have three, so, two are extra) and when we subtract the three pairwise intersections from the sum –
which we have to do to account for the double-counting between pairs of events – we take three away from that
section, leaving it empty. Thus, we have to add the three-way intersection (the blue line in part C) back in to get the
whole union probability shown in part D.
Thus, the general equation for the union probability of three events is:
p(A ∪ B ∪ C) = p(A) + p(B) + p(C) − p(A ∩ B) − p(A ∩ C) − p(B ∩ C) + p(A ∩ B ∩ C)
and25 for any combinations of those events that are mutually exclusive, the intersection terms for those combinations are 0 and drop out of the equation.
Just like knowing which events depend on which other events and how in calculating intersection probabilities can
be tricky, keeping track of which events intersect with each other and how in calculating union probabilities can also
be tricky, and keeping track of all of those things at the same time can be multiplicatively tricky. In the next section,
we will talk about a visual aid that can help with all of that trickiness.
The expected value of a defined experiment is the long-term average outcome of that experiment over multiple
iterations. It is calculated as the mean outcome weighted by the probability of the possible outcomes. If xᵢ is the value of each event i among N possible outcomes in the sample space, then:
E(x) = ∑ᵢ₌₁ᴺ xᵢ p(xᵢ)
For example, consider a roll of a six-sided die. The sample space for such a roll is Ω = {1, 2, 3, 4, 5, 6}: thus, x₁ = 1, x₂ = 2, x₃ = 3, x₄ = 4, x₅ = 5, and x₆ = 6. The probability of each event is 1/6, so p(x₁) = p(x₂) = p(x₃) = p(x₄) = p(x₅) = p(x₆) = 1/6. The expected value of a roll of a six-sided die is therefore:
E(x) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 21/6 = 3.5.
Note that you can never roll a 3.5 with a six-sided die! But, in the long run, that’s on average what you can expect.
Next, let's consider a roll of two six-sided dice. Each die has a sample space of Ω = {1, 2, 3, 4, 5, 6}, so the sample
space of the two dice combined is:
             Die 2
Die 1    1    2    3    4    5    6
  1      2    3    4    5    6    7
  2      3    4    5    6    7    8
  3      4    5    6    7    8    9
  4      5    6    7    8    9   10
  5      6    7    8    9   10   11
  6      7    8    9   10   11   12
The expected value of a roll of two six-sided dice is the sum of the values in the table (it’s 252) divided by the total
number of possible outcomes (36), or 7.
We could also arrive at the same number by defining the sample space as Ω = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} and noting that there are more ways to arrive at some of those values than others. The number 7, for example, can happen 6 different ways, and so p(7) = 6/36 = 1/6.26 Multiplying each possible sum of the two dice by its respective probability gives us:
E(x) = 2(1/36) + 3(2/36) + ... + 12(1/36) = 252/36 = 7.
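Both expected-value calculations are easy to reproduce in R. This sketch (object names mine) does the one-die version as a weighted sum and the two-dice version from the full 36-cell table:

sum(1:6 * (1/6))                  # one die: 3.5

two.dice <- outer(1:6, 1:6, "+")  # 6-by-6 table of every possible sum of two dice
sum(two.dice)                     # 252
mean(two.dice)                    # 252/36 = 7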
Probability Trees, also known as Tree Diagrams, are visual representations of the structure of probabilistic events.
There are three elements to a probability tree:27
1. Node: a starting point (a trial) from which branches extend to the possible events.
2. Branch: the connection between nodes and events; we indicate on branches the probability of the events for a
given trial.
3. Event (or Outcome): an event (or outcome). Given that we’ve already defined event (or outcome), this
definition is pretty straightforward and a little anticlimactic.
Figure 4.10 is a probability tree for a single flip of a fair coin with labels for the one node, the two branches, and the
two possible outcomes of this trial.
We can think of the node as a starting point for the probabilistic paths that lead to events, with the probability of
taking each path written right on the branches. In the case of a coin flip, once the coin is flipped (the node), we can
see from the diagram that there is a probability of 0.5 that we will take the path to heads and a probability of 0.5 that
we will take the path that ends in tails. Using a tree diagram to map out the probabilities associated with a single
coin flip is perfectly legitimate, but not terribly necessary – it’s a pretty simple example. Probability trees can be –
and usually are – more complex (but tend to be less complex than algebraic representations of the same problems,
otherwise, that would defeat the point of using a tree to help us understand the math). A node can take on any number
of branches, and each event can, in turn, serve as a node for other branches and events. For example, in Figure 4.11,
we have a probability tree representation of two flips of a fair coin.
Connected branches represent intersections. In Figure 4.11, the path that goes from the first node to Heads and then continues on to Heads again represents the probability p(Heads flip 1 ∩ Heads flip 2). Sums of the probabilities of events are unions. The event of getting exactly one heads and one tails in two flips is represented in two ways in Figure 4.11: one could get heads first and then tails or get tails first and then heads. We can also see from the diagram that these events are mutually exclusive (on neither flip can you get heads and tails), so, the probability of one heads and one tails in two flips is the sum of the intersection probabilities p(Heads flip 1 ∩ Tails flip 2) and p(Tails flip 1 ∩ Heads flip 2):
p(1 Heads, 1 Tails) = p([Heads flip 1 ∩ Tails flip 2] ∪ [Tails flip 1 ∩ Heads flip 2]) = (0.5)(0.5) + (0.5)(0.5) = 0.5.
In general, then, the rule for working with probability trees is multiply across, add down.
The probabilities on the branches of tree diagrams can take on any possible probability value and therefore can be
adjusted to reflect conditional probabilities. For example: let’s say there is a jar with 3 marbles in it: 1 red marble,
1 yellow marble, and 1 green marble. If, without looking into the jar, we took one marble out at a time without
replacement, then the probability of drawing each marble on each draw would be represented by the tree diagram in
Figure 4.12.
Figure 4.12: Probability Tree Depicting Sampling Without Replacement
We can use the diagram in Figure 4.12 to answer any probability questions we have about pulling marbles out of the
jar. What is the probability of pulling out marbles in the order Red, Yellow, Green? We can follow the topmost path
to find out: the probability of drawing Red first is 1/3. Given that we have drawn Red first, the probability of
drawing Yellow next is 1/2. Finally, given that Red and Yellow have been drawn, the probability of drawing Green
on the third draw is 1. Multiplying across the path (as we do), the probability of drawing marbles in the order Red,
Yellow, Green is (1/3)(1/2)(1) = 1/6. What is the probability of drawing Yellow last? That is given by adding
the products of the two paths that end in Yellow: adding down (as we do), the probability is
(1/3)(1/2)(1) + (1/3)(1/2)(1) = 2/6 = 1/3. We can use the tree to examine probabilities before the
terminal events as well. For example: what is the probability of drawing Green on the first or second draw? We can
take the probability of Green on draw 1 – 1/3 – and add the probabilities of drawing Red on the first draw and
Green on the second draw – (1/3)(1/2) = 1/6 and of drawing Yellow first and Green second – also
(1/3)(1/2) = 1/6 and we get (1/3) + (1/6) + (1/6) = 4/6 = 2/3.
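Those tree-based answers can also be checked by simulation. Here is a sketch in R (the seed and number of repetitions are arbitrary choices of mine) that repeatedly draws all three marbles without replacement and tallies the relevant outcomes:

set.seed(207)
draws <- replicate(100000, sample(c("Red", "Yellow", "Green"), size = 3))  # each column is one emptying of the jar

mean(draws[1, ] == "Red" & draws[2, ] == "Yellow")    # Red, then Yellow, then Green: about 1/6
mean(draws[3, ] == "Yellow")                          # Yellow drawn last: about 1/3
mean(draws[1, ] == "Green" | draws[2, ] == "Green")   # Green on the first or second draw: about 2/3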
Jean d’Alembert (1717 – 1783) was a French philosophe and mathematician. He was a co-editor (with Denis
Diderot) of the Encyclopedia and contributed to it over a thousand articles, one of which was Heads or Tails, an
essay about games of chance. In that article, he gives a sophisticated and ahead-of-its-time analysis of The
St. Petersburg Paradox, one of the critical probability problems in the field of behavioral economics. However, that
article isn’t known for that at all. It’s known for one of the most famous errors in probabilistic reasoning ever made
by a big-deal mathematician in print.
Here’s why d’Alembert gets d’unked on. Say we’re playing a coin-flipping game (imagine it’s mid-18th-century
France and this is the kind of thing we do for fun) where I flip a coin twice and you win if it comes up heads on
either flip. d’Alembert argued that if you got heads on the first flip then the game would stop there, thus, there were
three possible outcomes of the game: heads on the first flip (and no second flip), tails on the first flip and heads on
the second flip, and tails on the first flip and tails on the second flip. In sample-space terms, d’Alembert was
describing the game as Ω = {H, TH, TT} and – here's the big problem – said that each of those outcomes was equally probable (p = 1/3 each). We can pretty easily show why that conclusion is wrong (and we can refer to Figure 4.11 to help): the probability of getting heads on the first flip is 1/2 regardless of what happens after that flip,28 and the probabilities of tails-then-heads and tails-then-tails are 1/4 each.
There are two things I would like to point out about this story:
1. I find it comforting to know that a historically great mathematician can make mistakes, too, and I hope you do as well, and
2. Without meaning to, d'Alembert's calculation conflated elementary events29 and composite (or compound) events.30
An elementary event is an event that can only happen in one specific way. If I were to ask what is the probability
that I (Dan, the person writing this right now) will win the gold medal in men’s singles figure skating at the next
Winter Olympics?, then I would be asking about the probability of an elementary event: there is only one me, there is only one Olympic sport, and there is only one next Winter Olympics. Also, that probability would be 0, but that's
beside the point (If it weren’t for the axiom of nonnegativity, the probability of me winning a skating competition
would be negative – I can barely stand up on skates). If I were to ask what is the probability that an American wins
the gold medal in men’s figure skating at the next Winter Olympics, I would be asking about the probability of a
composite event: there are multiple ways that could happen because there are multiple men in America, each of
whom represents a possible elementary event of winning the gold in that particular sport in that particular Olympics.
Back to our man d’Alembert and the problem of two coin flips: in two flips of a fair coin, we might say that there are
three possible outcomes:
Ω = {2H, 1H, 0H}
That’s all technically true, but misleading, because one of those outcomes is composite while the other two are
elementary. That is: there are two ways to get one heads in two flips: you can get heads then tails or you can get
tails then heads. There is only one way to get two heads – you must get heads then heads again – and there is only
one way to get zero heads – you must get tails then tails again – so there are twice as many ways to get one heads
than two get either two heads or zero heads.
Thus, a better, less misleading way to characterize the outcomes and the sample space of two coin flips is to do so in
terms of the four elementary outcomes:
Ω = {HH, HT, TH, TT}
The probability of each of those elementary events is equal,31 and by breaking down events into their elementary components, we avoid making the same mistake that d'Alembert did.
4.7.1 Permutations
Here’s an example to describe what we’re talking about with complex sample spaces: imagine five people are
standing in a line.
Figure 4.14: Pictured: Five People Standing in Line in a Stock Photo that I did not Care to Pay For.
How many different ways can those five people arrange themselves? Let’s start with the front of the line. There are
five options for who stands in the first position. One of those five goes to the front of the line, and now there are four
people left who could stand in the second position. One of those four goes behind the first person, and then there are
three left for the third position, then there will be two left for the fourth position, and finally there will be only one
person available for the end of the line.
Let’s call the people A, B, C , D, and E, which would be rude to do in real life but they’re fictional so they don’t
have feelings. For each of the five possible people to stand in the front of the line, there are four possibilities for
people to stand behind them:
That’s 5 × 4 = 20 possibilities of two people in the first two parts of the line. For each of those 20 possibilities,
there are three possible people that could go in the third position, so for the first three positions we have
5 × 4 × 3 = 60 possibilities. The pattern continues: for the first four positions we have 5 × 4 × 3 × 2 = 120
possibilities, and for each of those 120 possibilities there is only one person left to add at the end so we end up with
a total of 5 × 4 × 3 × 2 × 1 = 120 possibilities for the order of five people standing in a line.
In general, the number of possible orders of n things is, as in our example of five people standing in a line, n × (n − 1) × (n − 2) × ... × 1. That expression is equivalent to the factorial of n, symbolized as n!.32 In our
example, there were n = 5 people standing in line so there were 5! = 120 possible orders. If you had two items to
put on a shelf, there would be 2! = 2 ways to arrange them; if you had six items to put on a shelf, there would be
6! = 720 ways to arrange them.
Now let’s say that we had our same five people from above but only three could get in the line. How many ways
could three people selected out of the five people stand in order? The math starts out the same: there are five
possibilities for the first position, four possibilities for the second position for each of the five in the first, and three possibilities for the third position for each combination of the first two: 5 × 4 × 3 = 60 possibilities. What happens next? Nothing. There are no more spots in the line, so whoever is left out doesn't get a spot in the line (which
could be sad, but again: fictional people don’t have emotions so don’t feel too bad for them).
At this point – and this may be overdue – we can define the term permutation.33 Permutations are arrangements of things where the order of those things matters. The term includes orderings of all n objects, but also ordered arrangements of fewer than n of those objects.
Note what happened when we removed two possible positions from the line where five people were to stand. We
took the calculation of the number of possible orders – 5! = 5 × 4 × 3 × 2 × 1 = 120 – and we removed the last
two terms (because we ran out of room in the line). We didn’t subtract them, but rather, we canceled them out. To do
that mathematically, we use division:
60 = (5 × 4 × 3 × 2 × 1)/(2 × 1) = 5 × 4 × 3
60 = 5!/2!.
Let’s briefly look at another example: imagine you had 100 items and you had a small shelf with space for only two
of those items; how many different ways could that shelf look? For the first spot on the shelf, you would have 100
options. For the second spot, whatever choice you made for the first spot would leave 99 possibilities for the second
spot. And then you would be out of shelf space, so the total number of options would be 100 × 99 = 9,900. Another way to think of that is that from the 100! possible orders of all of your items, you canceled out the last 100 − 2 = 98 possibilities:
9,900 = 100 × 99 = 100!/98!
In general, then, we can say that the number of permutations of n things given that they will be put in groups of size r – or as we officially say it, n things Permuted r at a time, and as we symbolically write it, nPr – is:
nPr = n!/(n − r)!
That equation works just as well when r – the number of things being permuted at a time – is the same as n – the
number of things available to permute. In our original example, we had five people to arrange themselves in a line of
five. The number of permutations is given in that case by:
5P5 = 5!/(5 − 5)! = 5!/1 = 120
keeping in mind the fact noted in the sidenote above and in the bonus content below that the factorial of 0 is 1.
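Base R has factorial() but no built-in permutation counter, so here is a one-line helper (the function name perm is mine, not a standard R function) that implements nPr = n!/(n − r)!:

perm <- function(n, r) factorial(n) / factorial(n - r)   # n things permuted r at a time

perm(5, 5)     # 120: five people, five spots in line
perm(5, 3)     # 60:  five people, three spots in line
perm(100, 2)   # 9900: 100 items, shelf space for two (may print with a tiny floating-point wobble)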
4.7.2 Combinations
Permutations are arrangements of things for which the order matters. When things are arranged in ways where the order doesn't matter, those arrangements are called combinations.34 Above, we calculated the number of possible permutations for five people standing in a line: 5P5 = 5!/0! = 120. But what if order doesn't matter – what if it's
just five people, standing not in a line, but just scattered around? How many ways can you have a combination of
five people given that you started with five people? Just one. If their relative positions don’t matter, there’s only one
way to combine five people in a group of five. The number of combinations is always reduced by a factor of the
number of possible orders relative to the number of permutations. As mentioned in the previous section, the number
of orders is the factorial of the number of things in the group: r! . Thus, we multiply the permutation formula by 1/r! ,
putting r! in the denominator, to get the combination formula for n things Combined r at a time (also known as the
combinatorial formula):
nCr = n!/((n − r)! r!)
Thus, while there are 5!/0! = 120 possible permutations of five people grouped five at a time, there is just 5!/(0! 5!) = 1 possible combination of five people grouped five at a time.
Here’s another example: imagine that you are going on a trip and you have five books that you are meaning to read
but you can only bring three in your bag. How many combinations of three books can you bring?35 The order of the
books doesn’t matter, so we have five things combined three at a time:
5C3 = 5!/(2! 3!) = 10.
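R has the combination formula built in as choose(), so both of these results can be checked directly (nothing here beyond base R):

choose(5, 5)                                   # 1: one way to group five people five at a time
choose(5, 3)                                   # 10: ten possible sets of three books out of five
factorial(5) / (factorial(2) * factorial(3))   # the same 10, straight from the combinatorial formula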
The combination formula is going to be super-important later on when we talk about binomial probability, so you can go
ahead and start getting excited for that.
4.8 Odds
Odds36 are an expression of relative probability. We have three primary ways of expressing odds: odds in favor,
odds against, and odds ratios. The first two – odds in favor and odds against – are conceptually the same but,
annoyingly, are completely opposite.
Let’s assume that we have two possible events: event A and event B. The odds in favor of event A is expressed as
the numerator of the unsimplified ratio of the probabilities of A and B, followed by a colon, followed by the
denominator of the ratio of the probabilities of A and B. That probably sounds more complicated than it is – for
example, if A is twice as probable as B, then the odds in favor of A are 2 : 1. And, odds are almost always
expressed in terms of integers, so if A is 3.5 times as probable as B, then the odds in favor of A are 7 : 2.
Another way of thinking of odds in favor is that it is the number of times that A will happen in relation to the number
of times that B will happen. For example, if team A is three times as good as team B, then in the long run team A
would be expected to win three times for every one time that team B wins, and the odds in favor of A are 3 : 1.
Yet another way of thinking about it – and odds are a total gambling thing so this is likely the classic way of thinking
about it – is that the odds are related to the amount of money that each player should bet if they are wagering on two
outcomes. Imagine two gamblers are making a wager on a contest between team A and team B, where team A is
considered to be three times as good as team B. Each gambler will put a sum of money into a pot; if team A wins,
then the gambler who bet on team A to win takes all of the money in the pot, and if team B wins, then the gambler
who bet on team B to win takes all of the money in the pot. It would be unfair for both gamblers to risk the same
amount in order to win the same amount – with team A being three times as good, somebody who bets on team A is
taking on much less risk. Thus, in gambling situations, to bet on the better team costs more for the same reward and
to bet on the worse team costs less to win the same reward. A fairer setup is for the gambler betting on team A to
pay three times as much as the gambler betting on team B.
Figure 4.15: Making things even more annoying is that the phrase the odds being in your favor has nothing to do with
odds in favor of you – it means that the odds are favorable, that is, that you are likely to win.
Now, this is kind of stupid, but odds against are the exact opposite of odds in favor. The odds against are the ratio of the probability of an event not happening to the probability of the event happening. If the odds in favor of event
A are 3 : 1, then the odds against event A are 1 : 3.
Since we've been talking about gambling, I feel the need to point out here that gambling odds are expressed as odds against.
Odds and probabilities
If the odds in favor of an event are known, the probability of the event can be recovered. Recall the example of a
contest in which the odds in favor of team A are 3 : 1, and that team A would therefore be expected to win three
contests for every one that they lose. If team A wins three for every one they lose, then they would be expected to
win 3 out of every 4 contests, so the probability of team A winning would be 3/4.
Another way of expressing odds is in terms of an odds ratio. For a single event, the odds ratio is the odds in favor
expressed as a fraction:
OR for A vs. ¬A = p(A)/(1 − p(A))
When we have separate sample spaces, then odds ratios are the literal ratio of the odds in favor of each event (each expressed as a fraction). For example, imagine two groups of 100 students each that are given a statistics exam. The first group took a statistics course, and 90 out of 100 passed the exam. The second group did not take the statistics course, and 75 out of 100 did not pass. What is the odds ratio of passing the exam between the two groups?
The odds in favor of passing having taken the course, based on the observed results, are 9 : 1: nine people passed
for every one that failed. The odds in favor of passing having not taken the course are 1 : 3: 1 person passed for
every 3 who failed. The odds ratio is therefore:
OR = (9/1)/(1/3) = 27
Thus, the odds of passing the exam are 27 times greater for people who took the course.
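Worked out in R (a sketch; the object names are mine), the exam example looks like this:

odds.course    <- 90 / 10      # odds in favor of passing for those who took the course: 9
odds.no.course <- 25 / 75      # odds in favor of passing for those who did not: 1/3
odds.course / odds.no.course   # odds ratio: 27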
The probability of rolling a 5 given that the six-sided die is rolled is much different than the probability of rolling a 5 given that the twenty-sided die is rolled. In other words, the conditional probability of rolling a 5 with a six-sided die differs from the conditional probability of rolling a 5 with a twenty-sided die (the two conditional probabilities are 1/6 and 1/20, respectively).
Conditional probabilities are also called likelihoods37. Any time you see the word “likelihood” in a statistical
context, it is referring to a conditional probability.38
Sometimes, we can infer likelihoods based on the structure of a problem, as in the case of the six-sided vs. the
twenty-sided dice. Other times, we can count on (ha, ha) mathematical tools to assist. Two of the most important
tools in probability theory and – more importantly for our purposes – for statistical inference are Binomial
Probability and Bayes’s Theorem.
Let’s talk some more about flipping coins (I promise this will lead to non-coin applications of an important
principle). The probability of each outcome of a coin flip is a conditional probability: it is conditioned on the
assumption that heads and tails are equally likely. The probability associated with multiple coin flips is also a
conditional probability: it is conditioned on the assumption of equal probability of heads and tails and the number
of flips. For example: the likelihood of getting one heads in one flip of a fair coin is the probability of heads given that the probability of heads on each flip is 0.5 and also given that there is one flip – and that likelihood is 0.5. Let's write that out symbolically and introduce the notation π, defined as the probability of an event on any one given trial (and not the ratio of the circumference of a circle to its diameter), and use N to mean the total number of trials:
p(1 Heads|π = 0.5, N = 1) = 0.5
The probability of getting one heads in two trials is the probability of getting heads on the first flip and tails on the second flip or of getting tails on the first flip and heads on the second flip:
p(1 Heads|π = 0.5, N = 2) = (0.5)(0.5) + (0.5)(0.5) = 0.5
The probability of getting one heads in three trials is the probability of getting heads on the first flip and tails on the second flip and tails on the third flip or of getting tails on the first flip and heads on the second flip and tails on the third flip or of getting tails on the first flip and tails on the second flip and heads on the third flip:
p(1 Heads|π = 0.5, N = 3) = (0.5)(0.5)(0.5) + (0.5)(0.5)(0.5) + (0.5)(0.5)(0.5) = 0.375
We could calculate the probability of any set of outcomes given any number of flips of coins by identifying the
probability of each possible outcome given the number of flips, identifying all of the ways we could get to those
outcomes, and adding all of those probabilities up. But, there is a better way.
A flip of a fair coin is an example of a binomial trial39. A binomial trial is any defined experiment where we are
interested in two outcomes. Other examples of binomial trials include mazes in which an animal could turn left or
right and survey questions for which an individual could respond yes or no. A binomial trial could also involve
more than two possible outcomes that are arranged into two groups: for example, a patient in a clinical setting could
be said to improve, stay the same, or decline; those data could be treated as binomial by arranging them into the
binary categories improve and stay the same or decline. Similarly, an exam grade for a student can take on any value
on a continuous scale from 0 – 100, but the value could be treated as the binary pair pass or fail.
The probability π can be any legitimate probability value: it can range from 0 to 1. Figure 4.17 adapts Figure 4.11 –
which was a probability tree depicting two flips of a fair coin – to generally describe two consecutive binomial
trials. Instead of the probability value of 0.5 that was specific to the probabilities of flipping either heads or tails,
we’ll call the probability of one outcome π, which makes the probability of the other outcome 1 − π. Instead of
heads and tails, we’ll use s and f , which stand for success and failure. Yes, “success” and “failure” sound
judgmental. But, we can define either of a pair of binomial outcomes to be the “success,” leaving the other to be the
“failure” – it’s a totally arbitrary designation and it just depends on which outcome we are interested in.
p(0s) = (1 − π)^2
In general, the likelihood of a set of outcomes is the probability of getting to each outcome times the number of ways
to get to that outcome. The probability of each path is known as the kernel probability40 and is given by:
kernel probability = π^s(1 − π)^f
Given the probability of each path, the overall probability is the sum of the paths. In other words, the binomial
likelihood is the product of the kernel probability and the number of possible combinations represented by the kernel
probability. The number of possible combinations is given by the combination formula:
NCs = N!/(s! f!)
Putting the kernel probability and the number of possible combinations together gives the binomial likelihood function:
p(s|N = s + f, π) = [N!/(s! f!)] π^s(1 − π)^f
The binomial likelihood function makes it easy to find the likelihood of a set of binomial outcomes knowing only (a)
the probability of success on any one trial and (b) the number of trials. For example:
1. What is the probability of getting exactly 7 heads in 10 flips of a fair coin? We'll call heads a success, the probability of heads on any one flip of a fair coin is π = 0.5, and there are N = 10 trials.
p(s = 7|π = 0.5, N = 10) = [10!/(7! 3!)] (0.5)^7(0.5)^3 = 0.1171875
2. What is the probability of drawing exactly three blue marbles in five draws with replacement41 from a jar containing 1 blue marble and 4 orange marbles?
We'll call blue a success, we know the probability of drawing a blue marble from a jar with one blue marble and four orange marbles is 1/5 = 0.2, and N = 5 trials.
p(s = 3|π = 0.2, N = 5) = [5!/(3! 2!)] (0.2)^3(0.8)^2 = 0.0512
3. The probability of winning a game of craps at a casino is approximately 49%. If 15 games are played, what is the
probability of winning at least 12?
We’ll call winning a success (not much of a stretch there), we are given that π = 0.49, and there are N = 15 trials.
This question specifically asks for the probability of at least 12 successes, which means we are looking for the
probability of winning 12 or 13 or 14 or 15 games. In other words, we have union probabilities of mutually
exclusive events (you can’t win 12 and 13 games out of 15), so we add them.
p(s ≥ 12|π = 0.49, N = 15) = [15!/(12! 3!)] (0.49)^12(0.51)^3 + [15!/(13! 2!)] (0.49)^13(0.51)^2 + [15!/(14! 1!)] (0.49)^14(0.51)^1 + [15!/(15! 0!)] (0.49)^15(0.51)^0 = 0.01450131
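All three of these examples can be checked with R's built-in binomial functions – dbinom() gives the binomial likelihood for a single value of s, and summing dbinom() over several values of s handles the "at least" question:

dbinom(7, size = 10, prob = 0.5)             # example 1: 7 heads in 10 flips, 0.1171875
dbinom(3, size = 5, prob = 0.2)              # example 2: 3 blue marbles in 5 draws, 0.0512
sum(dbinom(12:15, size = 15, prob = 0.49))   # example 3: at least 12 wins in 15 games, about 0.0145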
A nice property of binomial probability is that the expected value and the variance are especially simple to find. We
can still use the typical equation for the expected value:
E(x) = ∑ᵢ₌₁ᴺ xᵢ p(xᵢ)
but consider that binomial outcomes are binary data and can be assigned values of 1 and 0. For example, let’s
consider 2 flips of a fair coin where heads is considered success and is assigned a value of 1. The sample space for
s is Ω = {0, 1, 2}. As we have noted elsewhere, the probability of 0 heads in 2 flips is 0.25, the probability of 1
heads in 2 flips is 0.5, and the probability of 2 heads in 2 flips is 0.25. Thus, the expected value of s is:
E(s) = 0(0.25) + 1(0.5) + 2(0.25) = 1
Thus, in two flips of a fair coin, we can expect one heads. In general, the expected value of a set of N binomial
trials with a p(s) = π where s = 1 and f = 0 is:
E(x) = N π
The variance is similarly easy. The variance of a set of N binomial trials with a p(s) = π where s = 1 and f = 0
is:
V (x) = N π(1 − π)
To illustrate using the general variance formula for the case of two flips of a fair coin (where we have already shown that E(x) = 1):
V(x) = (0 − 1)^2(0.25) + (1 − 1)^2(0.5) + (2 − 1)^2(0.25) = 0.25 + 0.25 = 0.5
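A quick R check (sketch only; names are mine) confirms that the long-hand weighted sums and the Nπ and Nπ(1 − π) shortcuts agree for two flips of a fair coin:

s <- 0:2                                 # possible numbers of heads in two flips
p.s <- dbinom(s, size = 2, prob = 0.5)   # 0.25, 0.50, 0.25

E.x <- sum(s * p.s)                      # long-hand expected value: 1
V.x <- sum((s - E.x)^2 * p.s)            # long-hand variance: 0.5

c(E.x, 2 * 0.5, V.x, 2 * 0.5 * 0.5)      # the shortcuts N*pi and N*pi*(1 - pi) give the same 1 and 0.5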
Figure 4.18: The Setup to a Brief Series of Relatively Easy Probability Problems
Please examine the contents of the two poorly-drawn jars in Figure 4.18: Jar 1 is filled exclusively with orange
marbles and Jar 2 is filled exclusively with blue marbles. Then, please consider the following questions regarding
conditional probabilities:
1. What is the probability of drawing an orange marble given that you have Jar 1?
2. What is the probability of drawing a blue marble given that you have Jar 1?
3. What is the probability of drawing an orange marble given that you have Jar 2?
4. What is the probability of drawing a blue marble given that you have Jar 2?
If we are drawing from Jar 1, we can only draw orange marbles, so there is a 100% chance of drawing orange given
that we have Jar 1 and a 0% chance of drawing blue given that we have Jar 1. The reverse is true for Jar 2: there is
a 100% chance of drawing blue and a 0% chance of drawing orange given that we have Jar 2. In this example, the
conditional probability of drawing an orange marble or a blue marble depends entirely on what jar we have. In more
formal terms, the sample space of the possible events is conditional on what jar we have: if we have Jar 1, then the
sample space is {Orange, Orange, Orange, ...}; if we have Jar 2, then the sample space is
{Blue, Blue, Blue, ...}. Yet another fancy way of saying that is to say that the choice of jar reconditions the
sample space.
Now let’s add a fifth question about these jars. Imagine that all of the marbles in Jar 1 are still orange and all of the
marbles in Jar 2 are still blue, but now the jars are opaque and a marble is drawn without looking inside.
5. Given that a blue marble is drawn, what is the probability that we have Jar 2?
Since there are no blue marbles in Jar 1 and there are blue marbles (and nothing but) in Jar 2, we must conclude that
there is a 100% chance that we have the now-much-less-transparent Jar 2.
Things would be a little trickier if there were a mixture of orange and blue marbles in each jar, or if one jar was
somehow more probable to have than the other before drawing, or if there were three, four, or more jars. Fortunately,
we have math to help us. Specifically, there is one equation that helps us calculate conditional probabilities, and it’s
a pretty important one.
Figure 4.19: Hilariously, this may or may not be a picture of the Reverend Thomas Bayes. Even more hilariously,
Thomas Bayes may or may not have been the one who first derived Bayes’s Theorem. The Bayes brand is uncertainty
and it is strong
The equation that gives the conditional probability of one event given the other is known as Bayes’s Theorem.
Bayes’s theorem follows directly from the definition of conditional probability mentioned above, that is:
The probability of A and B happening p(A ∩ B) is the product of the probability of A given B and the
probability of B:
p(A ∩ B) = p(A|B)p(B)
Since the designation of the names A and B is arbitrary, we can also rewrite that as:
p(A ∩ B) = p(B|A)p(A)
and since both p(A|B) and p(B|A) are equal to p(A ∩ B), it follows that:
p(A|B)p(B) = p(B|A)p(A)
Dividing both sides by p(B):
p(A|B) = p(B|A)p(A) / p(B)
which is Bayes’s Theorem, although it’s more commonly written with the top two terms of that numerator switched:
p(A|B) = p(A)p(B|A) / p(B)
Let’s bring back our pair of marble jars to demonstrate Bayes’s Theorem in action. This time, let’s stipulate that
there are 40 marbles in each jar. Jar 1 contains 10 orange marbles and 30 blue marbles; Jar 2 contains 30 orange
marbles and 10 blue marbles:
Figure 4.20: The Setup to a Brief Series of Conditional Probability Problems
In this set of examples, let’s assume that one of these jars is chosen at random, but which jar is chosen is unknown.
That means that we can reasonably assume that the probability of choosing Jar 1 is the same as the probability of
choosing Jar 2. As above, a marble is drawn – without looking inside of the jar – from one of the jars. Here is the
first of a new set of questions:
1. If a blue marble is drawn, what is the probability that the jar that was chosen is Jar 1?
This problem is ideally set up for using Bayes’s Theorem.42 The (slightly rephrased) question – what is the
probability of having Jar 1 given that a blue marble was drawn? – is equivalent to asking what p(A|B) is, where
event A is choosing Jar 1 and event B is drawing a blue marble. To
use Bayes’s Theorem to solve for p(A|B), we are going to need to first find p(A), p(B|A), and p(B).
p(A):
Event A is Jar 1, so p(A) is the probability of choosing Jar 1. We have stipulated that the probability of choosing
Jar 1 is the same as the probability of choosing Jar 2. Because one or the other must be chosen (i.e., the sample
space is Ω = {Jar 1, Jar 2}) and the probability of choosing Jar 1 or Jar 2 is one (i.e., p(Jar 1 ∪ Jar 2) = 1, or
equivalently, ∑ Ω = 1), then:
p(A) = p(Jar 1) / (p(Jar 1) + p(Jar 2)) = p(Jar 1) / (2 p(Jar 1)) = 1/2 = 0.5
p(B|A):
Event B is (drawing a) blue (marble), so p(B|A) is the probability of blue given Jar 1. If Jar 1 is given, then the
sample space for drawing a blue marble is restricted to the probability of drawing a blue marble from Jar 1: Jar 2
and the marbles in it are ignored. Because there are 30 blue marbles and 10 orange marbles in Jar 1 (i.e.,
Ω = {30 Blue, 10 Orange}), the probability of B|A – drawing a blue marble given that the jar is Jar 1 – is:
p(B|A) = 30/40 = 3/4 = 0.75
p(B):
Again, event B is drawing a blue marble. There is no conditionality to the term p(B): it refers to the overall
probability of drawing a blue marble from either jar. Finding p(B) is going to take a bit of math, and we can use a
probability tree to help us, too.
Figure 4.21: Probability Tree for Choosing Jars and Then Drawing Orange and Blue Marbles From Those Jars
There are two ways to get to a blue marble: (1) drawing Jar 1 with 50% probability and then drawing a blue marble
with 75% probability, and (2) drawing Jar 2 with 50% probability and then drawing a blue marble with 25%
probability. As we do with probability trees, we multiply across to find the intersection probabilities of
J ar 1 ∩ Blue and J ar 2 ∩ Blue
43 and add down to find the union probabilities of the two paths that lead to blue
marbles to get the overall probability of drawing a blue marble (p(B)):
Now that we know p(A), p(B|A), and p(B), we can calculate p(A|B): the probability that Jar 1 was chosen given
that a blue marble was drawn:
p(A|B) = p(A)p(B|A) / p(B) = (0.5)(0.75) / 0.5 = 0.75
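The same arithmetic can be written in a couple of lines of R (a minimal sketch; the jar names are just labels):

prior      <- c(Jar1 = 0.5, Jar2 = 0.5)      # p(Jar 1), p(Jar 2)
likelihood <- c(Jar1 = 30/40, Jar2 = 10/40)  # p(Blue | Jar 1), p(Blue | Jar 2)
posterior  <- prior * likelihood / sum(prior * likelihood)
posterior                                    # Jar1 = 0.75, Jar2 = 0.25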
2. What is the probability that Jar 2 was chosen given that a blue marble was drawn?
We can approach this question in two different ways. First, we can use the same steps that we did to solve the first
question, with Jar 2 now being event A and drawing a blue marble still being event B. Since we know that
p(J ar 1) = p(J ar 2), p(A) is still 0.5. p(B|A) is now the probability of drawing a blue marble from Jar 2, so
p(B|A) in this case is equal to 10/40 = 1/4 = 0.25. p(B) is still the overall probability of drawing a blue
marble, which remains unchanged: p(B) = 0.5. Thus, the probability of Jar 2 is:
p(A|B) = p(A)p(B|A) / p(B) = (0.5)(0.25) / 0.5 = 0.25
The other way we could solve that problem – the much easier way – is to note that if the probability that we have Jar
1 is 0.75, then the probability that we don’t have Jar 1 (that is, the probability that we have Jar 2) is:
p(Jar 2|B) = 1 − p(Jar 1|B) = 1 − 0.75 = 0.25
3. After drawing a blue marble from one of the jars, the marble is replaced in the same jar, the jar is shaken, and a
second marble is drawn. The second marble drawn is also blue. What is the probability that the jar is Jar 1?
To reduce confusion, let’s add some subscripts to A and B: we can add 1 to refer to anything that happened on the
first draw, and 2 to refer to what’s happening on the second draw.
p(A₂):
Event A is again Jar 1, but this time, p(A) is different. Having already drawn a blue marble from that same jar, we
must update the probability that we have Jar 1 based on that new information. In the last problem, we found the
probability that we have Jar 1 given that a blue marble was drawn. That probability is our new p(A). Thus, our new
p(A) – denoted p(A₂)45 – is p(A₂|B₁): the updated probability of having Jar 1 in this second draw given that a
blue marble was drawn on the first draw:
p(A₂) = 0.75
p(B₂|A₂):
Event B is still (drawing a) blue (marble), and the relative numbers of marbles in Jar 1 are still the same (because
we replaced the marble), so p(B₂|A₂) is still 30/40 = 0.75.
p(B₂):
The denominator of Bayes’s Theorem, like the p(A) term, also needs to be updated because we no longer believe
that having either jar is equally probable. The change happens between the first and second nodes of the probability
tree:
Figure 4.22: Probability Tree for Having a Particular Jar and Then Drawing Orange and Blue Marbles From That
Jar Having Once Drawn a Blue Marble From That Jar
There are still two ways to get to a blue marble, but the probability associated with each path has changed: (1)
drawing Jar 1 with 75% probability and then drawing a blue marble with 75% probability, and (2) drawing Jar 2
with 25% probability and then drawing a blue marble with 25% probability. As before, we multiply across to find
the intersection probabilities of Jar 1 ∩ Blue and Jar 2 ∩ Blue46 and add down to find the union probabilities of
the two paths that lead to blue marbles to get the updated overall probability of drawing a blue marble (p(B₂)):
p(B₂) = (0.75)(0.75) + (0.25)(0.25) = 0.5625 + 0.0625 = 0.625
With new values for each term of Bayes’s Theorem – p(A₂), p(B₂|A₂), and p(B₂) – we can calculate p(A₂|B₂):
the probability that Jar 1 was chosen given that a blue marble was drawn on consecutive draws:
p(A₂|B₂) = p(A₂)p(B₂|A₂) / p(B₂) = (0.75)(0.75) / 0.625 = 0.9
Now we are more certain that we have Jar 1 in light of additional evidence: the probability was 0.75 after drawing
one blue marble from the jar, and now that we have drawn two consecutive blue marbles (with replacement) from
that jar, the probability is 90%. In other words, drawing two marbles leads to greater confidence that we have the
blue-heavy jar in our hands. Additionally, we know that if there is a 90% probability that we have Jar 1, then there
is a 100% − 90% = 10% chance that we have Jar 2: it is now nine times more likely that we have Jar 1 than Jar 2. Let’s
ask one more question – this one a two-parter – before we move on:
4. What is the probability that the jar is Jar 1 if the next draw (after replacing the blue marble) is:
a. a blue marble?
b. an orange marble?
I will leave going through all of the steps again as practice for the interested reader and skip to the Bayes’s Theorem
results. For part (a), if another blue marble is drawn, then A₃ is having Jar 1 given the two previous draws and B₃
is drawing a blue marble on the third draw:
p(A₃|B₃) = p(A₃)p(B₃|A₃) / p(B₃) = (0.9)(0.75) / ((0.9)(0.75) + (0.1)(0.25)) = 0.675 / 0.7 ≈ 0.96
With three consecutive draws of blue marbles, the probability is high – 96%, to be precise – that we have the
majority-blue jar.
Let’s address part (b) of the question: what happens if we draw an orange marble? If we, in fact, were drawing from
Jar 1, drawing an orange marble would be far from impossible – there are 10 orange marbles in Jar 1. But, since
orange marbles are much more likely to come from Jar 2, having Jar 1 would be less certain than if we had drawn a
third consecutive blue marble.
For this part of the question, event B₃ is drawing an orange marble on the third draw, and event A₃ is having Jar 1
given the information we got from the first two draws. Again, we’ll leave all of the steps as an exercise for the
interested reader and skip to the final Bayes’s Theorem equation:
p(A₃|B₃) = p(A₃)p(B₃|A₃) / p(B₃) = (0.9)(0.25) / ((0.9)(0.25) + (0.1)(0.75)) = 0.225 / 0.3 = 0.75
…and so the probability that we have Jar 1 takes a step back.47 Updating is an important feature of conditional
probabilities calculated using Bayes’s Theorem. Drawing marbles with replacement from a jar doesn’t change the
probabilities associated with those marbles: if the jar contains a lot more orange marbles than blue marbles and we
pick three blue marbles in a row, we’d just say “huh, that’s weird” and move on with our lives – low-probability
events are not impossible events. But, if we know the relative numbers of marbles in several jars but which jar we
have is unknown, then the chances of having one of those jars can constantly be updated as we obtain more
information.
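Here is a small R sketch of the sequence of updates we just worked through by hand (the draw sequence blue, blue, orange is the one from the examples above; the function names are just for illustration):

update_jars <- function(prior, draw) {
  # p(draw | Jar 1) and p(draw | Jar 2) for the 30-blue/10-orange and 10-blue/30-orange jars
  likelihood <- if (draw == "blue") c(0.75, 0.25) else c(0.25, 0.75)
  posterior <- prior * likelihood
  posterior / sum(posterior)
}
p_jar <- c(Jar1 = 0.5, Jar2 = 0.5)
for (draw in c("blue", "blue", "orange")) {
  p_jar <- update_jars(p_jar, draw)
  print(p_jar)   # 0.75/0.25, then 0.90/0.10, then back to 0.75/0.25
}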
If you are still reading this, you may have noticed that I’ve belabored the discussion of jars and marbles to an extent
that might suggest that I’m talking about jars and marbles but at the same time I’m not really talking about jars and
marbles.
In the labored analogy above, the jars represent hypotheses about the world and the marbles represent data that can
support or contradict those hypotheses. A hypothesis can represent a model of how a scientific process works and/or
parameters that cannot be directly measured. In the Bayesian framework of evaluating scientific hypotheses, testing
different hypotheses is like choosing an opaque, unlabeled jar: there are probabilities associated with each
hypothesis being correct (like the probabilities of choosing different jars), we have some idea of the probability of
what the data would look like given each hypothesis (like the probability of drawing a kind of marble given the jar),
and an idea of the overall probability of the data (like the overall probability of drawing a kind of marble).
When we’re using Bayes’s Theorem to refer to scientific inference, we change the notation in two subtle but
important ways: instead of A we use H to mean hypothesis and instead of B we use D to mean data. The resulting
equation is:
p(H|D) = p(H)p(D|H) / p(D)
Each of the terms of Bayes’s Theorem has a special name that reflects its role in the inferential process.
4.9.2.1.1.1 Prior Probability
p(H) is the probability of the hypothesis; more specifically, it is the prior probability.48 The prior probability
(often referred to simply as the prior) is the probability of a given hypothesis before data are collected in a given
experiment or investigation. We may initially have no reason to believe that one hypothesis is any more likely than
the others under consideration: in that case, prior probabilities can be based on equal assignment.49 On the other
hand, we may have reason to consider some hypotheses as much more likely than others based on our scientific
beliefs and/or existing evidence. And, just as we updated p(A) for jars based on marbles, we can use the results of
one Bayesian investigation to update p(H ) for subsequent investigations – more on that below.
4.9.2.1.1.2 Likelihood
p(D|H ) is the probability of the data given the hypothesis, and is known as the likelihood. Any conditional
probability is a likelihood – the left side of our Bayes’s Theorem equation (p(H |D)) is also a likelihood, but it has
its own name so we almost never refer to it as such – hence the name. The likelihood informs us how likely or
unlikely the observed data are if a hypothesis is true. The likelihood of the data is typically determined by a
likelihood function: if the observed data are close to what the likelihood function associated with a hypothesis
would predict, then the likelihood of the data given the hypothesis will be relatively high; if the observed data are
way off from the predictions that the likelihood function makes, the likelihood of the data given the hypothesis will
be relatively low. How do we know what the likelihood function for a hypothesis should be? Well, sometimes it’s
obvious, and sometimes it’s tricky, and sometimes there isn’t one so we use computer simulations to get the
likelihood by brute force. In any event, it’s usually not as simple as the marble-and-jar analogy might make it seem,
but we will cover that at length in later content.
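As a small illustration (not the only way a likelihood function can be defined), suppose two hypotheses about a binomial parameter – π = 0.5 versus π = 0.7 – and an observed result of 7 successes in 10 trials. The binomial likelihood of those data under each hypothesis can be computed in R:

dbinom(7, size = 10, prob = 0.5)   # p(D|H1): about 0.117
dbinom(7, size = 10, prob = 0.7)   # p(D|H2): about 0.267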
4.9.2.1.1.3 Base Rate
p(D) is the overall probability of the data and is known as the base rate (the unconditioned
probability of observed events). In theoretical terms, the base rate – like the probability of drawing a certain type
of marble from all jars – is the combined probability of observing the data under all possible hypotheses. In
practical terms, the base rate is the value that makes the results of Bayes’s Theorem (as discussed just below, the
left-hand side of Bayes’s Theorem is known as the posterior probability) for all of the hypotheses sum to one. That
is, if we have n hypotheses, and we have a prior and a likelihood for each hypothesis i, then the base rate is:
p(D) = ∑_{i=1}^{n} p(H_i)p(D|H_i)
Please note that this is precisely the calculation we used for the denominator of Bayes’s Theorem in the
marbles-in-jars problems – p(B) = p(Jar 1)p(B|Jar 1) + p(Jar 2)p(B|Jar 2) – just using different terms.
4.9.2.1.1.4 Posterior Probability
The final term to discuss is the left-hand side of Bayes’s Theorem: the probability of the hypothesis given the data,
p(H|D), which is known as the posterior probability.50 The posterior probability (often just called the posterior)
is the updated probability of a hypothesis after taking into account the prior, the likelihood, and the base rate. In the
Bayesian framework of scientific inference, it is the answer we have after conducting an investigation. However, it
is not necessarily our final answer: as illustrated in our marbles-in-jars problems, we can update probabilities by
obtaining more evidence. In the scientific context, this means that we can use our posterior for one study as the prior
for the next study.
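In code, the whole theorem for a discrete set of hypotheses is just a normalized product of priors and likelihoods. Here is a minimal R sketch (the two-hypothesis numbers reuse the dbinom illustration above and are purely hypothetical):

bayes_update <- function(prior, likelihood) {
  # p(H|D) = p(H) p(D|H) / p(D), where p(D) is the sum of p(H_i) p(D|H_i)
  prior * likelihood / sum(prior * likelihood)
}
bayes_update(prior = c(H1 = 0.5, H2 = 0.5),
             likelihood = c(H1 = dbinom(7, 10, 0.5), H2 = dbinom(7, 10, 0.7)))
# posterior: H1 roughly 0.31, H2 roughly 0.69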
4.10 Monte Carlo Methods
Monte Carlo methods are named after the Casino de Monte-Carlo (pictured above). One of the developers of the
technique had an uncle with a gambling problem who frequented the Monte Carlo Casino.
The practice of using repeated random sampling to solve problems in statistics (and other fields, including physics
and biology) is known as using Monte Carlo methods or, equivalently, as using Monte Carlo simulations or Monte
Carlo sampling. Here, we will introduce Monte Carlo methods with some relatively straightforward examples.
4.10.1 Random Walk Models
Sticking with the casino theme: let’s imagine a gambler playing craps for $10 per game. On any given game, she has
a 49% chance of winning (in which case her total money will increase by $10) and a 51% chance of losing (she will
lose $10).51 How can we project how her money will be affected by playing the game?
We know that, over the long run, she will win 49% of her games, and we can calculate that the expected value of
each gamble is ($10)(0.49) + (−$10)(0.51) = −$0.20, meaning that she can expect to lose 20 cents per game. But,
gambling doesn’t really work that way: if the gambling experience consisted of periodically handing 20 cents to the
casino, nobody would gamble (I think). Our gambler is going to win some games, and she is going to lose some
games. It’s possible that she will, over a given stretch, win more games than she loses and will leave the craps table
when she hits a certain amount of winnings. It’s also possible that she will lose more games than she wins and will
leave the craps table when she is out of money. Or, she might not win as much as she wants nor lose all her money,
but get tired after a certain number of games.
We can model our hypothetical gambler’s monetary journey using a random walk model52. A random walk model
(sometimes more colorfully called the drunkard’s walk) is a (usually visual) representation of a stochastic53
process, and has been applied in psychology to the process of arriving at a decision, in physics to the motion of
particles, in finance to the movement of markets, and in other fields.
In the case of our gambler, let’s say she has $100 with her. She will stop playing if: (1) she doubles her money
(i.e., she reaches $200), (2) she loses all of her money, or (3) she has played 200 games.
We can see how one simulation based on Monte Carlo sampling plays out in Figure 4.23.
On this specific walk, our gambler starts strong, winning her first four games and nine of her first 11 to go up to
an early peak of $70 in winnings. The rest of her session doesn’t go quite so well, but she is still not broke after 200
games: she walks away with $70, presumably to be spent on overpriced adult beverages by the pool.
But, it is important to note that this is just one possible set of circumstances. To get a better idea of what she could
expect, we would want to simulate the experience lots of times. The expected value here (still −$0.20) is less
interesting a prediction than how many times she leaves having hit her goal earnings, how many times she loses all
her money, and how many times she ends up somewhere in between. A small set of multiple random walk
simulations is visualized in Figure 4.24.
In these 9 simulations, our gambler leaves the game up $100 three times, loses her whole bankroll four times, and
leaves after 200 games twice.
Typically, simulations are run in greater numbers – just as one run can appear to be lucky or unlucky, a handful of
simulations is rarely compelling. Thus, let’s try a greater number of random walks. In a run of 1,000 simulations
(don’t worry, we won’t make a figure for this one), in 424 of them (42.4%), our gambler left with $200, in 449
(44.9%) of them, she left with $0, and in 127 (12.7%), she left the table after 200 games having neither hit her goal
nor having lost all of her money. Please note how different this is from what would be predicted using the
expected value of each gamble: if she lost $0.20 every time she played (which, of course, is impossible, because she
could either win $10 or lose $10), then she would be guaranteed to leave with $60. The simulation results are likely
more meaningful because they describe likely outcomes based on realistic behavior.
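Here is one way such a simulation might be written in R; the helper function and its defaults are a sketch based on the stopping rules described above, not the exact code used to produce the figures:

simulate_session <- function(bankroll = 100, bet = 10, p_win = 0.49,
                             goal = 200, max_games = 200) {
  for (game in 1:max_games) {
    bankroll <- bankroll + ifelse(runif(1) < p_win, bet, -bet)
    if (bankroll <= 0 || bankroll >= goal) break   # broke, or hit the goal
  }
  bankroll
}
set.seed(8675309)
endings <- replicate(1000, simulate_session())
mean(endings >= 200)   # proportion of sessions that hit the $200 goal
mean(endings <= 0)     # proportion of sessions that went broke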
One psychological application of Markov Chain models is in learning and memory. For example, in William Estes’s
(1953) 3-state learning model,54 there are three possible states for an item to be learned: the item could be in the
unlearned state, in a short-term state (where it may be temporarily memorable but not for long), or in a long-term
state (that is relatively permanent but from which things can be forgotten).
According to the Estes model (see Figure 4.26), during the course of learning, an item starts out in an unlearned state
– which we will call 1 – and will stay in state 1 with probability a. Given that the item stays in state
1 with probability a, it follows that it will leave that state with probability 1 − a. If the item leaves the unlearned
state, it has two places to go: the short-term state – state 2 – and the long-term state – state 3. We can designate the
probability of going to state 2 as b and the probability of going to state 3 as 1 − b: therefore, the probability of an
item leaving state 1 and going to state 2 is (1 − a)(b) and the probability of an item leaving state 1 and going to
state 3 is (1 − a)(1 − b). We can then continue, designating the probability of an item in state 2 staying there as
c and of an item leaving state 2 as 1 − c, etc., etc., as shown in Figure 4.26.
Just as we designated the probabilities of winning and losing in the random walk, we can designate the probabilities
a, b, c, d, e, and f and use those probabilities to repeatedly simulate the stochastic motion of items between memory
states.
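A simulation of that kind might look something like the following R sketch; the transition probabilities in the matrix are made-up illustrative values, not estimates from Estes’s work:

# Rows: current state (1 = unlearned, 2 = short-term, 3 = long-term)
# Columns: the state the item occupies on the next study trial
P <- matrix(c(0.60, 0.30, 0.10,
              0.20, 0.50, 0.30,
              0.05, 0.05, 0.90),
            nrow = 3, byrow = TRUE)
state <- 1                     # every item starts out unlearned
path <- state
for (trial in 1:10) {
  state <- sample(1:3, size = 1, prob = P[state, ])
  path <- c(path, state)
}
path                           # one item's trajectory across 10 study trials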
Far more important for us, though, is that we can use Markov Chain Monte Carlo methods to generate probability
distributions about scientific hypotheses by moving numeric estimates from beginning states to increasingly more
probable states. We’ll talk about that later.
4.11.1 Terms
Event (or Outcome – we will use them interchangeably): A thing that happens. Roll two dice and they come up 7 –
that’s an event (or, an outcome). Flip a coin and it comes up heads – that’s an outcome (or, an event).
p(A) : the probability of event A. For example, if event A is the probability of a coin landing heads, we can write:
p(A) = 0.5
or we can spell out heads instead of using A:
p(heads) = 0.5
¬A: a symbol for not A, as in “event A does not happen.” If event A is a coin landing heads, we can say that ¬A is the coin landing tails.
Trial: A single occurrence where an event can happen. “One flip of a coin” is an example of a trial.
Defined experiment: An occurrence or set of occurrences where events happen with some probability. “One flip of a
coin” is a defined experiment. “Three rolls of a single six-sided die” is a defined experiment. “On Tuesday” is a defined
experiment if you’re asking, “what’s the probability that it will rain on Tuesday.” Please note that a defined
experiment could be one trial or it could comprise multiple trials.
Sample space (symbolized by Ω or S ): All possible outcomes of a defined experiment. For example, if our defined
experiment is “one flip of a coin,” then the sample space is one heads and one tails, which we would symbolize as:
Ω = {H, T}
Elementary event: An event that can only happen one way. If we ask, “what is the probability that a coin lands heads
twice in two flips?”, there’s only one way that can happen: the first flip must land heads and the second flip must also
land heads.
Composite (or compound) event: An event that can happen multiple ways. If we ask, “what is the probability that a
coin lands heads twice in three flips?”, there are three ways that can happen: the flips go heads, heads, tails; the flips
go heads, tails, heads; or the flips go tails, heads, heads.
Mutual exclusivity: A condition by which two (or more) events cannot both (or all) occur. In one flip of a coin,
heads and tails are mutually exclusive events.
Collective exhaustivity: A condition by which at least one of a set of events must occur in a defined experiment. For
one flip of a coin, heads and tails are said to be collectively exhaustive events because one of them must happen.
Independent events: Events where the probability of one event does not affect the probability of another. For
example, the probability that your birthday is July 20 is totally unaffected by the probability that the birthday of a
randomly chosen person who is unrelated to you is also July 20.
Dependent events: Events where the probability of one event does affect the probability of another. The probability
of a twin having a birthday on July 20 is very closely related to the probability of her twin sister having a birthday
on July 20.
Union (symbolized by ∪): or. As in, “what is the probability that it will rain on Tuesday or Friday of next week?”
We may rephrase that question as “what is the union probability of rain on Tuesday or Friday?”
Intersection (symbolized by ∩): and. As in, “what is the probability that it will rain on Tuesday and Friday of next
week?” or “what is the intersection probability of rain on Tuesday and Friday?” Intersection probability is also
referred to as conjoint or conjunction probability.
4.11.2 Formulas
Probability of A and B:
p(A ∩ B) = p(A)p(B|A)
If A and B are independent, then p(B|A) = p(B).
Permutations of r items chosen from n:
nPr = n! / (n − r)!
Combinations of r items chosen from n:
nCr = n! / (r!(n − r)!)
nCr can also be written in terms of s successes and f failures (where n = s + f):
(s+f)Cs = (s + f)! / (s! f!)
The Binomial Likelihood Function (the probability of s successes and f failures in N = s + f trials):
p(s | N = s + f, π) = [N! / (s! f!)] π^s (1 − π)^f
Bayes’s Theorem:
p(A|B) = p(A)p(B|A) / p(B)
also known as:
p(H|D) = p(H)p(D|H) / p(D)
1. Algebraic answer:
The factorial of any number is that number multiplied by the factorial of that number minus one:
4! = 4 × 3 × 2 × 1 = 4 × 3!
3! = 3 × 2 × 1 = 3 × 2!
2! = 2 × 1 = 2 × 1!
n! = n(n − 1)!
Thus:
1! = 1 × 0!
1!/1 = (1 × 0!)/1
and thus:
1 = 0!
2. Calculus answer
The factorial function n! is generalized to the gamma function Γ (n + 1). R software has both a built-in factorial
function and a built-in gamma function: you can use both to see the relationship, for example:
gamma(6)
## [1] 120
factorial(5)
## [1] 120
gamma(1)
## [1] 1
factorial(0)
## [1] 1
The gamma function is defined as Γ(n + 1) = ∫_0^∞ t^n e^(−t) dt. Plugging in n = 0 gives t^0 = 1, and thus:
Γ(1) = ∫_0^∞ e^(−t) dt = [−e^(−t)]_0^∞ = −e^(−∞) + e^0 = 0 + 1 = 1
4.12.2 Excerpts from Statistics for Everybody by D. Barch, reprinted with permission from
the author
The tendency to believe that event A is less likely to occur after a long run of event A occurring is known as the
Gambler’s Fallacy. That fallacy (a fallacy is an illogical line of reasoning) takes its name from the tendency of
gamblers to think that after a long string of losses that they are due for a win. Well, they are due for wins – if they
continue to play the game forever (and have the means to do so) – but not on any one given trial. So, after ten
consecutive heads, would you put a lot of money on tails? You should not change your betting behavior in any way (if
anything, shouldn’t you suspect that the coin is messed up and bet on heads?).
In the early 1980s, cognitive psychologist Daniel Kahneman, working with his longtime collaborator Amos Tversky,
handed a flyer to 88 undergraduates at the University of British Columbia (UBC). That flyer contained the following
description of a fictional woman named Linda:
“Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was very
concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.”
(Tversky & Kahneman, 1983)
Below that passage was a list of 10 statements about Linda, including “Linda is a teacher in an elementary school,”
and “Linda works in a bookstore and takes Yoga classes,” and the students were asked to rank-order the probability
of each of the statements. Tversky and Kahneman were really only interested in responses to three of the 10
statements:
“Linda is a bank teller.”
“Linda is active in the feminist movement.”
“Linda is a bank teller and is active in the feminist movement.”
As Tversky and Kahneman had hypothesized, the majority (85%) of students ranked “active in the feminist
movement” above “bank teller and is active in the feminist movement” and, most importantly, ranked both above
“bank teller.” The researchers pointed out that such a ranking was impossible according to the rules of intersection
probability. Let’s let “active in the feminist movement” be event A and “bank teller” be event B. The intersection (or
conjunction, as noted above in the definition for intersection probability) probability of events A and B is:
p(A ∩ B) = p(A)p(B|A)
Since probabilities are values between 0 and 1, there is no probability that you can multiply p(A) or p(B) by to get a
number greater than p(A) or p(B), respectively. Thus, it is always the case that:
p(A ∩ B) ≤ p(A) and p(A ∩ B) ≤ p(B)
The only way that p(A ∩ B) can be equal to either p(A) or p(B) is if the probability of one or both of the
events is equal to 1. This is known as the conjunction rule.
Tversky and Kahneman pointed out that by ranking the probability that Linda was a feminist bank teller higher than
the probability that Linda was simply a bank teller, the UBC undergraduates were violating the conjunction rule and
committing the conjunction fallacy: assessing the probability of the conjunction as greater than the probability of one
of the events. This, they argued, was irrational based on the rules of probability theory, and showed that people were
making judgments based on heuristics – quick rules of thumb that we may rely on in place of analytic reasoning.
Linda, Tversky and Kahneman argued, was so representative of a person that would be active in the feminist
movement that participants in the experiment thought, “well, she doesn’t seem like the bank teller type, but if she is,
surely she is active in the feminist movement in her spare time and is not just a bank teller.”
Since the seminal paper from Tversky and Kahneman, other research has been conducted to explain why people
commit the conjunction fallacy. The principal antagonist to the Tversky and Kahneman viewpoint has been Gerd
Gigerenzer, who has argued (Gigerenzer, 1996; Hertwig & Gigerenzer) that those same heuristics help us make quick
and relatively smart decisions about the world around us. Still, for our purposes, let’s keep in mind that, statistically
speaking, the conjunction rule should not be violated.
1. Event (or Outcome - we will use them interchangeably): A thing that happens. Roll two dice and they come up
7 – that’s an event (or, an outcome). Flip a coin and it comes up heads – that’s an outcome (or, an event).↩
2. Sample space (symbolized by Ω or S ): All possible outcomes of a defined experiment. For example, if our
defined experiment is “one flip of a coin,” then the sample space is one heads and one tails, which we would
symbolize as: Ω = {H, T}↩
3. Mutual Exclusivity: A condition by which two (or more) events cannot both (or all) occur. In one flip of a coin,
heads and tails are mutually exclusive events.↩
4. Intersection (∩): the co-occurence of events; the intersection probability of event A and event B is the
probability of A and B occurring.↩
5. Probability Symbols: p(x): the probability of x. ∩ (cap): and; intersection. ∪ (cup): or; union. ¬: not. |:
given↩
6. Conditional Probability: a probability that depends on an event, a set of events, or any other restriction of the
sample space↩
7. Defined experiment: An occurrence or set of occurrences where events happen with some probability.↩
8. Please note that there are several ways to express probabilities – as fractions, as proportions, or as percentages
– and all are equally fine.↩
11. Independent Events: Events where the probability of one event does not affect the probability of another.↩
12. The concepts of sampling with replacement and independent events are closely related: we can define
sampling with replacement as sampling in such a way that each sample is independent of the last.↩
13. Sampling without Replacement: The observation of events that change the sample space↩
14. Dependent Events: Events where the probability of one event affects the probability of another.↩
15. Sampling without replacement and dependent events are related in the same way that sampling with
replacement and independent events are: sampling without replacement implies dependence between events.↩
16. note that we cannot use the simplification p(B|A) = p(B) because these are not independent events↩
17. In general, the intersection probability of independent events is the product of their individual probabilities:
f or independent A, B, C :
18. There are going to be several gambling-related examples in our discussion of probability theory, and I really
don’t mean to encourage gambling. But much of our understanding of probability theory comes from people
trying to figure out how games of chance work (that and insurance). When we consider how much of the rest of
the field of statistics was derived to justify and promote bigotry, though, promoting gambling isn’t bad by
comparison. ↩
19. One can win smaller amounts of money for picking all four digits in any order, or for picking the first three
digits in the correct order, or for picking the last three digits in the correct order…it gets complicated. I know
all this because my maternal grandparents loved The Numbers Game (almost as much as they hated each other)
and their strategies were a frequent topic of conversation whenever I was at their house. Anyway, for
simplicity’s sake, let’s just focus on the jackpot.↩
20. We know from studying probability theory that outcomes that seem more random are not more random at all:
“3984” and “7777” are precisely as likely as each other to happen. So why do numbers like “7777” or “1234”
stand out? It’s the cognitive illusion caused by salience: there are just a lot more number combinations that
don’t look cool than number combinations that do, and the cool ones stick out while all the uncool ones don’t
get remembered.↩
21. “A well-shuffled deck of ordinary playing cards” is a term that teachers are legally required (I think – don’t
google that) to use when teaching probability theory – it just means that each card in the deck has an equal (
1/52) chance of being dealt.↩
22. The value of card combinations in games like poker mostly track with the probabilities of those combinations –
for example, a pair of aces is more likely and also less valuable than three aces – and sometimes are based on
arbitrary rules to improve game play – for example, a pair of aces is more valuable but just as probable to get
as a pair of kings, but having relative ranks of cards limits ties and improves gameplay.↩
23. Union (∪): the total sum of the occurrence of events; the union probability of event A and event B is the
probability of A or B occurring.↩
24. The days and times that people are born are not really random but let’s temporarily assume they are for the sake
of this example.↩
25. The union probability of three mutually exclusive events p(A) + p(B) + p(C) uses the exact same equation,
it’s just that all of the intersection terms are 0.↩
26. As the most likely outcome, 7, in addition to being the expected value, is also the mode. And, since there are
precisely as many outcomes less than 7 as greater than 7, it’s also the median.↩
27. Only one of the three elements is named after an actual tree part, a fact that I never thought about before now but
it kind of bums me out.↩
28. for more on similar cognitive illusions regarding probability, see the bonus content below↩
29. Elementary Events: events that can only happen in a single way.↩
30. Composite (or Compound) Events: events that can be broken down into multiple outcomes; collections of
elementary events (note: in this context composite and compound are identical and will be used
interchangeably).↩
31. That is not to say that the probabilities of all elementary events in a sample space are always equal to each
other – for example, the probability of the elementary events me winning gold in men’s figure skating at the
next Winter Olympics and an actually good skater winning gold in men’s figure skating at the next Winter
Olympics are most certainly not the same – but in this case, they are.↩
32. This is a good time to point out that 0! = 1. You can live a long, healthy, and scientifically fruitful life just
taking my word for it on that, but for two (hopefully compelling) explanations please see the bonus content
down below.↩
35. This was, verbatim, a question that I got right for my team at pub trivia mainly because I had just taught a lecture
on probability theory and the combination formula was fresh in my head. Also, unless a trip is going to be like
years long, anything more than one book is too ambitious for me↩
38. In common conversation, the terms probability, likelihood, and odds are used interchangeably and without too
much confusion. In statistics, the distinctions between the terms are meaningful: probability can refer to any
probability, while likelihood refers specifically to a conditional probability and odds refers specifically to a
relative probability.↩
39. Binomial Trial: a defined experiment with two possible outcomes; also known as a Bernoulli trial↩
41. Binomial trials are always based on the assumption of sampling with replacement.↩
42. The term inverse probability has been used – mostly in the past and originally derisively – for the kinds of
broader probability problems to which this problem is an analogy. If what is the probability that marble x is
drawn from jar y is a probability problem, then what is the probability that the jar is jar y given that marble
x was drawn from it is the inverse of that. But, today, such problems are more appropriately referred to as
Bayesian probability problems.↩
43. We could also find the intersection probabilities of J ar 1 ∩ Orange and J ar 2 ∩ Orange , but those aren’t
really that important for this problem right now.↩
44. Another way of saying the probability is three times greater is the odds in favor of having Jar 1 are 3:1, and
another way of saying that is the odds ratio for having Jar 1 is 3. But, we’ll get to that in the section on odds.↩
45. Alternatively, we could just keep calling it p(A) and say p(A) = 0.75,
keeping in mind that p(A) has been updated. Mathematical notation is supposed to clarify problems rather than
to make them more difficult, so whichever approach is more understandable for you is best.↩
47. For this particular example, that step goes back precisely to where the probability was after the first draw of a
single blue marble, but that’s not necessarily the case: the probability will go down, but not necessarily by the
same amount that it went up before – that just happened because of the symmetry of 0.75 and 0.25 around 0.5.↩
48. Prior Probability: The probability of a hypothesis before being conditioned on a specified set of evidence.↩
49. Equal-assignment-based prior probabilities are fairly common and are known as flat priors↩
51. Dear gambling aficonados: for the purposes of this example, we’ll ignore all the side bets and all that.↩
52. Random walk model: a representation of a series of probabilistic shifts between states.↩
In one such example, the question of the respective probabilities that a drawn
blue marble came from one of two jars (see Figure 1 below) was posed.
The probability that a blue marble was drawn from Jar 1 is 0.75 and the
probability that it was drawn from Jar 2 is 0.25: it is three times more probable
that the marble came from Jar 1 than from Jar 2.
Now, let’s say we have a jar with a more unusual shape, perhaps something like
this…
…and the marbles we were interested in weren’t randomly mixed in the jar but
happened to sit in the upper corner of the jar in the space shaded in blue…
Figure 5.3: Figure 3. Areas of Residence for Interesting Marbles in An
Unusually-shaped Jar
…and somehow it were equally easy to reach every point in this jar (the analogy
is beginning to weaken). If you were to draw an orange marble from this strange
jar, you might think nothing of it. If you were to draw a blue marble, you might
think the idea that it came from this jar rather odd. You might even reach a
conclusion like the following:
The probability that the blue marble came from that mostly-orange jar is so
small that – while there is a small chance that I am wrong about this – I
reject the hypothesis that this blue marble came from this jar in favor of the
hypothesis that the blue marble came from some other jar.
If we turn that jar upside-down, and we put it on a coordinate plane, then we will
have something like this:
Figure 5.4: Figure 4. Metamorphosis of an Unusually-shaped Jar
Swap out the concept of drawing a marble for observing a sample mean, and
this is now a diagram of a one-tailed t test.
The above thought exercise was meant to be the statistical version of this:
This page provides some connection between the marbles and jars (and the coin
flips and the dice rolls and the draws of playing cards from well-shuffled decks)
and statistical analyses. Frequency distributions are tools that will help us
understand the events and sample spaces needed to make probabilistic inferences
for statistical analysis.
The binomial likelihood function
p(s | N = s + f, π) = [N! / (s! f!)] π^s (1 − π)^f
was derived in the page on Probability Theory. The binomial likelihood gives the
probability of s events of interest (which we have called successes) in N trials
where the probability of success on each trial is π and f represents the number of
trials where success does not occur (i.e., f = N − s).
For example, imagine that the manager of a supermarket wants to know what the
probability is that shoppers will choose an item placed in the middle of a display
(box B in Figure 5.5 below) of three items rather than the items on either side (A
or C ):
If each item were equally likely to be chosen, then the probability of a person
choosing B is p(B) = 1/3, and the probability of a person choosing not B is
p(¬B) = p(A ∪ C) = 1/3 + 1/3 = 2/3. Suppose, then, that 10 people walk
by the display. What is the probability that 0 people choose item B? That is, what
is the probability that B occurs 0 times in 10 trials? According to the binomial
likelihood function:
p(0 | π = 1/3, N = 10) = [10! / (0! 10!)] (1/3)^0 (2/3)^10 = 0.017
Were we so inclined – and we are – we could construct a table for each possible
number of s given π = 1/3 and N = 10:
s p(s|π, N )
0 0.01734
1 0.08671
2 0.19509
3 0.26012
4 0.22761
5 0.13656
6 0.05690
7 0.01626
8 0.00305
9 0.00034
10 0.00002
∑ (s = 0 → 10)   1
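The entire table can be generated with one call to R’s built-in dbinom() function (a quick sketch; dbinom() comes up again below):

s <- 0:10
round(dbinom(s, size = 10, prob = 1/3), 5)   # the p(s|π, N) column
sum(dbinom(s, size = 10, prob = 1/3))        # the probabilities sum to 1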
Binomial distributions are positively skewed, like the one in Figure 5.6 above,
whenever π is between 0 and 0.5, negatively skewed whenever π is between 0.5
and 1, and symmetrical whenever π is exactly 0.5.2 The effect of changing π is
shown below in Figure 5.7. For each distribution represented in Figure 5.7,
N = 20, and the only thing that changes is π.
Figure 5.7: The Binomial Distribution for N = 20 and Various Values of π
Figure 5.9 shows a series of binomial distributions each with π = 0.5 and with
N ranging between 1 and 20.
Figure 5.9: Binomial Distributions for π = 0.5 and Various Values of N
As N gets bigger, the binomial distribution looks decreasingly like the blocky
figure representing a coin flip and increasingly like a curve: specifically like a
normal curve. That is not a coincidence. In fact, there are meaningful connections
between the binomial distribution and the normal distribution, as discussed
below.
Recall the general formulas for the expected value and variance of a discrete probability distribution:
E(x) = ∑_{i=1}^{N} x_i p(x_i)
V(x) = ∑_{i=1}^{N} (x_i − x̄)^2 p(x_i)
For the binomial distribution, the expected value simplifies to:
E(x) = ∑_{i=1}^{N} x_i p(x_i) = Nπ
It’s a little more complex to derive, but in the end, the variance of a binomial
distribution has a similarly simple form:
V(x) = ∑_{i=1}^{N} (x_i − x̄)^2 p(x_i) = Nπ(1 − π)
Please note that these simplifications work only for
binomial data – it’s just something special about the binomial. Please also note
that these equations will come in handy later.
That is, for a discrete probability distribution, we get the cumulative probability
at event s by adding up the probability of s and the probabilities of each possible
smaller value of s. We can replace each value in the table above – which showed
the discrete probabilities for each possible value of s for a binomial distribution
with N = 10 and π = 1/3 – with the cumulative probability for each value of s.
In the table below, the cumulative probability for s = 0 equals the discrete
probability for s = 0 because there are no smaller possible values of s than 0.
The cumulative probability for s = 1 is equal to p(s = 1) + p(s = 0), the
cumulative probability for s = 2 is equal to p(s = 2) + p(s = 1) + p(s = 0), etc.,
until the largest possible value of s, for which the cumulative probability is 1:
s   p(s ≤ s|π, N)
0   0.01734
1   0.10405
2   0.29914
3   0.55926
4   0.78687
5   0.92343
6   0.98033
7   0.99659
8   0.99964
9   0.99998
10  1.00000
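R’s pbinom() function returns these cumulative probabilities directly (a quick sketch):

round(pbinom(0:10, size = 10, prob = 1/3), 5)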
In R, we could calculate the binomial likelihood function
p(s | N, π) = [N! / (s! f!)] π^s (1 − π)^f
by translating it directly into code as:
(factorial(N)/(factorial(s)*factorial(f)))*(pi^s)*(1-pi)^f
But R also has a built-in set of commands to work with several important
probability distributions, the binomial included. This set of commands is based
around a set of prefixes – indicating the feature of the distribution to be evaluated
– a set of roots – indicating the distribution in question – and parameters inside
of parentheses. There are a number of helpful guides to the built-in distribution
functions in R, and we will have plenty of practice with them.
For example, if we want to know p(s = 3|N = 10, π = 1/3), we can enter:
dbinom(3, 10, (1/3)) which will return 0.2601229.
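To check that the built-in function agrees with the hand-written formula, here is a minimal sketch (the probability is stored in its own variable rather than in pi, since pi is a built-in constant in R):

N <- 10; s <- 3; f <- N - s; prob <- 1/3
(factorial(N) / (factorial(s) * factorial(f))) * prob^s * (1 - prob)^f
dbinom(s, N, prob)   # both lines return 0.2601229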
Here is the form of the function that describes the normal distribution:
f(x | μ, σ) = [1 / (σ√(2π))] e^(−(1/2)((x − μ)/σ)^2)
As equations go, it’s kind of a lot. Please feel free not to memorize it: you won’t
need to produce it in this course, and when and if you need to use it in the future,
it comes up after a pretty quick internet search. But, while you’re here, I would
like to point out a couple of features of it:
1. The left side of the equation – f (x|μ, σ) – indicates that the normal
distribution is a function of x given μ and σ.
2. x, μ and σ are the only variables in the equation (π is the constant 3.1415…
and e is the constant 2.7183…).
Points (1) and (2) together mean that μ and σ (or σ^2 – it doesn’t really matter
because if you know one, then you know the other) are the sufficient statistics for
the normal distribution. All you need to know about the shape of a normal
distribution are those two statistics. You might ask: what about the skewness and
the kurtosis? It’s a good question and the answer is quasitautological6: if the
distribution has any skew and/or any excess kurtosis, then it’s not a normal
distribution by definition.
3. The equation describes the probability density at each point on the x-axis,
not the probability of any point on the x-axis.
This is one of the more counterintuitive facts about probability theory – I think
it’s counterintuitive because we seldom think about numeric values as area-less
points in a continuous range. Here’s a hopefully-helpful point (ha, ha) to
illustrate: please imagine that you are interested in finding the probability that a
person selected at random is 30 years old. Based on the proposition described in
the previous sentence, we wouldn’t naturally think that the probability is
extremely small or even zero: there are lots of people who are 30 years old.
Now imagine that you were interested in finding the probability that a person
selected at random is 37 years, 239 days, 4 hours, 12 minutes, and 55.43218
seconds old: we would naturally think that that probability would be
infinitesimally small, and as the age of interest got even more precise (i.e., more
and more digits got added to the right of the decimal on the number of seconds),
that probability would shrink toward zero.
What’s the difference between those two questions? When we’re not in stats
class, the phrase 30 years old implies between 30 years and 30 years and
364.9999 days old: the implication is of a range. A range of values can have a
meaningful probability. A range of values has an area under the curve, which
corresponds to meaningful probability, sort of like how the physical density of an
object can be translated into a meaningful mass (or weight) when we know how
large the object is. If that’s not clear yet (or even if it is), an illustration of the
field of plane geometry might be helpful. In plane geometry, a point has no area.
A line has no area, either. But a shape does have an area:
Similarly, the bars of a binomial distribution have width as well as height: the width of each bar
associated with a value of s is 1; so the area is equal to the height of each bar
times 1, which simplifies to the height of each bar. We will revisit that fact when
we discuss the connections between the normal and the binomial below.
Given that meaningful probabilities for continuous variables are associated with
areas, it is reasonable to infer that we should talk about how to find areas under
the normal curve, and therefore how to find probabilities. That is true, but there
is just one problem:
Typically, if we want to find the area under a curve, we would use integral
calculus to come up with a formula. But, the formula for a normal curve –
f(x | μ, σ) = [1 / (σ√(2π))] e^(−(1/2)((x − μ)/σ)^2)
– is one of those formulas that simply does not have an integral that can be written down in closed form.
But that won’t stop us! Despite not being able to use a formula for the precise
integral, we can approximate the area under a curve to a high level of precision.
There are two tools available to us to find areas: using software and calculating
areas based on standardized scores.
For example, imagine a standardized test where the average score is 1000 and
the standard deviation of the scores is 100: the distribution of scores for this
hypothetical test is visualized in Figure 5.12. What is the probability that
somebody selected at random who took the test scored between 1000 and
1100?
Figure 5.12: Distribution of Scores for a Hypothetical Standardized Test With
μ = 1000 and σ = 100
First, let’s do this with the benefit of modern technology. The absolute easiest
way to answer the question is to use an online calculator like this one, where you
can enter the mean and standard deviation of the distribution and the value(s) of
interest. Slightly more difficult – but I don’t think much more – is to use built-in R
functions.
Recall from the discussion of using R’s built-in distribution functions for the
binomial that the cumulative probability for a distribution is given by the family
of commands with the prefix p. The root for the normal distribution is norm, so
we can use the pnorm() function to find probabilities of ranges of x values. The
pnorm() function takes the parameter values (x, mean, sd), where x is the
value of interest and mean and sd are the mean and standard deviation of the
distribution. As with the pbinom() command, the default value for pnorm() is
lower.tail=TRUE, so without specifying lower.tail, pnorm() will return the
cumulative probability of all values less than or equal to x.8 The mean of this distribution (as given in the
description of the problem above) is 1000 and the standard deviation is 100.
Thus, to find the probability of finding an individual with a score between 1,000
and 1,100 where the scores are normally distributed with a mean of 1,000 and a
standard deviation of 100, we can use pnorm(1100, 1000, 100) to find the
cumulative probability of a score of 1,100 and pnorm(1000, 1000, 100) to
find the cumulative probability of a score of 1,000, and subtract the latter from
the former:
pnorm(1100, 1000, 100) - pnorm(1000, 1000, 100)
[1] 0.3413447
Figure 5.13: Visual Representation of Finding the Area Between 1,000 and
1,100 Under a Normal Curve with μ = 1000 and σ = 100
The second way to handle this problem – the old-school way – is to solve it
using what we know about the area under one specific normal curve: the
standard normal distribution, which merits its own section header:
5.3.1 The Standard Normal Distribution
The standard normal distribution is the normal distribution with a mean of 0 and a standard deviation of 1.
The mean and the standard deviation of the standard normal distribution mean
that the x-axis for the standard normal represents precisely the number of
standard deviations from the mean of any value of x: x = 1 is 1 standard
deviation above the mean, x = −2.5 is 2.5 standard deviations below the
mean, etc. For a long time, the standard normal distribution was the only normal
distribution for which areas under the curve were easily available. There are
infinite possible normal distributions: one for each possible combination of
infinite possible means and infinite possible standard deviations (all greater than
0, but still infinite), and where there is no formula to determine the area under a
normal curve, it makes little sense to take the time to approximate the
cumulative probabilities associated with any set of those infinite possibilities.
Instead, the cumulative probabilities associated with the standard normal
distribution were calculated9: if one wanted to know the area under the curve
between two points in a non-standard normal distribution, one needed only to
know what those two points represented in terms of distance from the mean of
that distribution in units of the standard deviation of that distribution. Then,
one could (and still can) consult the areas under the standard normal curve to
know the area between the x values corresponding to the numbers of standard
deviations from the mean. To use our example of standardized test scores –
where the mean of the scores is 1,000 and the standard deviation is 100 – a score
of 1,100 is one standard deviation greater than the mean, and a score of 1,000 is
0 standard deviations from the mean (it’s exactly the mean). For the standard
normal distribution, the area under the curve for x < 1 – corresponding to a
value 1 standard deviation greater than the mean – is 0.84, and the area under the
curve for x < 0 – corresponding to values below the mean – is 0.5. Thus, since the
area under the standard normal curve for x = 0 → x = 1 is 0.84 − 0.5 = 0.34,
the area under the normal curve with μ = 1000 and σ = 100 between the mean
and a value 1 standard deviation above the mean is also 0.34.
5.3.1.1 z-scores
The value of knowing how many standard deviations from a mean a given value
lies in a normal distribution is so important that the distance between any x and
the mean in terms of standard deviations is given its own name: the z-score. A z-
score is precisely the number of standard deviations a value is from the mean of a
distribution. This is reflected in the z-score equation:
z = (x − μ) / σ
or
z = (x − x̄) / s
when we have the mean and standard deviation of a sample and not the entire
population. In other words, we take the difference between any value x of
interest and the mean of the distribution and divide that difference by the standard
deviation to know how many standard deviations away from the mean that x is.
The z-score is important for finding areas under the normal curve using tables
like this one, but by “important,” I mean “sort of important but much less
important in light of modern technology.” A key note on tables like the one
linked: to save space and improve readability, such tables usually don’t include
all z-scores and full cumulative probabilities, but rather z-scores greater than 0
and the cumulative probability between z and the mean and sometimes the
cumulative probability between z and the tail (which is to say, to infinity (but not
beyond)). To read those tables, one must keep two things in mind: (1) the normal
distribution is symmetric and (2) the cumulative probability of either half –below
the mean and above the mean – of the standard normal is 0.5 (and thus the total
area under the curve is 1).
For example – sticking yet again with the standardized test example – the z-score
for a test score of 1,100 is:
z = (x − μ) / σ = (1,100 − 1,000) / 100 = 1
In the linked table, we can look up 1.00 in Column A. Column B has the Area to
mean for that z-score: it’s 0.3413. That means that – for this distribution –
34.13% of the distribution is between the mean and a z-score of 1, or, between
1,000 and 1,100. To find the cumulative probability of all values less than or
equal to 1,100, we have to add the area that lives under the curve on the other
side of the mean, which is 0.5. Thus, the cumulative probability of a score of
1,100 is 0.3413 + 0.5 = 0.8413. The cumulative probability of scores greater
than 1,100 is given in Column C: in this case, it’s 0.1587. Please note that for
each row in the table, the sum of the values in Column B and in Column C is 0.5
because the area under the curve on either side of the mean is 0.5. For that
reason, some tables don’t include both columns, because if you know one value,
you know the other.
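The table lookup can also be reproduced with pnorm() on the standard normal distribution (a quick sketch; pnorm() defaults to a mean of 0 and a standard deviation of 1):

z <- (1100 - 1000) / 100
pnorm(z) - 0.5   # area between the mean and z = 1: 0.3413
pnorm(z)         # cumulative probability of z = 1: 0.8413
1 - pnorm(z)     # area in the upper tail: 0.1587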
A test score of 900, in contrast, has a z-score of z = (900 − 1,000) / 100 = −1.
The negative sign on the z-score indicates that the score is less than the mean.
The linked table is one of those that doesn’t include negative z-scores (as the
maker of the table: my deepest apologies). But that’s not a problem because the
normal distribution is symmetric: we check Column A for positive 1, and again
the area to the mean is 0.3413 and the area to the tail is 0.1587. In this case, the
area to the mean is the area under the curve from −1 up to the mean, and the area
in the tail is the cumulative probability between −1 down to negative infinity.
In addition to finding areas under curves – which as noted earlier isn’t quite as
important as it used to be given that we can use software to find those areas more
easily – the z-score is an important tool for making comparisons. Because the z-score
measures how many standard deviations a value lies from the mean of its own
distribution, we can use z-scores to show how typical or unusual a value is
relative to its peer values.
For example, imagine two Girl Scouts – Kimberly from Troop 37 and Susanna
from Troop 105 – who are selling cookies to raise funds to facilitate scouting-
type activities. Kimberly sells $200 worth of cookies and Susanna sells $175
worth of cookies. We might conclude that Kimberly was the better-performing
cookie-seller. However, perhaps Kimberly and Susanna are selling in different
markets – maybe Kimberly’s Troop 37 is in a much less-densely populated area
and she has fewer potential buyers. In that case, it would make sense to look at
the distribution of sales for each troop:
Figure: The Distribution of Sales for Each Troop (with standard deviations of $25 and $10, respectively)
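To make the comparison concrete, here is a sketch in R with made-up troop distributions (the means of $150 and the standard deviations of $25 and $10 used here are assumptions purely for illustration):

kimberly_z <- (200 - 150) / 25   # Kimberly: 2 SDs above Troop 37's (assumed) mean
susanna_z  <- (175 - 150) / 10   # Susanna: 2.5 SDs above Troop 105's (assumed) mean

With those assumed numbers, Susanna's total is actually the more unusual one relative to her own troop, even though it is smaller in raw dollars.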
You may have noticed that while the sales numbers for the above examples – the
respective sales for Kimberly and Susanna, the means, and the standard
deviations – had dollar signs indicating the units of sales, the z-scores had no
units. In the z-score formula, any units that are used cancel out in the numerator
and in the denominator. Thus, we can make comparisons between values from
distributions with different units, for example: height and weight, manufacturing
output and currency value, heart rate and cholesterol level, etc., which really
comes in handy when analyzing data with correlation and regression.
The tails of the normal curve come really, really close to touching the x-axis but
they never touch it. The statistical implication of this is that although the
probability of observing values many standard deviations away from the mean
can be extremely small, it is never impossible: the normal distribution excludes
no values.
Figure 5.16: The Tails Don’t Touch the Axis
Figure 5.17: Pictured: Kevin Garnett, moments after the conclusion of the 2008
NBA Finals, describing the range of x values in a normal distribution.
The 68-95-99.7 Rule is an old rule-of-thumb that tells us the approximate area
under the normal curve within one standard deviation of the mean, within two
standard deviations of the mean, and within three standard deviations of the
mean, respectively. With all we know about calculating areas under the curves,
the 68-95-99.7 rule is less important (we can always look those values up or
calculate them with software), but still helpful to know offhand.
Figure 5.19: The 68-95-99.7 Rule
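A quick check of those three values with pnorm():

pnorm(1) - pnorm(-1)   # 0.6827
pnorm(2) - pnorm(-2)   # 0.9545
pnorm(3) - pnorm(-3)   # 0.9973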
For example, the average height of adult US residents assigned male at birth is 70
inches, with a standard deviation of 3 inches, and the distribution of heights is
normal. If we wanted to know the percentile of height among adult US residents
assigned male at birth for somebody who is 6’2 – or 74 inches – tall, we would
take the cumulative probability of 74 in a normal distribution with a mean of 70
and a standard deviation of 3 either by using software:
pnorm(74, 70, 3)
[1] 0.9087888
or, equivalently, by calculating the z-score and using the table:

$$z_{74} = \frac{74 - 70}{3} = 1.33$$

$$\text{Total area} = \text{Area to mean} + \text{Area beyond mean} = 0.4082 + 0.5 = 0.9082$$
Then, we multiply the area by 100 and round to the nearest whole number (these
things are typically rounded) to get 91: an adult US resident assigned male at
birth who is 6’2" is in the 91st percentile for height. They are taller than 91% of
people in their demographic category.
As noted earlier, the normal distribution and the binomial distribution are closely
related. In fact, the limit of the binomial distribution as N goes to infinity is the
normal distribution. But, before we get to infinity, the normal distribution
approximates the binomial distribution. The rule-of-thumb is that the normal
approximation to the binomial can be used when N π > 5 and N (1 − π) > 5:
that is, the binomial distribution looks enough like the normal given some
combination of large N and π in the middle of the range (small N makes the
binomial blocky; pi near 0 or 1 makes the binomial more skewed).
Figure 5.21: Illustration of the Normal Approximation to the Binomial
With N = 100 flips and π = 0.5, the mean is Nπ = 50 and the standard deviation is √(Nπ(1 − π)) = 5, so we can treat the binomial distribution as a normal distribution with those values for the mean and standard deviation. The approximate probability of exactly 50 heads is then the area between 49.5 and 50.5 under that normal curve:

$$p(s = 50) \approx \text{Area}\left(\frac{50.5 - 50}{5}\right) - \text{Area}\left(\frac{49.5 - 50}{5}\right) = 0.0797.$$
The exact probability for 50 heads in 100 flips is dbinom(50, 100, 0.5) = 0.0796, so this was a pretty good approximation.
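In R, the approximation and the exact value line up nicely:

pnorm(50.5, 50, 5) - pnorm(49.5, 50, 5)  # 0.0797 (normal approximation)
dbinom(50, 100, 0.5)                     # 0.0796 (exact binomial probability)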
Sample means have a remarkable property: if you take enough sample means,
then the distribution of those sample means will have a mean equal to the mean of
the population that the data were sampled from and a variance equal to the
variance of the original population divided by the size of the samples. That
property is codified in the central limit theorem.
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

Here, x̄ represents sample means. The tilde (~) in this case stands for the phrase is distributed as. The N means a normal distribution with its mean and variance indicated in the parentheses that follow. μ is the mean of the population the samples came from, σ² is the variance of the population the samples came from, and n is the number of observations per sample. Put together, the central limit theorem says:

As the size of the samples becomes large, sample means are distributed as a normal with the mean equal to the population mean and the variance equal to the population variance divided by the number of observations in each sample.
And here’s a really wild thing about the central limit theorem:
Regardless of the shape of the parent distribution, the sample means will arrange themselves symmetrically. For positively-skewed distributions, there will be lots of samples from the heavy part of the curve but the values of those samples will be small; there will be few samples from the light part of the curve but the values of those samples will be large. It all evens out eventually.
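If you want to see that in action, here is a small simulation sketch (the exponential parent distribution and the sample size of 30 are arbitrary choices for illustration):

set.seed(123)
n <- 30
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))  # parent: exponential with mean 1, variance 1
mean(sample_means)   # close to the population mean of 1
var(sample_means)    # close to the population variance divided by n: 1/30 = 0.033
hist(sample_means)   # roughly symmetric despite the skewed parent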
The t statistic is calculated as:

$$t = \frac{\bar{x} - \mu}{\sqrt{\frac{\sigma^2}{n}}} = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}$$
The shape of the t distribution – the model for the distribution of the t statistic –
is given by the following formula:
$$f(t) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$
which, super-yikes, but it does show us that there is only one variable other than t in the equation: ν, also known as degrees of freedom (df). That means that df is the only sufficient statistic for the t distribution.11 The mean of the t distribution is 0, and the variance of the t is undefined for df ≤ 1, infinite for 1 < df ≤ 2, and df/(df − 2) for df > 2.
The χ² distribution is used to model squared differences from means. Figure 5.23 shows the shape of the χ² distribution for different degrees of freedom. The degrees of freedom of the χ² distribution are conceptually the same as the degrees of freedom we cite when talking about shapes of the t distribution: both are representations of things that are free to vary. When we are talking about sample means, the things that are free to vary are the observed values in a sample: if we know the sample mean and we know all but one of the values in the sample, then we know the last value; thus, n − 1 values in the sample are free to vary but the last one is fixed. When we are talking about the χ² distribution, the things that are free to vary are the squared differences being modeled. The probability density function of the χ² distribution is:
$$f(x; \nu) = \begin{cases} \dfrac{x^{\frac{\nu}{2} - 1}\,e^{-\frac{x}{2}}}{2^{\frac{\nu}{2}}\,\Gamma\left(\frac{\nu}{2}\right)} & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$
Like the t distribution, the only sufficient statistic for the χ² distribution is ν = df. The mean of the χ² distribution is equal to its df, and the variance of the χ² distribution is equal to 2df. The χ² distribution models statistics that are sums of squared differences, including:

1. The variance. The equation for the population variance is $\frac{\sum(x - \mu)^2}{N}$ and the equation for the sample variance is $\frac{\sum(x - \bar{x})^2}{n - 1}$. In both equations, the numerator is a sum of squared differences from a mean value. Thus, the variance is a statistic that is modeled by a χ² distribution.
2. The χ² statistic. It seems kind of obvious that the χ² statistic is modeled by a χ² distribution. The χ² statistic is:

$$\chi^2 = \sum\frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

We'll go over the details of the tests that use it later, but look at that numerator! It's a squared difference from an expected value, and that is the χ²'s time to shine, baby!
$$\text{mean}(\beta) = \frac{s + 1}{s + f + 2}$$
The relative frequency of wins is low for teams that score fewer than 75 points,
rises for teams that score between 75 and 125 points, and at 150 points, victory
is almost certain. This type of curve is well-modeled by the logistic distribution:
Figure 5.30: The Logistic Distribution
Figure 5.32 is a visualization of interval estimates for the central 95% under a standard normal distribution, a beta distribution with s = 5 and f = 2, and a χ² distribution.
To find the central 95% interval under each curve, we can use the distribution
commands in R with the prefix q for quantile. In each q command, we enter the
quantile of interest plus the sufficient statistics for each distribution (we can
leave out the mean and sd for the normal command because the default is the
standard normal).
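For example, the central 95% limits for a few of those distributions can be pulled with the q commands (the χ² df of 5 here is just an assumption for illustration):

qnorm(c(0.025, 0.975))                   # standard normal (mean and sd default to 0 and 1)
qbeta(c(0.025, 0.975), 5 + 1, 2 + 1)     # beta with s = 5, f = 2 (shape parameters s + 1 and f + 1)
qchisq(c(0.025, 0.975), df = 5)          # chi-square with an assumed df of 5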
1. *Unless your research happens to be about marbles, the jars in which they
reside, and the effect of removing them from their habitat, in which case:
you’ve learned all you need to know, and you’re welcome.↩
2. If π is exactly 0 or exactly 1, there really is no distribution: you would
either have no s or all s.↩
5. The fact that you have to use s − 1 for cumulative probabilities of all values
greater than s is super-confusing and takes a lot of getting used to.↩
6. Strong chance I just made up the word “quasitautological” but let’s go with
it.↩
7. Between this and the method of moments, it should be apparent that early statisticians were super into physics and physics-based analogies.↩
9. As with any mention of R.A. Fisher, it feels like a good idea to remind ourselves that Fisher was a total dick.↩
10. Well, now we can compute it. The normal approximation was a bigger deal
when computers weren’t as good and using formulas with numbers like 100!
was impossible to do with calculators.↩
11. There are variations on the t that have other parameters. Let’s not worry
about them for now.↩
12. Fisher, who again, was a total dick, proposed that samples of at least 30
could be considered to have their means distributed normally, and statistics
texts since then have reported 30 as the magic number. It’s just an
approximate rule-of-thumb designed for a world without calculators or
computers, and should be taken as a guideline but not a rule.↩
16 of them show signs of mutation (the fun kind not the sad kind).
Figure 6.2: Experimental Results (n = 4)
$$\hat{\pi} = \frac{16}{20} = 0.8$$
And what we can say about the population-level rate is at the heart of the fight
between proponents of relative-frequency-based statistical inference and
proponents of subjective-probability-based statistics – or, more simply,
between Classical Statisticians and Bayesian Statisticians.
Figure 6.3: Artist’s Rendering
This is both one of the simplest and one of the least believable inferences we
could make. It is simple because it technically does precisely what we are
trying to do with scientific experimentation – to study a sample and generalize
to the population – in the most straightforward possible way. However, this
approach would fail to withstand basic scientific scrutiny because (among
other reasons, but mainly) the sample of participants is highly unlikely to be
representative of the entire population. And, even if the sample were
somehow perfectly representative, it would also fail from the perspective of
probability theory. Consider, yet again, a coin flip.3 The probability of
flipping a perfectly fair coin 20 times and getting other than exactly 10 heads
is:
$$p(s \neq 10\,|\,N = 20, \pi = 0.5) = 1 - \frac{20!}{10!\,10!}(0.5)^{10}(0.5)^{10} = 0.8238029$$
The second approach takes the fact that a rate of 80% was observed in one
sample of size N = 20 and extrapolates that information based on what we
know about probability theory to make statements about the true rate. In this
specific case, we could note that 80% is a proportion, and that proportions
can be modeled using the β distribution. We can use the observed s and f –
16 and 4, respectively – as the sufficient statistics, also known in the case of
the β (and other distributions) as shape parameters, for a beta distribution.
The shape parameters for a β distribution are s + 1 and f + 1, also known as α and β where α = s + 1 and β = f + 1.4 Based on our experiment, we would expect this to be the probability density of proportions:
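For reference, a quick way to draw that density in R (using shape parameters s + 1 = 17 and f + 1 = 5):

curve(dbeta(x, 17, 5), from = 0, to = 1, xlab = "proportion", ylab = "density")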
Based on the β distribution depicted in Figure 6.4, we can determine that the
probability that the true rate is exactly 80% is 0 and this approach is useless.
No! That’s just a funny funny joke about the probability of any one specific
value in a continuous probability distribution. What fun!
Thus, we would be about 90% sure that the true value of π is within 0.15 of
0.8, and over 99.5% sure that the true value of π is greater than 0.5.
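Those two statements can be checked against the β distribution with shape parameters 17 and 5 (a quick sketch):

pbeta(0.95, 17, 5) - pbeta(0.65, 17, 5)   # approximately 0.90
pbeta(0.5, 17, 5, lower.tail = FALSE)     # approximately 0.996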
We could also use Bayes’s Theorem to determine various values of p(H |D) –
in this case, p(π|Data) – and would get the same results as using the β.
Approach #2 is the Bayesian Approach.
3. The true rate is a single value that cannot be known without measuring
the entire population: there is a chance that our results behaved
according to the true rate and a chance that our results were weird.
Therefore, the results of our experiment cannot prove or disprove
anything about the true rate. We can rule out – with some uncertainty –
some possibilities for the true rate if our results are unlikely to have been
observed given those possible true rates.
In this approach, the true rate is unknowable based on the results of a single
experiment, so we can’t make any direct statements about it. Instead, the
scientific question focuses on the probability of the data rather than the
probability of the hypothesis. This approach starts by establishing what the
expectations of the results would be in a scenario where nothing special at
all is happening. If the observed results – and anything even less likely than them – would be extremely unlikely to have happened given our hypothetical nothing-special-at-all assumption, then we have evidence to question the nothing-special assumption,5 and support for the idea that something special is indeed happening.
Going back to our example: we might reasonably assert a state of affairs in which the radioactive ooze is just as likely to cause a mutation as not to cause a mutation. In that case, p(mutate) = p(¬mutate) = 0.5:
essentially a coin flip. As above, we collect the data and find that 16 out of
the 20 turtles show signs of mutation. Given our assumption that the rate is
0.5, then the probability of observing 16 successes in 20 trials is:
$$p(s = 16\,|\,N = 20, \pi = 0.5) = \frac{20!}{16!\,4!}(0.5)^{16}(0.5)^{4} = 0.004620552.$$
That’s pretty unlikely! But, it can be misleading: for one, the probability of any specific s is pretty small (and vanishingly small for increasing N), and it’s not like observing 17, 18, 19, or 20 mutations instead would change our scientific theories about the experiment. So, we are instead going
to be interested in the probability of the number of observed s or more given
our assumption that π = 0.5:
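That cumulative likelihood can be found in R with pbinom() (following the same s − 1 convention noted in the footnotes):

pbinom(15, 20, 0.5, lower.tail = FALSE)  # 0.005909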
4. We cannot make statements about the true rate, but based on the observed
results and what we know about the distribution of the data (in this case,
the binomial distribution), we can estimate an interval for which a given
proportion of theoretical future samples will produce similar interval
estimates that contain the true rate (whatever it may be).
As in approach #3, we assume that the true rate is a fixed value that cannot
vary and cannot be known without measuring the entire population at all
points in time. But, given that we observed 16 successes in 20 trials, we can
make some statements about what future replications of the experiments will
show. For example, we know that the true rate is not 0 or 1: otherwise, all of
our trials would have resulted in failure (if π = 0) or in success (if π = 1).
Let’s say, for example, that we want to produce an interval such that 95% of
future repetitions of our experiment will produce 95% intervals that include
the true rate (whatever it is). The lowest possible value of π in that interval is
going to be the value that leads to the cumulative likelihood of the observed s
or greater being 2.5%, and the highest possible value of π is going to be the
value that leads to the cumulative likelihood of the observed s or less being
2.5%. Thus, the cumulative probability of the data in the middle is 1 − 2.5% − 2.5% = 95%.
For N > 20, there are algebraic methods for estimating the width of this
interval,6, but if software is available, the best method regardless of the size
of N is known as the Clopper-Pearson Exact Method. Without software, the
Exact Method is an interminable ordeal of trial-and-error to get the lower and
upper values of π, but it’s easy enough with software:
library(binom)
binom.confint(16, 20, 0.95, methods="exact")
The lower limit of the interval is 0.563386, and the cumulative probability of
s ≥ 16 is:
pbinom(15, 20, 0.563386, lower.tail=FALSE) = 0.025
The upper limit of the interval is 0.942666, and the cumulative probability of s ≤ 16 is:

pbinom(16, 20, 0.942666) = 0.025
The confidence interval approach is, like Null Hypothesis Testing, a Classical
approach.
The likelihood is the conditional probability of the observed data given the
hypothesis. Typically, the likelihood is not a single value but a likelihood
function, as Bayesian analyses are often based on evaluating a range of
possibilities that need to be fed into a function. In an example like the
mutating-turtle experiment described above where the hypothesis is about a
rate parameter, the likelihood function would be the binomial likelihood
function. In other studies, the likelihood function might be based on other
probability density functions like that of the normal distribution. Like the
prior, there is some expertise required to choose the proper likelihood
function (in my experience, this is the part that Bayesians argue the most about
with other Bayesians).
The base rate is the overall probability of the data under the conditions of all
hypotheses under consideration. It’s actually the least controversial term of
Bayes’s Theorem: once the prior and the likelihood are established, the base
rate is just the normalizing factor that makes the sum of all the posterior
probabilities of all the hypotheses under consideration equal to 1.
Classical approaches focus on the likelihood of the data and typically use likelihood functions (that give rise to probability distributions) that describe
the distribution of the data given certain assumptions. The most common such
assumption is that data are sampled from normal distributions. For example,
we may assume that the data are sampled from a standard normal distribution.
If an experiment results in an observed value with a z-score of 5, then the
likelihood of sampling that value from the standard normal is pnorm(5, lower.tail=FALSE) = 2.8665157 × 10⁻⁷: not impossible, but seriously unlikely. In fact, that is so unlikely that we would reject the assumption that
this value came from the standard normal in favor of the idea that the value
came from some other normal distribution.
Classical approaches have been around so long and have become so widely
used that the procedures have been made computationally simple relative to
Bayesian approaches.9 Part of that relative simplicity is due to the
assumptions made about the probability distributions of the data. Another part
of the relative simplicity is that most of the most complicated work – for
example, all of the calculus that goes into estimating areas under the standard
normal curve – has been done for us. If you ever happen to read some of the
seminal papers in classical statistics (1/10 I can’t recommend it), you may
note the stunning mathematical complexity of the historical work on the
concepts. Classical statistical methods have been made easier (again,
relatively speaking) to meet the demands of the overwhelming growth of
applications of the field over the past century or so. Bayesian methods might
get there eventually.
The decades of internecine fighting among statisticians belie the fact that the results usually aren’t wildly different between Bayesian and Classical analyses of the same data. Bayesian analyses tend to produce more precise estimates, either because they don’t have to contend with the sample size required to reject a null hypothesis or because they provide more concise interval estimates.
Bayesian approaches are more flexible in that one can construct their own
evaluation of posterior probabilities to fit their study design needs. Classical
approaches are in a sense easier in that regard because their popularity has
led to the development of tools for almost any experimental design that a
behavioral scientist would use.
The null and alternative hypotheses are both statements about populations.
They are, more specifically than described above, statements about whether
nothing (in the case of the null) or something (in the case of the alternative) is
going on in the population. In the case of an experiment designed to see if there are differences between the mean results under two conditions, for example, the null hypothesis might be that there is no difference between the population means for each condition, stated as μ_condition 1 − μ_condition 2 = 0 (note the use of Greek letters to indicate population means). That means that if the entire world population were subjected to each of these conditions and we were to measure the population-level means μ_condition 1 and μ_condition 2, the difference between the two would be precisely 0. However, even if the null hypothesis μ_condition 1 − μ_condition 2 = 0 were true, it would be extraordinary to observe a sample difference of exactly 0: even when two groups are samples from the same distribution, it’s almost impossible that the mean of both of those groups will be exactly equal to each other.
Thus, it is pretty much a given that two samples will neither be completely
uncorrelated with each other nor that two samples will be completely
identical to each other. But, based on a combination of how correlated to or
how different from each other the samples are, the size of the samples, and the
variation in the measurement of the samples, we can generalize our sample
results to determine whether there are correlations or differences on the
population level.
The null, and by extension, the alternative hypotheses we choose can indicate
either a directional hypothesis or a point hypothesis. A directional
hypothesis is one in which we believe that the population parameter will be
either greater than the value indicated in the null hypothesis or less than the
value indicated in the null hypothesis but not both. For example, consider a
drug study where the baseline rate of improvement for the condition to be
treated by the drug in the population is 33%. In that case, 33% would be a
reasonable value for the rate in the null hypothesis: if the results from the
sample indicate that the success rate of the drug in the population would be
33%, that would indicate that the drug does nothing. If, in a large study with
lots of people, the observed success rate were something like 1%, there might
be a very small likelihood of observing that rate or smaller rates, but the
scientists developing that drug certainly would not want to publish those
results. Instead, that would be a case suited for a directional hypothesis that
would lead to rejection of the null hypothesis if the likelihood of the observed
successes or more was less than would be expected with a rate of 33%. The
null and alternative hypotheses for that example might be:
H 0 : π ≤ 0.33
H 1 : π > 0.33
H 1 : π ≠ 0.5
The basic logic of the six-step procedure is this: we start by defining the null
and alternative hypotheses and laying out all of the tests we are going to do
and standards by which we are going to make decisions.
Rejecting the null hypothesis means that we have assumed a world where
there is no effect of whatever we are investigating and have analyzed the
cumulative likelihood of the observed data in that context and come to the
conclusion that it’s unlikely – based on the observed data – that there really is
no effect of whatever we are investigating. Given that we have rejected the
null, then the alternative is all that is left. In cases where we reject the null
hypothesis, we may also say that there is a statistically significant effect of
whatever it is we are studying.
Please note that in the preceding paragraphs there is no mention of the terms
prove or accept or any form, derivation, or synonym of either of them. In the
Classical framework (and in the Bayesian framework, too), we never prove
anything, neither do we ever accept anything. That practice is consistent with
principles of early-20th century philosophy of science in that no scientific
theory is ever considered to be 100% proven, and it’s also consistent with the
notion that statistical findings are always, in some way, probabilistic rather
than deterministic in nature. The rejection of a null hypothesis, for example, is
not so much there’s no way the null can be true as the null is unlikely
enough given the likelihood threshold we have set.
6.1.2.1 p-values
The α-rate is also known as the false alarm rate or the Type I error rate. It is
the rate of long-run repetitions of an experiment that will falsely reject the null
hypothesis. No research study in the behavioral sciences produces definitive
results: we never prove an alternative hypothesis. Thus, we have to have
some standard for how small a p-value is to declare the data (cumulatively)
unlikely enough to reject the null hypothesis. Whatever value for that threshold
that we choose is also going to be the rate at which we are wrong about
rejecting the null.
Let’s say we choose an α rate of 0.05. Figure 6.6 is a representation of a
standard normal distribution with the most extreme 5% of distribution
highlighted in the right tail. The distribution represents a null hypothesis that
the population mean and standard deviation are 0 and 1, respectively, and the
shaded area indicates all values that have a cumulative likelihood of 0.05 or
less. Any result that shows up in the shaded region (that is, any value greater than or equal to 1.645) will lead to rejecting the null hypothesis: the shaded area is what is known as a rejection region.
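In R, that cutoff comes straight from the quantile function:

qnorm(0.95)  # 1.644854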
The traditional choice of α = 0.05 was partly informed by how easy it made calculations that had to be done using a slide rule.15 It’s still common today largely because it has been handed down from stats teacher to stats teacher through the generations.
Choosing α = 0.05 means that, on average, 5% of results declared
statistically significant will be false alarms. It may not surprise you to learn
that psychologists are increasingly tending towards choosing more stringent –
i.e., smaller – α rates.
The complement to the Type-I error is the type-II error, also known as the β
error, or a miss. A type-II error occurs when there is a real, population-level
effect of whatever is being tested. Figure 6.7 depicts a situation where the
population distribution is different from the distribution represented by the
null hypothesis. Whenever a value is sampled from this population
distribution that is less than the values indicated by the rejection region under
the null distribution, we will continue to assume the null hypothesis and be
wrong about it.
Figure 6.7: Normal Distribution with an Upper-tail 5% Rejection Region and
an Alternative Distribution with Highlighted Misses
The long-term rate of misses is known as the β rate. The complement of the β
rate – or 1 − β, is the rate of times when the null hypothesis will be correctly
rejected. In the sense that it is the opposite of the miss rate we might call that
the hit rate, but it is usually referred to as the power.
The table below indicates the matrix of possibilities based on whether or not there is a real, population-level effect (columns) and what the decision is regarding H0 (rows).

                          Real Effect?
Decision                  Yes                         No
Reject H0                 Correct Rejection of H0     False Alarm
Continue to Assume H0     Miss                        H0 Correctly Assumed
The six steps of the Classical null hypothesis testing procedure are:

1. State the null and alternative hypotheses
2. Choose the α-rate
3. Identify the statistical test to be used
4. Identify a rule for deciding between the null and the alternative hypotheses
5. Collect the data and do the math
6. Make a decision
In step 1, we state the null and the alternative. In the running example for this
page (turtles and ooze), we will adopt a directional hypothesis: we will reject
the null if the likelihood of the observed data or larger unobserved data is less
than our α rate (but not if the observed data indicate a rate less than what is
specified in the null). Thus, our null and alternative are:
H0: π ≤ 0.5
H1: π > 0.5
In step 2, we define our α-rate. Let’s save the breaks from tradition for a less
busy time and just say α = 0.05.
In step 3, we indicate the statistical test we are going to use. Since we have
binomial data, we will be doing a binomial test.
In step 4, we lay out the rules for whether or not we will reject the null. In this
case, we are going to reject the null if the cumulative likelihood of the
observed data or more extreme (in this case, larger values of s) data is less
than the α-rate (which, as declared in step 2, is α = 0.05). Symbolically speaking, we can write out our step 4 as:

Reject H0 if p(s ≥ s_observed | N = 20, π = 0.5) < 0.05
In step 5, we get the data and do the math. For our example, the data are:
s = 16, N = 20
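A sketch of that math in R is a one-tailed binomial test:

binom.test(16, 20, p = 0.5, alternative = "greater")

The p-value from that test (about 0.006) is less than the α of 0.05 declared in step 2, which would lead us, in step 6, to reject the null.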
The term 1 − α refers to the complement of the false alarm rate α used in a statistical analysis. If the false alarm rate α = 0.05, then the Classical interval estimate is a
1 − 0.05 = 95% confidence interval. If α = 0.01, then the estimate is a
1 − 0.01 = 99% confidence interval. Just as 0.05 is the most popular α rate,
95% is the most popular value for the Classical confidence interval. The
value of the lower limit of the 95% confidence interval (0.563386) is the value of π for which p(s ≥ 16) = 2.5%. The value of the upper limit of the
95% confidence interval (0.942666) is the value of π for which
p(s ≤ 16) = 2.5% .
$$p(H|D) = \frac{p(H)\,p(D|H)}{p(D)}$$
As described in the page on probability theory, there are four parts to Bayes’s
Theorem: the prior probability, the likelihood, the base rate, and the posterior
probability. Using the turtle-mutating ooze experiment example, the next
section will illustrate how we use Bayes’s Theorem to investigate a scientific
hypothesis.
prior probability p(H_i) distribution that makes all 11 equally likely, as shown in Figure 6.8:

Figure 6.8: Prior Probability Distribution for π_i
Next, let’s calculate the likelihood p(D|H) for each of the 11 possible values of π_i using the binomial likelihood function:

$$p(s = 16\,|\,N = 20, \pi = \pi_i) = \frac{20!}{16!\,4!}(\pi_i)^{16}(1 - \pi_i)^{4}$$
The base rate p(D) is the value that will make all of the posterior probabilities that we calculate for each π_i sum to 1. To get that, we take the sum, across all 11 candidate values, of the prior times the likelihood:

$$p(D) = \sum_{i=1}^{11} p(H_i)\,p(D|H_i)$$

We then take each prior-times-likelihood value and divide it by p(D) (which is the constant 0.0434) to get the posterior probability distribution shown in Figure 6.10:
Figure 6.10: Posterior Probability Distribution for π_i
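A quick way to reproduce that grid calculation in R (a sketch, using the 11 candidate values of π):

pi_i <- seq(0, 1, by = 0.1)
prior <- rep(1/11, 11)                  # flat prior: all 11 values equally likely
likelihood <- dbinom(16, 20, pi_i)      # binomial likelihood of 16 successes in 20 trials
p_D <- sum(prior * likelihood)          # base rate: about 0.0434
posterior <- prior * likelihood / p_D   # posterior probabilities, which sum to 1
sum(posterior)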
And, wouldn’t you know it? That looks an awful lot like the β distribution for s = 16 and f = 4 (see Figure 6.4 above). In fact, if we divided the [0, 1] interval more finely and took 101 values of π_i, it would look even more like the β; with infinitely many values, it would be the β exactly. The β distribution is the conjugate function for binomial data, meaning that it’s the result of Bayes’s theorem when you have binomial data. As a result, binomial data are pretty much the easiest thing to analyze with Bayesian methods.
prime”) and f ′′ (“f double prime”). In our example, we want to start with a
prior probability that indicates that everything is equally likely. To do so, we
will use a flat prior (the Bayesian term for a uniform prior distribution),
specifically, the β distribution with s′ = 0 and f ′ = 0.17
To our prior β shape parameters s′ and f ′, we add our experimental data s and f to get the posterior shape parameters:

s′′ = s′ + s = 0 + 16 = 16
f ′′ = f ′ + f = 0 + 4 = 4

$$var(\pi) = \frac{\pi(1 - \pi)}{s'' + f'' + 3} = 0.0061$$

$$sd(\pi) = \sqrt{\frac{\pi(1 - \pi)}{s'' + f'' + 3}} = 0.078.$$
To calculate the HDI (or credible interval, if you prefer. I don’t.), we’re going
to need some software help.19 We can install the R package HDInterval to
find HDI limits:
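A call along these lines (a sketch: the shape parameters 17 and 5 come from our posterior β) produces output like the block below:

library(HDInterval)
hdi(qbeta, credMass = 0.95, shape1 = 17, shape2 = 5)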
## lower upper
## 0.6000603 0.9303378
## attr(,"credMass")
## [1] 0.95
Thus, the tightest possible range that defines 95% of the area under the β
curve for s = 16, f = 4 has a lower limit of 0.60 and an upper limit of 0.93.
By convention, we report Bayesian intervals as:
In those cases, Bayesians often turn to Markov Chain Monte Carlo methods to
estimate posterior probability distributions on parameters. The most broadly
useful Monte Carlo method for estimating probability distributions is the
Metropolis-Hastings Algorithm.
Figure 6.11: Can’t stop, won’t stop
1. First, we pick a starting value for π – say, 0.5 (can’t be too far off to start if we start in the middle of the range).
$$r = \frac{p(D|\pi_2)}{p(D|\pi_1)}$$
between the probability density using the new parameter and the old
parameter. For our data, the probability densities are given by:
$$p(D|\pi_i) = \frac{20!}{16!\,4!}\pi_i^{16}(1 - \pi_i)^{4}$$
new π₁. We’ll write it down (with software, of course) before we move on.

b. Then, if the ratio r is > u, then we will accept the new parameter π₂.
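To make that concrete, here is a bare-bones Metropolis sampler sketch for this example (the proposal standard deviation of 0.1, the 10,000 iterations, and the flat prior are assumptions for illustration):

set.seed(123)
n_iter <- 10000
chain <- numeric(n_iter)
chain[1] <- 0.5                                   # start in the middle of the range
for (i in 2:n_iter) {
  current  <- chain[i - 1]
  proposal <- rnorm(1, current, 0.1)              # propose a new pi near the current one
  if (proposal <= 0 || proposal >= 1) {           # reject impossible proportions outright
    chain[i] <- current
    next
  }
  r <- dbinom(16, 20, proposal) / dbinom(16, 20, current)  # likelihood ratio (flat prior)
  u <- runif(1)
  chain[i] <- if (r > u) proposal else current    # accept the new parameter if r > u
}
mean(chain)   # close to the beta(17, 5) posterior mean of about 0.77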
Let’s posit a hypothetical model H₁. The odds in favor of the prior H₁ are p(H₁)/(1 − p(H₁)). If there are two models – H₁ and H₂ – then the prior probability of H₂ is equal to 1 − p(H₁), and the prior odds can be written as p(H₁)/p(H₂).

Likewise, the posterior odds in favor of H₁ given the data are p(H₁|D)/(1 − p(H₁|D)); the probability of H₂ given the data D is equal to 1 − p(H₁|D), and so the posterior odds can be written as p(H₁|D)/p(H₂|D).
Using Bayes’s Theorem, we can derive the following relationship between the prior odds and the posterior odds (note that p(D) cancels out in the numerator and denominator – that’s why it’s missing):

$$\frac{p(H_1|D)}{p(H_2|D)} = \frac{p(D|H_1)}{p(D|H_2)} \times \frac{p(H_1)}{p(H_2)}$$

The Bayes Factor is the factor by which the likelihood of model H₁ increases the posterior odds from the prior odds relative to the likelihood of model H₂:

$$B.F. = \frac{p(D|H_1)}{p(D|H_2)}$$
All things being equal, the integrated likelihood of a more complex model
will be smaller than the integrated likelihood of a simpler model – a model
with more parameters will stretch its likelihood across a much larger space.
That means that the Bayes Factor naturally favors simpler models.
By convention, the larger likelihood goes in the numerator so that the Bayes Factor is always reported as being ≥ 1: the larger the Bayes Factor, the greater the evidence in favor of the more likely model. There are two sets of guidelines for
interpreting Bayes Factors. Both are kind of arbitrary. I don’t have a
preference.
For our example, let H₁ be the hypothesis that π = 0.8 and H₂ be the hypothesis that π = 0.5. Because the likelihood functions of the two hypotheses (both binomial likelihood functions) differ only by the parameter value, a simple likelihood ratio gives the Bayes Factor:
$$B.F. = \frac{p(D|H_1)}{p(D|H_2)} = \frac{\frac{20!}{16!\,4!}(0.8)^{16}(0.2)^{4}}{\frac{20!}{16!\,4!}(0.5)^{16}(0.5)^{4}} = \frac{(0.8)^{16}(0.2)^{4}}{(0.5)^{16}(0.5)^{4}}$$

$$B.F. = 47.22366$$
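That ratio can be computed directly with the binomial density function (the binomial coefficients cancel):

dbinom(16, 20, 0.8) / dbinom(16, 20, 0.5)  # 47.22366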
1. Please note that the little hat thing (ˆ) is supposed to be right on top of π.
Html, for some reason, doesn’t render that correctly.↩
6. Two such methods that we will talk about in future content are the
Wilson Score Interval and the Normal Approximation (or Asymptotic)
Method.↩
10. On the relatively rare occasions in which Classical analyses can’t rely
on assumptions about data distributions, they can become super
complex↩
11. The term hypothesis testing is generally broad and can include any
scientific inquiry where an idea is tested with observations and/or
experimentation. In the context of statistics, however, hypothesis testing
has come to be synonymous with classical null hypothesis testing and is
used as a shorthand for that approach.↩
12. I should say not etched in stone yet but keep refreshing this link.↩
14. I have to confess that all I know about turtle mutation comes from
watching cartoons (and movies and also playing both video games) as a
child and based on that I would guess that the mutation rate of
radioactive ooze is 100%.↩
15. If you have a moment, you’re gonna wanna click on that slide rule link.↩
16. We use π = 0.5 because that is the number in the statement of the null
hypothesis.↩
18. Or in the other terms used to describe the shape parameters of the β,
α = s′′ + 1 = 17, β = f ′′ + 1 = 5.↩
19. There are some HDI tables, but as with all tables, they are limited to a
certain amount of possible parameter values.↩
The above chart comes from Tyler Vigen’s extraordinary Spurious Correlations
blog. It’s worth your time to go through the charts there and see what he has
found in what must be an enormous wealth of data.
Correlation does not imply causation is literally one of the oldest principles in
the practice of statistics. It is parroted endlessly in introductory statistics
courses by me and people like me. It is absolutely true that if we look through enough pairs of variables, we will find some pairs of things that are totally unrelated – like cheese consumption and civil engineering PhD completion – that appear to be connected.
It is also true – and probably more dangerous than coincidence – that there are
some variables that correlate not because one causes the other but because a
third variable causes both. Take, for example, the famous and probably
apocryphal example of the correlation between ice cream sales and violent
crime: the idea there is that ambient temperature is correlated with both ice
cream consumption and crime rates. Now, the ice cream part of it is most likely
just humorous conjecture, and there have been real studies that link heat with aggression, but the link between crime rates and outdoor temperatures is less likely due to heat causing violence than to higher rates of social interaction in warmer weather that create more opportunities for interpersonal violence.
The decline of sales of digital cameras over the past decade is a well-
researched trend. Here, here’s the research:
Or maybe not. But something led to the decline of sales of digital cameras.
There is another annual trend that correlates inversely with sales of digital
cameras:
Again, correlation does not imply causation is a true statement. But it is
simultaneously true that when there is causation, there is also correlation.
Correlation techniques – more specifically, regression techniques – are the
foundation of most classical and some Bayesian statistical approaches, and as
such are widely used in controlled experimentation: the gold standard of
establishing causation. So, yes, correlation doesn’t by itself establish cause-
and-effect, but it is a good place to start looking.
7.2 Correlation
7.2.1 The Product Moment Coefficient r
The absolute value of r is a measure of how closely the two variables are
related. Absolute values of r that are closer to 0 indicate that there is a loose
relationship between the variables; absolute values of r that are closer to 1
indicate a tighter relationship between the variables. Jacob Cohen1 recommends the following guidelines for interpreting the absolute value of r: values around 0.1 indicate a small effect, values around 0.3 a medium effect, and values of 0.5 or more a large effect.
Depending on the direction of the test, the null and alternative hypotheses for a correlation take one of these forms:

H0: r ≤ 0, H1: r > 0 (one-tailed, expecting a positive correlation)
H0: r ≥ 0, H1: r < 0 (one-tailed, expecting a negative correlation)
H0: r = 0, H1: r ≠ 0 (two-tailed)
Critical values of r can be found by consulting tables like this one. The critical values of r are the smallest values, given the df – which for a correlation is equal to n − 2 – that represent a statistically significant result given the desired α-rate and type of test (one- or two-tailed).
Like most parametric tests in the Classical framework, the results of parametric correlation are based on assumptions about the structure of the data (the assumptions of frequentist parametric tests will get their own discussion later). The main assumption to be concerned about3 is the assumption of normality.
The normality assumption is that the data are sampled from a normal
distribution. The sample data themselves do not have to be normally
distributed – a common misconception – but the structure of the sample data
should be such that they plausibly could have come from a normal distribution.
If the observed data were not sampled from a normal distribution, the
correlation will not be as strong as it would if they were. If there are any
concerns about the normality assumption, the best and easiest way to proceed is
to back up a parametric correlation with one of the nonparametric correlations
described below: usually Spearman’s ρ, sometimes Kendall’s τ , or, if the data
are distributed in a particular way, Goodman and Kruskal’s γ.
The secondary assumption for the r statistic is that the variables being
correlated have a linear relationship. As covered below, r is the coefficient for
a linear function: that function does a much better job of modeling the data if the data scatter linearly about the line.
If the data are not sampled from a normal distribution and/or linearly related, then r just won’t work as well. It’s not that the earth beneath the analyst opens up and swallows them whole when assumptions of classical parametric tests are violated: it’s just that the tests don’t work as intended, leading to type-II errors (if anything).
Where were we? Right, the data we are going to use for our examples. It’s a
generic set of n = 10 observations numbered sequentially from 1 → 10, 10
observations for a generic x variable, and 10 observations for a generic y
variable.
n x y
1 4.18 1.73
2 4.70 1.86
3 6.43 3.61
4 5.04 2.09
5 5.25 2.00
6 6.27 4.07
7 5.34 2.56
8 4.21 0.33
9 4.17 1.49
10 4.67 1.46
Taking the sum of the rightmost column and dividing by n − 1 , we get the r
statistic:
$$r = \frac{\sum_{i=1}^{n} z_{x_i} z_{y_i}}{n - 1} = 0.9163524$$
We can check that math using the cor.test() command in R (which we could
have just done in the first place):
cor.test(x, y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 6.4736, df = 8, p-value = 0.0001934
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6777747 0.9803541
## sample estimates:
## cor
## 0.9163524
where

$$SS_x = \sum_{i=1}^{n}(x - \bar{x})^2$$

and

$$SS_y = \sum_{i=1}^{n}(y - \bar{y})^2$$

For our example data:

$$SS_x = \sum(x - \bar{x})^2 = 6.01504$$

$$SS_y = \sum(y - \bar{y})^2 = 10.4878$$
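Those sums of squares are quick to verify in R (assuming the x and y vectors hold the example data):

sum((x - mean(x))^2)  # 6.01504
sum((y - mean(y))^2)  # 10.4878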
The r statistic is the standard value used to describe the correlation between
two variables. It is based on the assumptions that the data are sampled from a
normal distribution and that the data have a linear relationship. If the data
violate those assumptions – or even if we are concerned that the data might
violate those assumptions – nonparametric correlations are a viable
alternative.
x y
1 1
2 2
3 3
4 4
5 5
6 6
Please notice that x and y are perfectly correlated: in fact, they are exactly the
same. What is the probability of that happening? To answer that, we need to
know how many possible combinations of x and y there are, so we turn to
permutation.
To find the number of combinations of x and y, we can leave one of the
variables fixed as is and then find out all possible orders of the other variable:
that will give us the number of possible pairs. That is: if we leave all the x
values where they are, we just need to know how many ways we could shuffle
around the y values (or we could leave y fixed and shuffle x – it doesn’t
matter). The number of possible orders for the y variable (assuming we left x
fixed) is given by:
$$_{n}P_{r} = {}_{6}P_{6} = \frac{6!}{0!} = 720$$
So, there are 720 possible patterns for x and y. The probability of this one
pattern is therefore 1/720 = 0.0014. If we have a one-tailed test where we are
expecting the relationship between x and y to be positive, then the observed
pattern of the data is the most extreme pattern possible (can’t get more positive
agreement than we have), and 0.0014 is the p-value of a nonparametric
correlation (specifically, the Spearman correlation).
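Those one- and two-tailed p-values can be generated with cor.test() calls along these lines (a sketch; the outputs shown below presumably came from calls like these):

cor.test(1:6, 1:6, method = "spearman", alternative = "greater")  # one-tailed
cor.test(1:6, 1:6, method = "spearman")                           # two-tailed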
##
## Spearman's rank correlation rho
##
## data: 1:6 and 1:6
## S = 0, p-value = 0.001389
## alternative hypothesis: true rho is greater than 0
## sample estimates:
## rho
## 1
##
## Spearman's rank correlation rho
##
## data: 1:6 and 1:6
## S = 0, p-value = 0.002778
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 1
7.2.3.1 Spearman’s ρ
The ρ is the r of the ranks: rho is also known as the rank correlation. There are
fancy equations to derive ρ from the ranks, but I have no idea why you would need to use them instead of just correlating the ranks of the data.
Returning to our example data, we can rank each value relative to the other
values of the same variable from 1 to n. Usually the ranks go from smallest to
largest, but it doesn’t really matter so much as the same convention is followed
for both the x and the y values. In the case of ties, take the average of the ranks
above and below the tie cluster. If, for example, the data were {4, 7, 7, 8}, the
ranks would be {1, 2.5, 2.5, 4}.
x y rank(x) rank(y)
4.18 1.73 2 4
4.70 1.86 5 5
6.43 3.61 10 9
5.04 2.09 6 7
5.25 2.00 7 6
6.27 4.07 9 10
5.34 2.56 8 8
4.21 0.33 3 1
4.17 1.49 1 3
4.67 1.46 4 2
We then could proceed to calculate the correlation r between the ranks. Or,
even better: we can calculate the ρ correlation for the observed data:
cor.test(x, y, method="spearman")
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 20, p-value = 0.001977
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8787879
And note that the ρ is the same value as if we calculated r for the ranks:5
##
## Pearson's product-moment correlation
##
## data: xrank and yrank
## t = 5.2086, df = 8, p-value = 0.0008139
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5577927 0.9710980
## sample estimates:
## cor
## 0.8787879
7.2.3.2 Kendall’s τ
The τ for the example data (the original example data, not the data for Figure
7.3) is:
cor.test(x, y, method="kendall")
##
## Kendall's rank correlation tau
##
## data: x and y
## T = 39, p-value = 0.002213
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.7333333
There is not a lot said about the differences between Classical vs. Bayesian
approaches to correlation and regression on this page, mainly because this is an
area where Bayesians don’t have many snarky things to say about it. Bayesians
may take issue with the assumptions or with the way that p-values are
calculated – and both of those potential problems are addressed in the Bayesian
regression approach discussed below – but they are pretty much down with the
concept itself.
Kendall’s τ:

$$\phi_c = \frac{\tau + 1}{2}$$

and Figure 7.4 shows how the β distribution of ϕ_c relates to scatterplots of data.
Algebra Grade
A B C D F
Social Studies Grade A 6 2 1 2 0
Social Studies Grade B 2 6 2 1 1
Social Studies Grade C 1 1 7 1 1
Social Studies Grade D 1 1 0 6 3
Social Studies Grade F 0 0 0 0 5
As an example, the data represented in Figure 7.5 are the same example data
that we have been using throughout this page, except that the y variable has been
exponentiated: that is, we have raised the number e to the power of each value
of y
We can still calculate r for these data despite the relationship now looking vaguely non-linear: the correlation between x and e^y is 0.86. However, that obscures the relationship between the variables.8 In this case, we can take the natural logarithm ln(e^y) to get y back, and then calculate an r of 0.92 as we did before. We would then report the nonlinear correlation r, or report r as the correlation between one variable and the transformation of the other.
The only thing that we can’t do is transform a variable, calculate r, and then
report r without disclosing the transformation. That’s just cheating.
A couple of caveats before we move on: some patterns aren’t as easy to spot as
logarithms (or, squares or sine waves), and most nonlinear correlations are
better handled by multiple regression (sadly, beyond the scope of this course).
7.3 Regression
Correlation is an indicator of a relationship between variables. Regression
builds on correlation to create predictive models. If x is our predictor variable
and y is our predicted variable, then we can use the correlation between the
observed values x and y to make a model that predicts unobserved values of y
based on unobserved values of x.
$$\hat{z}_y = r z_x$$

Figure 7.6 is a scatterplot of the z-scores for the x and y variables in the example introduced above. The blue line is the least-squares regression line in standardized (z-score) form. Please note that when z_x = 0, then r z_x = ẑ_y = 0: the standardized line passes through the point where both variables are at their means.
ŷ = ax + b
where a is the slope of the raw-score form of the least-squared regression line
and b is the y-intercept of the line (that is, the value of the line when x = 0 and
the line intersects with the y-axis).
We get the raw-score equation by converting the z-score form of the equation.
Wait, actually, 99.9% of the time we get the raw-score equation using software
in the first place, but if we did need to calculate the raw-score equation by
hand9, we would first find r, which would give us the standardized equation ẑ_y = r z_x, which we would then convert to the raw-score form via the following steps:
The intercept (of any equation in the Cartesian Plane) is the point at which a
line intersects with the y-axis, meaning that it is also the point on a line where
x = 0. Thus, our first job is to find what the value of y is when x = 0. We don’t
yet have a way to relate x = 0 to any particular y value, but we do have a way
to connect z to z : the standardized regression equation zˆ = rz .
x y y x
First, we can find the value of z_x for x = 0 using the z-score equation z = (x − x̄)/sd. With the mean of x (5.026) and the standard deviation of x (0.8175), z_x = (0 − 5.026)/0.8175 = −6.147867. Plugging that into the standardized regression equation:

$$\hat{z}_y = r z_x = (0.9163524)(−6.147867) = −5.633613$$
And then, using the z-score formula again and knowing that the mean of y is
2.12 and the sd of y is 1.0794958, we can solve for y:
$$−5.633613 = \frac{y − 2.12}{1.079496}$$

$$y = −3.961463$$
We know then that the raw-score least-squares regression line passes through
the point (0, −3.961463), so the intercept b = −3.961463.
It takes two points to find the slope of a line. We already have one – the y-
intercept (0, −3.961463) – and any other one will do. I prefer finding the point
of the line where x = 1, just because it makes one step of the math marginally
easier down the line.
Following the same steps for x = 1 gives the point (1, −2.751461). The slope is then the rise over the run (Δy/Δx):

$$a = \frac{\Delta y}{\Delta x} = \frac{−2.751461 − (−3.961463)}{1 − 0} = 1.210002$$
ŷ = 1.21x − 3.9615
Figure 7.7: Scatterplot of x and y Featuring Our Least-Squares Regression Line
in Raw-Score Form
But, if all of that algebra isn’t quite your cup of tea, this method is quicker:
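The output below presumably comes from a call along these lines (assuming x and y already hold the example data):

summary(lm(y ~ x))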
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80264 -0.22414 0.00656 0.33794 0.63366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.9615 0.9505 -4.168 0.003133 **
## x 1.2100 0.1869 6.474 0.000193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4584 on 8 degrees of freedom
## Multiple R-squared: 0.8397, Adjusted R-squared: 0.8197
## F-statistic: 41.91 on 1 and 8 DF, p-value: 0.0001934
The jaunty cap sported by ŷ and ẑ_y in this section indicates that those are predicted values rather than observed values. The observed values don’t have to match the predictions – if they did, then all of the points on the scatterplot would be right on the regression line. There is always error in prediction; hence, the hat ˆ.

$$R^2 = r^2$$
R² as a statistic has special value. It is known – in stats classes only, but known – as a measure of the predictive advantage conferred by using the model to predict values of the y variable over using the mean of y (ȳ) to make predictions: the proportion of the variance in y that is explained by the model. Recall that the least-squares regression line minimizes
the squared distance between the points in a scatterplot and the line. The line
represents predictions and the distance represents the prediction error (usually
just called the error). The smaller the total squared distance between the points
and the line, the lower the error, and the greater the value of R².
R² is given by:

$$R^2 = \frac{SS_{total} - SS_{error}}{SS_{total}}$$
where SS_total is the sum of the squared differences between the observed y values and the mean ȳ, and SS_error is the sum of the squared differences between the y predicted by the model at each point x and the observed y for each x:

$$SS_{error} = \sum(y_{pred} - y_{obs})^2$$
(Table: each observation’s x, y, (y − ȳ)², y_pred, and (y_pred − y_obs)² values.)
$$R^2 = \frac{SS_{total} - SS_{error}}{SS_{total}} = 0.8397017$$
If there are no ties among the x values nor among the y values in the data, we
can calculate the sample Kendall τ̂ statistic using the following formula:
$$\hat{\tau} = \frac{2(n_c - n_d)}{n(n - 1)}.$$
When making pairwise rank comparisons, ties in the data reduce the number of
valid comparisons that can be made. For the purposes of calculating τ̂ when
there are tied data, there are three types of ties to consider:

1. x ties: pairs whose x observations are members of a tie cluster
2. y ties: pairs whose y observations are members of a tie cluster
3. xy ties: pairs where the x observation and the y observation are each members of a tie cluster (that is, the pair is doubly tied)
For example, please consider the following data (note: this dataset has been created so that the x values, the y values, and the alphabetical pair labels are all in generally ascending order for simplicity of presentation; this does not need to be the case with observed data):
pair x y rank x rank y
A 1 11 1.0 1.0
B 2 12 3.0 2.0
C 2 13 3.0 3.5
D 2 13 3.0 3.5
E 3 14 5.0 5.0
F 4 15 6.0 6.0
G 5 16 7.5 7.0
H 5 17 7.5 8.0
I 6 18 9.0 9.5
J 7 18 10.0 9.5
1. x ties: In the example data, there are two tied-x clusters. One tied-x
cluster comprises the x observations of pair B, pair C, and pair D: each
pair has an x value of 2, and those three x values are tied for 3rd lowest of
the x set. The other tie cluster comprises the x observations of pair G and
pair H, and those x values are tied for 7.5th lowest. Thus, there are two x
tie clusters: one of 3 members and one of 2 members. Let t_xi denote the number of members of each x tie cluster i and let m_x denote the number of x tie clusters; the number of comparisons lost to x ties is then:

$$T_X = \sum_{i=1}^{m_x}\frac{t_{xi}(t_{xi} - 1)}{2} = \frac{3(3 - 1)}{2} + \frac{2(2 - 1)}{2} = 4$$
2. y ties: In the example data, there are also two tied-y clusters. One tied-y
cluster comprises the y observations of pair C and pair D: each has an y
value of 13, and those two y values are tied for 3.5th lowest of the y set.
The other tie cluster comprises the y observations of pair I and pair J, each
with an observed y value of 18 and tied for 9.5th highest position. Thus,
there are two y tie clusters, each with 2 members. Let t_yi denote the number of members of each y tie cluster i; the number of comparisons lost to y ties is then:

$$T_Y = \sum_{i=1}^{m_y}\frac{t_{yi}(t_{yi} - 1)}{2} = \frac{2(2 - 1)}{2} + \frac{2(2 - 1)}{2} = 2$$
3. xy ties: There are two data pairs in the example data that share the same x rank and y rank – pair C and pair D – and those pairs are said to be doubly tied, or to constitute an xy cluster. Therefore, for this dataset there is one xy cluster with two members – the x and y values of pair C and the x and y values of pair D. Let t_xyi denote the number of members of each xy cluster i and let m_xy denote the number of xy clusters. We count the doubly-tied comparisons that are removed from the total possible comparisons used to calculate the τ̂ statistic as such:

$$T_{XY} = \sum_{i=1}^{m_{xy}}\frac{t_{xyi}(t_{xyi} - 1)}{2}$$
$$T_{XY} = \frac{2(2 - 1)}{2} = 1$$
$$n_{max} = \frac{n(n - 1)}{2} - T_X - T_Y + T_{XY}$$

The above equation removes the number of possible comparisons lost to ties by subtracting T_X and T_Y from the total possible number n(n − 1)/2, and then adds T_XY back in so that the doubly-tied comparisons – which were subtracted once in T_X and once in T_Y – are not removed twice.
With the number of available comparisons now adjusted for the possible
presence of ties, we may now re-present the equation for the τ̂ statistic as:
$$\hat{\tau} = \frac{n_c - n_d}{n_{max}}$$
To this point, the example datasets we have examined have been so small that
identifying and counting concordant-order comparisons, discordant-order
comparisons, tie clusters, and the numbers of members of tie clusters has been
relatively easy to do. To calculate the Kendall τ̂ with larger datasets invites a
software-based solution, but there is a problem specific to the τ̂ calculation that
can lead to inaccurate estimates when there are tied data in a set.
That problem stems from the specific correction for ties proposed by Kendall
(1945), following Daniels (1944), in an equation for a correlation known as the
Kendall tau-b correlation τ̂_b:

$$\hat{\tau}_b = \frac{n_c - n_d}{\sqrt{\left(\frac{n(n-1)}{2} - T_X\right)\left(\frac{n(n-1)}{2} - T_Y\right)}}$$
When there are no tied data, T_X and T_Y are both 0 and the denominator simplifies to $\sqrt{\left(\frac{n(n-1)}{2}\right)\left(\frac{n(n-1)}{2}\right)} = \frac{n(n-1)}{2}$, which is the same denominator used in the no-ties formula. When there are ties, however, we can recover the desired τ̂ estimate by multiplying the τ̂_b calculation by its denominator and dividing by the proper τ̂ calculation’s denominator (in other words: multiply by the wrong denominator to get rid of it and then divide by the correct denominator):
$$\hat{\tau} = \frac{\hat{\tau}_b\,\sqrt{\left(\frac{n(n-1)}{2} - T_X\right)\left(\frac{n(n-1)}{2} - T_Y\right)}}{\frac{n(n-1)}{2} - T_X - T_Y + T_{XY}}.$$
Given that we may be working with larger datasets in using this correction, it may seem that we have traded one problem for another: the above equation may give us the desired estimate of τ̂, but how do we find T_X, T_Y, T_XY, n_c, and n_d in the first place? In R, we can use the table() command to find the tie clusters in an array. For example:
x<-c(5, 5, 5, 8, 8, 13)
table(x)
## x
## 5 8 13
## 3 2 1
We can restrict the table to a subset with only those members of the array that
appear more than once:
table(x)[table(x)>1]
## x
## 5 8
## 3 2
And then, we can remove the header row with the unname() command, leaving
us with an array of the sizes of each tie cluster in the original array x:
unname(table(x)[table(x)>1])
## [1] 3 2
We then can use the count of values in that resulting array to represent the
number of tie clusters in the original array and the values in that resulting array
to represent the number of members in each tie cluster in the original array, a
procedure we can use to calculate T_X and T_Y.
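Putting that together, a short sketch for T_X (T_Y works the same way on the y array):

tx <- unname(table(x)[table(x) > 1])  # sizes of the tie clusters in x
TX <- sum(tx * (tx - 1) / 2)
TX  # 4 for the array x defined above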
To count xy clusters, a minor modification is needed. We can take two arrays (for example, x and y), pair them into a dataframe, and then the rest of the procedure is the same. Thus, given the data from Example 4:
x<-c(1, 2, 2, 2, 3, 4, 5, 5, 6, 7)
y<-c(11, 12, 13, 13, 14, 15, 16, 17, 18, 18)
xy<-data.frame(x,y)
unname(table(xy)[table(xy)>1])
## [1] 2
we reach the same conclusion as we did when counting by hand: there is one xy
cluster with two members in this dataset.
However we count the number of tie clusters and the number of members in each tie cluster, once T_X, T_Y, and T_XY are determined, there are algebraic means of finding n_c and n_d:

$$n_c = \frac{n(n-1)}{4} - \frac{T_X}{2} - \frac{T_Y}{2} + \frac{T_{XY}}{2} + \frac{\hat{\tau}_b\,\sqrt{\left(\frac{n(n-1)}{2} - T_X\right)\left(\frac{n(n-1)}{2} - T_Y\right)}}{2}$$

$$n_d = n_{max} - n_c$$
1. Cohen, J. (2013). Statistical power analysis for the behavioral sciences.
Academic press.↩
4. The binomial test used as the Classical example of Classical and Bayesian
Inference is an example of a nonparametric test. In that test, the patterns
being analyzed were the number of observed successes or more in the
number of trials given the assumption that success and failure were equally
likely. In the Classical approach to that example, we were not concerned
with parameters like the mean number of successes or the standard
deviation of the number of successes in the population.↩
5. The p-values are different, though, as they are based on permutations. For
significance, use the value given by software or feel free to consult this
old-ass table↩
Here’s a related example: radar operators have to look at radar screens and
be able to identify little lights that represent things like airplanes among other
little lights that represent things like rainstorms.
The tones in a hearing test and the lights that indicate aircraft on a radar
screen are known as signals. The fuzzy sounds in a hearing test and the
atmospheric events on a radar screen are known as noise.
Here’s an example of how factors other than relative strengths of signal and
noise can influence whether or not we see – or at least report seeing – signals
in noise. Let’s say I am an experimenter and you are a signal-detecting
participant and we run two experiments with monetary rewards associated
with them. In each experiment, I give you a signal detection task: it could be
hearing tones, seeing lights, feeling temperature changes, or any other kind of
perceptual thing of your choosing (it doesn’t matter to me – it can be whatever
helps you understand the example). In Experiment 1, I will give you 1 US
Dollar for every time you correctly identify a signal with no penalty for being
wrong, and you could earn up to $20 if you get them all. In Experiment 2, I
will give you $20 at the start and take away $1 every time you incorrectly
identify a signal with no bonus for being right. Far be it from me to assume
how you would behave in each of those experiments, but if I had to guess, I
would imagine that you might be more likely to identify signals in Experiment
1 than you would be in Experiment 2. It would only be natural to risk being
wrong more easily in Experiment 1 and to be more cautious about being
wrong in Experiment 2. From a strictly monetary-reward-maximizing
perspective, the best strategy would be to say that you are seeing signals all
the time in Experiment 1 and to say that you are never seeing signals in
Experiment 2. Of course, those are two extreme examples, but we could – and
as we’ll explore later in this chapter, do – tweak the schedule of rewards and
penalties so that the decisions to make are more difficult.
Decision<-rep("Operator Response", 2)
rejectaccept<-c("Signal Present", "Signal Absent")
Yes<-c("Hit", "Miss")
No<-c("False Alarm", "Correct Rejection")
Decision<-rep("Diagnosed", 2)
rejectaccept<-c("Yes", "No")
Yes<-c("True Positive", "False Alarm")
No<-c("Miss", "True Negative")
kable(data.frame(Decision, rejectaccept, Yes, No), "html", esc
kable_styling() %>%
add_header_above(c(" "=2, "Do They Have the Condition?"=2))
collapse_rows(1)
Decision<-rep("Decision", 2)
rejectaccept<-c("Reject $H_0$", "Continue to Assume $H_0$")
Yes<-c("Correct Rejection of $H_0$", "False Alarm")
No<-c("Miss", "$H_0$ Correctly Assumed")
Real Effect?
Yes No
Correct Rejection of
Decision Reject H 0 Miss
H0
Real Effect?
Yes No
Continue to Assume H Correctly
Decision False Alarm 0
H0 Assumed
set.seed(123)
n <- 100
mu <- c(x = 0, y = 0)
Rstrong <- matrix(c(1, 0.95,
0.95, 1),
nrow = 2, ncol = 2)
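The chunk above is cut off in this copy. A minimal sketch of how the rest of the simulation might look – assuming the pairs were drawn with MASS::mvrnorm() and that a second correlation matrix was defined for the 0.15 case (both are my assumptions) – is:
library(MASS)
Rweak <- matrix(c(1, 0.15,
                  0.15, 1),
                nrow = 2, ncol = 2)
strong.sample <- mvrnorm(n, mu = mu, Sigma = Rstrong)  ## 100 (x, y) pairs with population correlation 0.95
weak.sample <- mvrnorm(n, mu = mu, Sigma = Rweak)      ## 100 (x, y) pairs with population correlation 0.15
cor(strong.sample[, 1], strong.sample[, 2])  ## the 0.94 and 0.06 reported below came from the author's run;
cor(weak.sample[, 1], weak.sample[, 2])      ## exact values here depend on the seed and draw order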
The (x, y) pairs with population-level correlations of 0.95 and 0.15 led to
samples that have correlations of 0.94 and 0.06, respectively. The former
correlation is statistically significant at the α = 0.05 level (p < 0.001); the
latter is not (p = 0.5423). In the signal-detection metaphor, the models
indicated by the least-squares regression lines represent the signal and the
distances between the lines and the dots represent the noise.
In regression terms, $R^2$ represents the proportion of variance explained by the model. The proportion of the variance that is not explained by the model – that is, $1 - R^2$ – is the error. The model in part A of Figure 8.2 explains 89.1% of the variance in the y data. It's an enormous effect size. In SDT terms, it is a signal so strong that it would be visible despite the strength of any noise (or: when the correlation is that strong on the population level, it is nearly impossible to sample 100 observations at random that wouldn't lead to rejecting the null hypothesis). The error associated with the model is $1 - R^2 = 0.109$. If we take the error to be the noise, then the signal-to-noise ratio is nearly 9:1. The model in part B of Figure 8.2, by contrast, represents an $R^2$ of $0.062^2 = 0.0038$, therefore explaining less than four-tenths of one percent of the variance in y.
The central tenet of signal detection theory is that the decisions that are made
by operators under different conditions are all products of underlying strength
distributions of the signal and of the noise. The strength of a signal and the
strength of noise occupy more than single points in an operator’s mind: if they
did, detecting signals would be deterministic, and operators would always
choose based on the stronger of the signal and the noise. We know that
operators don’t behave like that, and so the basic theory is that the strength of
the signal and the strength of the noise are represented by distributions. We
assume that both distributions are normal distributions, as depicted in Figure
8.3. And, we assume that noise is always present, so the distribution usually
referred to as the signal distribution is sometimes (and technically more
theory-aligned) referred to as the signal plus noise distribution.
We don’t really know what the x and y values are for the signal distribution
and the noise distribution and it honestly doesn’t matter. All we are
interested in is the shapes of the curves relative to each other so that we can
learn about how people make decisions based on their relative perceived
strength. Since the placement doesn’t matter, we can set one of those curves
wherever we want and then measure the other curve relative to the set one.
And since we can set either one of the curves wherever we want, to make our
mathematical lives easier, it is a very good idea to set the noise distribution to
have a mean of 0 and a standard deviation of 1. In other words, we assume
that the perception of noise follows a standard normal distribution.5
8.3 Distinguishing Signal from Noise
The point of signal detection theory is to understand the underlying perception
of signal strength with respect to noise and how operators make decisions
given that perception. Of all of the measurements that come out of signal
detection frameworks, three statistics are most frequently used: d′, which
measures the discriminability – in terms of the curves, it’s the distance
between the peak of the noise curve and the signal (plus noise) curve –
between signal and noise, β, which measures the response bias – the tendency
to say yes or no at any point; and the C-statistic (the "C" in "C-statistic" is the only statistical symbol that is not abbreviated in APA format, and I have never found a good explanation for why that is), which measures the predictive ability of an operator by taking into account both true positives and false alarms.
Of those three, the β statistic is the least frequently used – it’s more in the
bailiwick of hardcore SDT devotees – but we’ll talk about it anyway. It will
give us perhaps our only opportunity in this course to use probability density
as a meaningful measurement, just like the last sentence gave me perhaps my
only opportunity to use the word “bailiwick” in a stats reading.
8.3.1 d′
It’s relatively easy to pick out a signal when it is on average much stronger
than the noise. In a hearing test, it’s easy to pick out the tones if those tones are
consistently much louder than the background noise; on a radar screen, it’s
easier to pick out the planes when those lights are consistently much brighter
than the atmospheric noise. In terms of the visual display of the underlying
signal and noise distributions, that sort of situation would be represented by a
signal distribution curve with its center further to the right on the strength axis
than the center of the noise distribution.
Figure 8.5: Relatively Large and Small Values of d′
There is one scenario that would consistently produce a negative value of d′.
In the 2018 film Spider-Man: Into the Spider-Verse, high school student (and budding Ultimate Spider-Man) Miles Morales intentionally answers every question on a 100-question true/false test incorrectly.
Figure 8.6: Absolute Masterpiece
8.3.2 β
The β statistic (not to be confused with the β distribution or any of the other
uses of the letter β in this course) is a measure of response bias: whether an
operator’s response is more likely to indicate signal or to indicate noise at
any point. The β statistic – of which there can be one or there can be many
depending on the experiment – is the ratio of the probability density of the
signal plus noise curve to the density of the noise curve at a criterion point. If
a criterion is relatively strict, then an operator is likely to declare that they
perceive a signal only when they have strong evidence to believe so: there
will be few hits at a strict criterion point and few false alarms as well. If a
criterion is relatively lenient, then an operator is generally more likely to
declare that they perceive signals: there will be relatively many hits at a
lenient criterion point and many false alarms as well. Stricter criteria, as illustrated in Figure 8.7, are further to the right on the strength axis, where the probability density of the signal distribution is high relative to the probability density of the noise distribution (that is, the signal line is higher than the noise line); more lenient criteria sit further to the left, where the probability density of the noise distribution is higher relative to that of the signal distribution.
The term response bias may imply that it describes a feature of a given
operator, but that is not necessarily the case. The response bias is largely a
feature of the criterion that the operator has adopted, which in turn can vary
based on circumstances. For example, in a signal-detection experiment where
an operator receives a reward (monetary or otherwise) for each hit that they
register with no penalty for false alarms, the operator has motivation to adopt
a more lenient criteria – they really should say that the signal is present all the
time. Conversely, in a situation where an operator is penalized for false
alarms and not rewarded for hits, then the operator may be motivated to adopt
a more stringent criterion – they might say that the signal is never present.
Signal-detection experiments often take advantage of the variability of criteria
in order to measure more of the underlying noise and signal distributions by
observing more data points.
8.3.3 C-statistic
The area under the ROC curve for d′ = 1 in Figure 8.9 – that is, the C-statistic – is 0.76, the C-statistic for d′ = 2 is 0.921, and the C-statistic for d′ = 3 is 0.983 (see Figure 8.10).
Figure 8.10: C-statistics for Three ROC Curves with d′ = 3, d′ = 2, and
d′ = 1
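As an aside: if these are equal-variance ROC curves (which the single d′ labels suggest), the area under each theoretical curve can be computed directly from d′ using the standard relationship $AUC = \Phi(d'/\sqrt{2})$ – a relationship I'm supplying here, not one derived on this page:
pnorm(c(1, 2, 3)/sqrt(2))   ## approximately 0.760, 0.921, and 0.983 for d' = 1, 2, and 3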
Figure 8.11: Empirical ROC Curve for the Green & Swets (1966) Data.
The hit rate at each point is the frequency at which the individual correctly
identified a signal. In terms of the Noise/Signal+Noise Distribution
representation, that rate is taken to be the area under the signal+noise
distribution curve to the right of a given point on the strength axis: everything
to the right of the point (indicating greater strength) will be identified by the
operator as a signal.
The false alarm rate at each point is the frequency at which the individual
misidentifies noise as a signal. In terms of the Noise/Signal+Noise
Distribution representation, that rate is the area under the noise distribution
curve to the right of a given point on the strength axis: as with the hit rate,
everything to the right of that point will be (mis)identified by the operator as a
signal.
Thus, the hit rate and the false alarm rate give us the probabilities that an
operator is responding to their perception of signal and noise, respectively,
under each condition. Figure 8.12 depicts these probabilities for the signal
and noise distributions plotted separately for condition 3 (nothing super-
special about that condition, I just had to pick one).
Figure 8.12: Areas Under the Noise and Signal + Noise Curves Indicating
Probability of False Alarm and Hit, Respectively
Because both the signal and the noise distributions are normal distributions,
based on the area in the upper part of the curve, we can calculate z-scores that
mark the point on those curves that define those areas. Additionally, because the noise distribution is assumed to be a standard normal distribution, the values of $z_{FA}$ are also x-values on the strength axis. The signal distribution lives on the same axis, but its z-scores are based on its own mean and
standard deviation, and so for the points on the strength axis defined by the
criteria which, in turn, vary according to experimental condition (that is, the
motivation for adopting more stringent or more lenient criteria are
manipulated experimentally), the same x points will represent different z
values for the signal and for the noise distribution.
Thus, we take the experimentally-observed proportions of hits and false alarms, we consider those to be the probabilities of hits and false alarms, translate those probabilities into upper-tail areas under normal distributions, and find the z-scores that define those upper-tail probabilities: those will be our $z_{Hit}$ values (for the signal distribution probabilities) and our $z_{FA}$ values (for the noise distribution probabilities).
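In R, that translation is a one-liner with qnorm(). The hit and false alarm rates below are hypothetical stand-ins rather than values from the Green & Swets data:
hit.rate<-0.80
fa.rate<-0.30
z.hit<-qnorm(1 - hit.rate)  ## the z-score with an upper-tail area equal to the hit rate
z.fa<-qnorm(1 - fa.rate)    ## the z-score with an upper-tail area equal to the false alarm rate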
At this point, our analyses hit a fork in the road. The observed ROC curve is
based on data: it does not change based on how we choose to analyze the
data, and the C-statistic does not change either. The d′ and β statistics will change (not by a whole lot, but meaningfully) based on what we believe about the shape of the signal distribution.
There are two assumptions that we can make about the signal distribution that will alter our evaluation of d′ and β:
1. Equal Variance: The signal and the noise distributions have the same variance.
2. Unequal Variance: The signal and the noise distributions can have different variances.
This page will cover both. We will start by analyzing the sample data based
on the unequal variance assumption, then circle back to analyze the same data
using the equal variance assumption.
Our sample data come from an experiment with five different conditions, each
one eliciting a different decision criterion (and, in turn, a different response
bias). To find the overall d′ (and β, which will depend on first finding d′), we
will use a tool known as the linearized ROC: a transformation of the ROC curve mapped on the Cartesian (x, y) plane.9 The linearized ROC plots $z_{Hit}$ on the x-axis and $z_{FA}$ on the y-axis (see Figure 8.13 below). That's potentially confusing since the ROC curve has the hit rate on the y-axis and the FA rate on the x-axis. But, putting $z_{Hit}$ on the x-axis and $z_{FA}$ on the y-axis makes the math much more straightforward, and we only need the linearized ROC to calculate d′, so it's likely worth a little (temporary) confusion. Thus, for the linearized ROC and only for the linearized ROC, $z_{FA} = y$ and $z_{Hit} = x$.
The slope of the linearized ROC in the form $y = \hat{a}x + \hat{b}$ is the ratio of the standard deviation of the y variable to the standard deviation of the x variable:

$$\hat{a} = \frac{sd_y}{sd_x} = \frac{1.25}{1.07} = 1.17.$$

Plugging the mean of the $z_{FA}$ values and the mean of the $z_{Hit}$ values into the equation of the line lets us solve for the intercept:

$$\hat{y} = \hat{a}x + \hat{b}$$
$$\hat{y} = 1.17x + \hat{b}$$
$$0.342 = 1.17(-0.406) + \hat{b}$$
$$\hat{b} = 0.817$$
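The same arithmetic in R, using the summary values reported above (the variable names are placeholders of mine):
zhit.mean<- -0.406
zhit.sd<-1.07
zfa.mean<-0.342
zfa.sd<-1.25
a.hat<-zfa.sd/zhit.sd               ## slope: ratio of the standard deviations (about 1.17)
b.hat<-zfa.mean - a.hat*zhit.mean   ## intercept: solve y = ax + b at the means (about 0.817)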
The β estimates at each criterion point – as noted above – are the ratios of the
probability density of the signal plus noise distribution to the probability
density of the noise distribution at each point.
We can use the noise distribution to locate points on the strength axis: we
assumed the noise distribution is a standard normal so it is centered at 0 and
its standard deviation is 1 so the z-values for the noise distribution are also
x values. Those values are listed on the x-axis in Figure 8.13 We will use
those values to find the probability density at each point for the noise
distribution.
Our next step is to find what those x values represent in terms of the signal
plus noise distribution. We will calculate z-scores for the signal distribution
that correspond to each of the criterion points based on the estimates of the
mean and standard deviation of the signal distribution we got from the
linearized ROC, which are separate from the $z_{Hit}$ values we got from the hit rates in the experimental data. The mean of the signal distribution is equal to d′: $\mu_{Signal} = 0.817$, and the standard deviation of the signal distribution is estimated by the slope of the linearized ROC: $\sigma_{Signal} = 1.17$. Plugging each criterion point into the signal distribution's z-score formula, $z = \frac{x - \mu_{Signal}}{\sigma_{Signal}}$, gives the signal-distribution z-score at each criterion.
To find the β values at each criterion point, we next find the probability
density for each curve given the respective z values and the mean and
standard deviation of each curve. To do so, we can simply use the dnorm()
command: to find the densities for the noise distribution – which, again, is a
standard normal distribution – we use the values of znoise as the x variable in
the command dnorm(x, mean=0, sd=1) (as a reminder: mean=0, sd=1 are
the defaults for dnorm(), so in this specific case you can leave those out if
you prefer), and to find the densities for the signal distribution, we use the
values of $z_{signal}$ as the x variable in the command dnorm(x, mean=0.817, sd=1.17).
Under the equal-variance assumption, the slope of the linearized ROC is taken to be 1, so the line becomes:

$$y = x + d', \quad \text{that is,} \quad z_{FA} = z_{Hit} + d'$$

$$d' = z_{FA} - z_{Hit}$$

Again plugging in the mean of the $z_{FA}$ values and the mean of the $z_{Hit}$ values: $d' = 0.342 - (-0.406) = 0.748$.
Because d′ has changed with the change in assumptions, the estimate of the mean of the signal distribution $\mu_{Signal}$ has also changed: both are now 0.748. Because we now assume equal variance between the noise and the signal plus noise distributions, the variance and the standard deviation of the signal plus noise distribution are the same as those for the noise distribution; that is to say, $\sigma^2_{Noise} = 1$ and $\sqrt{\sigma^2_{Noise}} = \sigma_{Noise} = 1$ (because we still assume that the noise distribution is a standard normal), and thus $\sigma^2_{Signal} = 1$ and $\sqrt{\sigma^2_{Signal}} = \sigma_{Signal} = 1$. Our estimates of $\beta_c$ for each criterion point will also change – the z-scores and the corresponding probability densities for the noise distribution don't change, but the z-scores and the densities for the signal plus noise distributions do. Replacing the mean and standard deviation of the signal plus noise distribution derived using the unequal-variance assumption – namely: $\mu_{Signal} = 0.817$ and $\sigma_{Signal} = 1.17$ – with the mean and standard deviation derived under the equal-variance assumption – $\mu_{Signal} = 0.748$ and $\sigma_{Signal} = 1$:
betaequal.df$noisedensity<-dnorm(znoise)
betaequal.df$signaldensity<-dnorm(zsignal, 0.748, 1)
betaequal.df$beta<-betaequal.df$signaldensity/betaequal.df$noisedensity
8.4.3.1 The Area Under the ROC Curve: AUC, or, the C-statistic
The C-statistic is also known as the area under the ROC curve (AUC) and is
literally the area under the curve. For the experimental data from Green &
Swets, should we want to calculate the C-statistic, we merely need to treat the
area under the observed curve as a series of triangles and rectangles as shown
in Figure 8.16:
Figure 8.16: Breaking Down The Empirical ROC from the Green & Swets
(1966) Data.
If we add up the areas of all those triangles ($A = \frac{1}{2}bh$) and all those rectangles ($A = bh$), the total is the area under the empirical ROC curve: the C-statistic.
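A generic way to do that sum in code is the trapezoidal rule – each trapezoid is one rectangle plus one triangle. The (false alarm, hit) points below are hypothetical placeholders, not the Green & Swets values:
fa<-c(0, 0.10, 0.30, 0.55, 0.80, 1)    ## false alarm rates in increasing order, with (0, 0) and (1, 1) added
hit<-c(0, 0.35, 0.65, 0.85, 0.95, 1)   ## the corresponding hit rates
auc<-sum(diff(fa)*(head(hit, -1) + tail(hit, -1))/2)   ## trapezoidal rule
auc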
For the Green & Swets data, we can translate the hit and false alarm rates to
binary responses because we know that there were n = 200 target-present
trials and n = 200 target-absent trials for each condition: each hit and each
false alarm gets a value of 1 for the response, and each miss and each correct
rejection gets a value of 0 for the response. Likewise, we can assign a value
of 1 for the binary condition of the target being present or absent for each of
the target-present trials and a value of 0 for each of the target-absent trials.
Pooling the data across all of the trials, here is a contingency table for the
operator responses vs. whether the target was present or absent:
Decision<-rep("Operator Response", 2)
rejectaccept<-c("Signal Present", "Signal Absent")
Yes<-c(599, 395)
No<-c(401, 605)
$$\ln\left(\frac{p}{1-p}\right) = b_0 + b_1 x$$

where $b_0$ is the intercept and $b_1$, the coefficient, is the logarithm of the odds ratio associated with a one-unit change in the predictor.
But for our purposes, all we need to know is that we will use almost the exact
commands to predict the binary response outcome from the target-presence
variable as we did to predict y from x in linear regression. The only changes
are that we will call glm() instead of lm() – generalized linear model instead of
linear model – and we will indicate that this is a logistic regression – with a
binary outcome – by including the command family = "binomial" in the
glm() parentheses (press “Code” to see the commands).
logistic.model<-glm(response~signal.present, family="binomial")
Then, we will use the Cstat() command from the DescTools package to get
the C-statistic:
Cstat(glm(response~signal.present, family="binomial"))
## [1] 0.6020037
In the foregoing example from Green & Swets (1966), the binary presence or
absence of the signal was used to predict operator response. Using logistic
regression allows us to include much more data in the prediction. If, say, we
were not interested in predicting operator response but in predicting the
binary presence or absence of a medical condition, we could include multiple
health-related predictor variables to help inform the prediction.
For example, we can use systolic blood pressure data to predict high blood
pressure from some of the other variables in the dataset. First, we need to
define a binary variable for the prediction. We can (rather arbitrarily) define
“high” blood pressure as “above the median SBP level for this dataset:”
sbp<-data.frame(read.csv("data/hd.csv")) sbp$High<-
ifelse(sbp$SBP>median(sbp$SBP), 1,0)
sbpmodel<-glm(High~Age+BMI+Smoke, data=sbp,
family="binomial")
Cstat(sbpmodel)
sbp<-data.frame(read.csv("data/hd.csv"))
sbp$High<-ifelse(sbp$SBP>median(sbp$SBP), 1,0)
sbpmodel<-glm(High~Age+BMI+Smoke, data=sbp, family="binomial")
Cstat(sbpmodel)
## [1] 0.8554688
We find that the C-statistic for the model is 0.86. For context: if the model were no better than chance, we would expect a C-statistic around 0.5, and usually C-statistics greater than 0.7 are considered to show reasonably good predictive power for the model and C-statistics greater than 0.8 are considered to indicate strong predictive power.10
2. Fun fact: have you ever heard that eating carrots improves eyesight
and/or confers the ability to see in the dark? That’s the result of a WWII
British intelligence campaign to prevent Germany from figuring out that
Royal Air Force pilots had on-board radar. Carrots are still good for
you, though, and can help in maintaining (but not super-powering)
eyesight.↩
3. It’s the central conceit of the book The Signal and the Noise by the
increasingly insufferable Nate Silver.↩
6. C-statistics can be less than 0.5 for the exact same Ultimate-Spider-
Man-explained reasons that d′ can be negative: it’s an artifact of chance
responding or it’s a result of systematic misunderstanding of responses.↩
7. David Green and John Swets’s book is one of the seminal texts on
statistical analysis of psychophysical measures. The full citation is:
Green, D. M., & Swets, J. A. (1966). Signal detection theory and
psychophysics (Vol. 1). New York: Wiley.↩
8. If there is only one experimental condition and thus one criterion point,
then the point is moot: you would have to assume equal variances
because you wouldn’t have any way of assessing different variances.↩
10. The usual caveats about interpreting effect sizes apply, and the citation
for those figures is: Hosmer, D. W., & Lemeshow, S. (2000). Applied
logistic regression. New York: Wiley.↩
9 Markov Chain Monte Carlo
Methods
9.1 Let’s Make a Deal
In the September 9, 1990 issue of Parade Magazine, the following question
appeared in Marilyn vos Savant’s “Ask Marilyn” column:
This is the most famous statement of what came to be known as The Monty Hall
Problem due to the similarity of its setup to the game show Let’s Make a Deal,
hosted in the 60’s, 70’s, 80’s, and 90’s by Monty Hall. This article lays out the
history of the problem and the predictably awful reason why vos Savant in
particular drew so voluminous and vitriolic a response for her correct answer
(hint: it rhymes with “mecause she’s a woman”).
The Monty Hall Problem is set up nicely for solving with Bayes’s Theorem. In
this case, we are interested in comparing:
1. the posterior probability that the prize is behind the door originally picked
by the contestant and
2. the posterior probability that the prize is behind the door that the contestant
can switch to.
For simplicity of explanation, let’s say that the contestant originally chooses
Door #1, and the host shows them that the prize is not behind Door #2 (the
answer is the same regardless of what specific doors we choose to calculate the
probabilities for). Thus, we are going to compare the probability that the prize is
behind Door 1 given that the host has shown that the prize is not behind Door 2 – we'll call that $p(Door_1|Show_2)$ – with the probability that the prize is behind Door 3 given that the host has shown Door 2 – $p(Door_3|Show_2)$.
The prior probability that the prize is behind Door #1 is $p(Door_1) = 1/3$ and the prior probability that the prize is behind Door #3, $p(Door_3)$, is also 1/3 – without any additional information, each door is equally likely to be the winner. Since the contestant has chosen Door #1, the host can only show either Door #2 or Door #3. If the prize is in fact behind Door #1, then each of the host's choices is equally likely, so $p(Show_2|Door_1) = 1/2$. However, if the prize is behind Door #3, then the host can only show Door #2, so $p(Show_2|Door_3) = 1$. Thus, the likelihood associated with keeping a choice is half as great as the likelihood associated with switching.
The base rate is going to be the same for both doors, so it’s not 100% necessary
to calculate, but here it is anyway:
$$p(Show_2) = p(Door_1)p(Show_2|Door_1) + p(Door_3)p(Show_2|Door_3) = \left(\tfrac{1}{3}\right)\left(\tfrac{1}{2}\right) + \left(\tfrac{1}{3}\right)(1) = \tfrac{1}{2}$$
Thus, the probability of Door #1 having the prize given that Door #2 is shown –
and thus the probability of winning by staying with the original choice of Door #1
– is:
$$p(Door_1|Show_2) = \frac{(1/3)(1/2)}{1/2} = \frac{1}{3}$$
and the probability of the prize being behind Door #3 and thus the probability of
winning by switching is:
$$p(Door_3|Show_2) = \frac{(1/3)(1)}{1/2} = \frac{2}{3}$$
What does this have to do with Monte Carlo simulations? Well, the prominent mathematician Paul Erdős famously refused to believe that switching was the correct strategy until he saw a Monte Carlo simulation of it, so that's what we'll do.
First, let’s randomly put the prize behind a door. To simulate three equally likely
options, we’ll draw a random number between 0 and 1: if the random number is
between 0 and 1/3, that will represent the prize being behind Door 1, if the
random number is between 1/3 and 2/3, that will represent the prize being
behind Door 2, and if the random number is between 2/3 and 1, that will
represent the prize being behind Door 3.
The default parameters in the base R distribution commands with the root
unif() define a continuous uniform distribution that ranges from 0 to 1. Using
the command runif(1) will take one sample from that uniform distribution: it’s
a random number between 0 and 1. We can use the ifelse command to record a
value of 1, 2, or 3 for the simulated prize location:
x1<-runif(1)
Prize<-ifelse(x1<=1/3, 1,
       ifelse(x1<=2/3, 2,
              3))
Prize
## [1] 1
Next, we can simulate the contestant’s choice. Since the contestant is free to
choose any of the three doors and their choice is not informed by anything but
their own internal life, we can use the same algorithm that we used to place the
prize:
x2<-runif(1)
Choice<-ifelse(x2<=1/3, 1,
        ifelse(x2<=2/3, 2,
               3))
Choice
## [1] 3
Now, here’s the key step: we simulate which door the host shows the contestant.
It’s going to be a combination of rules and randomness:
1. If the contestant has chosen the prize door, then the host shows either of the
two other doors with p = 1/2 per door.
a. We can draw another random number between 0 and 1: if it is less than 1/2
then one door is chosen, if it is greater than 1/2 then the other is chosen.
2. If the contestant has not chosen the prize door, then the host must show the
door that is neither the prize door nor the contestant’s choice:
x3<-runif(1)
if(Prize==1){
if(Choice==1){
ShowDoor<-ifelse(x3<=0.5, 2, 3)
}
else if(Choice==2){
ShowDoor<-3
}
else{
ShowDoor<-2
}
} else if(Prize==2){
if(Choice==1){
ShowDoor<-3
}
else if(Choice==2){
ShowDoor<-ifelse(x3<=0.5, 1, 3)
}
else{
ShowDoor<-1
}
} else if(Prize==3){
if(Choice==1){
ShowDoor<-2
}
else if(Choice==2){
ShowDoor<-1
}
else{
ShowDoor<-ifelse(x3<=0.5, 1, 2)
}
}
ShowDoor
## [1] 2
Given the variables Prize, Choice, and ShowDoor, we can make another
variable StayWin to indicate if the contestant would win (StayWin = 1) or
lose (StayWin = 0) by keeping their original choice:
StayWin<-ifelse(Prize==Choice, 1, 0)
StayWin
## [1] 0
And we can also create a variable SwitchWin to indicate if the contestant would win (SwitchWin = 1) or lose (SwitchWin = 0) by switching their choices:
if (Choice==1){
if (ShowDoor==2){
Switch<-3
} else if (ShowDoor==3){
Switch<-2
}
} else if (Choice==2){
if (ShowDoor==1){
Switch<-3
} else if (ShowDoor==3){
Switch<-1
}
} else if (Choice==3){
if (ShowDoor==1){
Switch<-2
} else if (ShowDoor==2){
Switch<-1
}
}
SwitchWin<-ifelse(Prize==Switch, 1, 0)
SwitchWin
## [1] 1
Thus, to recap, in this specific random game: the prize was behind Door #1, the contestant chose Door #3, and the host showed Door #2: the contestant would win by switching and lose by staying. But, that's just one specific game: in games where chance is involved, good strategies often lose and bad strategies often win.^["My shit doesn't work in the play-offs. My job is to get us to the play-offs. What happens after that is fucking luck."]
The following code takes all the lines of code for the single-repetition game and
wraps it in a for loop with 1,000 iterations.
StayWin<-rep(NA, 1000)
SwitchWin<-rep(NA, 1000)
for (i in 1:1000){
x1<-runif(1) ## Select a random number between 0 and 1
Prize<-ifelse(x1<=1/3, 1, ## If the number is between 0 and 1/3, the prize is behind Door 1
       ifelse(x1<=2/3, 2, ## Otherwise, if the number is less than 2/3, it is behind Door 2
              3)) ## Otherwise, the prize is behind Door 3
## Simulating contestant's choice
x2<-runif(1)
Choice<-ifelse(x2<=1/3, 1,
        ifelse(x2<=2/3, 2,
               3))
x3<-runif(1)
if(Prize==1){
if(Choice==1){
ShowDoor<-ifelse(x3<=0.5, 2, 3)
}
else if(Choice==2){
ShowDoor<-3
}
else{
ShowDoor<-2
}
} else if(Prize==2){
if(Choice==1){
ShowDoor<-3
}
else if(Choice==2){
ShowDoor<-ifelse(x3<=0.5, 1, 3)
}
else{
ShowDoor<-1
}
} else if(Prize==3){
if(Choice==1){
ShowDoor<-2
}
else if(Choice==2){
ShowDoor<-1
}
else{
ShowDoor<-ifelse(x3<=0.5, 1, 2)
}
}
StayWin[i]<-ifelse(Prize==Choice, 1, 0)
if (Choice==1){
if (ShowDoor==2){
Switch<-3
} else if (ShowDoor==3){
Switch<-2
}
} else if (Choice==2){
if (ShowDoor==1){
Switch<-3
} else if (ShowDoor==3){
Switch<-1
}
} else if (Choice==3){
if (ShowDoor==1){
Switch<-2
} else if (ShowDoor==2){
Switch<-1
}
}
SwitchWin[i]<-ifelse(Prize==Switch, 1, 0)
}
MontyHallResults<-data.frame(c("Staying", "Switching"), c(sum(StayWin), sum(SwitchWin)))
colnames(MontyHallResults)<-c("Strategy", "Wins")
ggplot(MontyHallResults, aes(Strategy, Wins))+
geom_bar(stat="identity")+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)+
labs(x="Strategy", y="Games Won")
Figure 9.1: Frequency of Wins in 1,000 Monty Hall Games Following ‘Stay’
Strategies and ‘Switch’ Strategies
In our simulation, a player who stayed every time would have won 331 games
for a winning percentage of 0.331; a player who switched every time would have
won 669 games for a winning percentage of 0.669. So, Marilyn vos Savant and
Thomas Bayes are both vindicated by our simulation.
Please note that we arrived at our simulation result without including anything in
the code about Bayes’s Theorem or any other mathematical description of the
problem. Instead, we arrived at estimates of the expected value of the
distribution of wins by describing the logic of the game. That’s a valuable feature
of MCMC methods: in situations where we might not have firm expectations for
the posterior distribution of a complex process, we can arrive at estimates based
on taking those processes step-by-step.
Another matter of interest (possibly only to me): the game described in the Monty
Hall Problem was never really played on Let’s Make a Deal.
In this example, the expected value of each gamble is 0 – one is just as likely to win $10 as to lose $10 – and the variance of each gamble is the probability-weighted square of the deviations from that expectation:

$$\left(\frac{1}{2}\right)(\$100) + \left(\frac{1}{2}\right)(\$100) = \$100,$$

so the standard deviation is $\sqrt{\$100} = \$10$. Thus, in the long run, the expectation of the gamble over N trials is $0 \times N$, the variance is $\$100 \times N$, and the standard deviation is $\$10 \times \sqrt{N}$.
But, that doesn't really give us an accurate description of the
gambling experience, nor, by extension, of how people make decisions while
gambling. If somebody went to a casino with $100, they might be less likely to
want to know what would happen if I went to the casino an infinite number of times and played an infinite number of games? and more likely to want to know will I win? With repeated Random Walk models – that is, running models like the one
above 1,000 or 1,000,000 times or whatever we like – we can model the rate at
which a gambler could run out of money, the rate at which they could hit a target
amount of winnings at which they decide to walk away, or the rate of anything in-
between.
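As a sketch of what running repeated Random Walk models looks like in code – the bankroll, stakes, target, and number of repetitions here are all made-up choices of mine:
set.seed(77)
reps<-1000
ruined<-0
hit.target<-0
for (r in 1:reps){
  money<-100                          ## start with a $100 bankroll
  for (step in 1:1000){               ## at most 1,000 even-odds $10 bets
    money<-money + sample(c(-10, 10), 1)
    if (money<=0){ruined<-ruined+1; break}
    if (money>=200){hit.target<-hit.target+1; break}
  }
}
ruined/reps        ## proportion of simulated gamblers who went broke
hit.target/reps    ## proportion who doubled their money and walked away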
We also can use MCMC models as blunt-force tools for understanding stochastic
processes where analytic methods come up short. Famously, mathematicians have
been unable to use analytic methods to determine the probability of winning a
game of Klondike Solitaire (see Figure 9.4). Every time you start a new game,
the result of the prior game has no bearing on the next one.
Figure 9.4: Solitaire is a Lonely Man’s Game.
By simulating games using Monte Carlo methods, computer scientists have found
solutions to the Solitaire problem where mathematicians haven't been able to.
On the remote chance that you, dear reader, may be interested in statistical
analysis regarding things other than game shows, or gamblers, or solitaire, or
boxing matches, we turn now to the main point:
There are classical methods that make use of repeated sampling, most
prominently the calculation of bootstrapped confidence intervals, but these also
rely on the assumption of probability distributions. As described in Classical
and Bayesian Inference, once the posterior distribution on a parameter (or set of
parameters) is determined, then various inferential questions can be answered,
such as: what is the highest density interval (HDI)? or what is the probability
that the parameter is greater than/less than a variable of interest?
Here we will focus on the most flexible – although not always the most efficient
– MCMC method for use in Bayesian inference: the Metropolis-Hastings (MH)
algorithm. We will discuss how it works, how to get estimates of a posterior
distribution, why MH gives us samples from the posterior distribution, and
applications to Psychological Science.
The italicized step (sometimes you keep it anyway) is the key to generating a
distribution. If the MH algorithm only picked the better number, then the
procedure would converge on a single value rather than a distribution. What’s
worse, that number might not be the best number, and the algorithm could get
stuck for a long time until a better number is found. The sometimes is key to
getting a distribution with both high-density probabilities and lower-density
probabilities (a common feature of probability distributions).
$$r = \frac{p(H_2)p(D|H_2)/p(D)}{p(H_1)p(D|H_1)/p(D)},$$
$r$ is the ratio of the two posterior probabilities $p(H_2|D)$ and $p(H_1|D)$ – $p(D)$ would be the same for both hypotheses, and so cancels out. If we sample all of the candidate values from the same prior distribution – as we will for our examples in this page – $p(H)$ cancels out as well. For example, if each candidate value is sampled randomly from a uniform distribution, then the probability of choosing each value is the same. Thus, it is more common to present the ratio $r$ as:

$$r = \frac{p(D|H_2)}{p(D|H_1)}$$
Sampling parameter values from a prior distribution and evaluating them based
on the product of the (often canceled) prior probability and the likelihood given
the prior divided by the (always canceled) probability of the data ensures that the
MH produces samples from the posterior distribution.
The MH algorithm always accepts the proposed candidate parameter value if the likelihood of the data given the new parameter is greater than or equal to the likelihood of the data given the current parameter – that is, if $r \geq 1$:

1. Accept $H_2$ if $r = \frac{p(D|H_2)}{p(D|H_1)} \geq 1$.
Thus the posterior distribution will be loaded up with high-likelihood values of
the parameter. The sometimes is also not nearly as random as my super-basic
description has made it out to be.
The step of accepting parameters that don't increase the likelihood – at a rate equal to the ratio $r$ – is accomplished by generating a random value from a uniform distribution, which is usually called $u$. Similarly to how we simulated a choice between three doors by generating a random number from the uniform distribution and calling it (for example) "Door #1" if the random number were $\leq 1/3$, if $r > u$, then we accept the proposed candidate value for the parameter:

2. Accept $H_2$ if $r = \frac{p(D|H_2)}{p(D|H_1)} > u$, else accept $H_1$.
If, for example, $r = 0.8$, there is an 80% chance that $H_2$ will be accepted, because there is an 80% chance that the random number $u < 0.8$; if $r = 0.2$, then there is a 20% chance that $H_2$ will be accepted, because there is a 20% chance that a randomly drawn $u$ will be less than 0.2.
In theory, it doesn’t really matter which starting value you use – it can be a
complete guess – but it helps if you have an educated guess because the
algorithm will get to the target distribution faster if the starting parameter is
closer to the middle of what the target distribution will be. For a parameter in the
[0, 1] range, 0.5 is a reasonable place to start.
2. Generate a proposed parameter value from the prior distribution.
3. Calculate the ratio r between the likelihood of the observed data given the
proposed value and the likelihood of the observed data given the current
value.
This part can be trickier than it appears. For the MH algorithm to generate the actual target distribution, it must use a good likelihood function (formally, this requirement is often stated as: the likelihood function must be proportional to the target distribution). In a case where the data are binomial, the likelihood function is known (it's the binomial likelihood function $\frac{N!}{s!\,f!}\pi^s(1-\pi)^f$). For
more complex models, the likelihood function can be derived in much the same
way the binomial likelihood function is derived. Other investigations can make
assumptions about the likelihood function: if a normal distribution is expected to
be the target distribution, for example, then the likelihood function is the density
function of the normal distribution, $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$. Once the likelihood function is decided upon, $r$ is just a simple ratio of the likelihood given the proposed value to the likelihood given the current value.
4. Generate a random value $u$ from a uniform distribution and accept the proposed value if $r \geq 1$ or if $r > u$; otherwise keep the current value.
5. The accepted parameter becomes the current parameter. Store the accepted parameter.
How many iterations are used is somewhat a matter of preference, but there are a
couple of considerations:
9.5.1.2.1 Model complexity
A model with a single parameter will generally require fewer samples for the
MH algorithm to converge to the target distribution than models with two or
more. Whereas 1,000 iterations might be enough to generate a smooth target
distribution for a single parameter, 1,000,000 or more might be required in
situations where two or more parameters have to be estimated simultaneously.
As noted above, the starting parameter value or values for the MH algorithm are
guesses (educated or otherwise). The algorithm is dumb – it doesn’t know where
it’s going, and if you happen to start it far away from the center of the target
distribution, there may be some meaningless wandering at the beginning. This
problem is more pronounced when there are multiple parameters. Thus, the first
few iterations are often referred to as the burn-in period and are thrown out of
estimates of the posterior distribution. As with the overall number of trials,
there’s no a priori way of knowing how many iterations the burn-in period
should last. For what it’s worth: I learned that the burn-in period should be about
the first 3% of trials (presumably based on my stats teacher’s MCMC
experience). Basically, you want to take out any skew caused by the first bunch of
trials: if you drop them and the mean and sd of the distribution changes, they
probably should be dropped, and if you drop more and the mean and sd stay the
same, then you probably don’t need to drop the additional trials.
9.5.1.2.3 Auto-correlation
Getting stuck in place is a concern for MH models when there are multiple parameters involved: when a combination of multiple parameters needs to increase the posterior likelihood to have a high chance of being accepted, there can be long periods of continuing to accept the current parameters. The concern there is that some parameter values could inappropriately be overrepresented by virtue of the search algorithm getting stuck. That concern is easily alleviated by taking every nth iteration (say, every 100th value of the parameters), but doing so might require more overall iterations and more computing time.6
The analytic solution – the posterior distribution given a uniform prior – has a mean of $\frac{s+1}{s+f+2} \approx 0.77$ and a standard deviation of approximately 0.088. In the MH run below, accepted values of the parameter (which is called theta in the code) accumulate near the center of the target distribution, but occasionally less likely values are accepted.
set.seed(77)
s<-16
N<-20
theta<-c(0.5, rep(NA, 999))
for (i in 2:1000){ #Number of MCMC Samples
theta2<-runif(1) #Sample theta from a uniform distribution
ltheta2<-dbinom(s, N, theta2) ## p(D|H) for new theta
ltheta1<-dbinom(s, N, theta[i-1]) #p(D|H) for old theta
r<-ltheta2/ltheta1 #calculate likelihood ratio
u<-runif(1) #sample a uniform value for the acceptance step
if (r>=1){
theta[i]<-theta2 #always accept a proposal that is at least as likely
} else if (r>u){
theta[i]<-theta2 #sometimes accept a less-likely proposal
} else {
theta[i]<-theta[i-1] #otherwise keep the current value
}
}
The mean of the posterior distribution generated by the MH algorithm is 0.78 and
the standard deviation is 0.086 – very close to the analytic solution from the
posterior β distribution that has a mean of 0.77 and a standard deviation of
0.088.
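Those analytic values come from the posterior Beta distribution: with a uniform prior and s = 16 successes in N = 20 trials, the posterior is (as I understand the setup) Beta(s + 1, N − s + 1), and its mean and standard deviation can be checked directly:
a<-16 + 1       ## successes + 1
b<-20 - 16 + 1  ## failures + 1
a/(a + b)                           ## posterior mean, about 0.77
sqrt(a*b/((a + b)^2*(a + b + 1)))   ## posterior standard deviation, a little under 0.09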
Now that we have some test cases to indicate that the algorithm works and how it
works, let’s review a couple of useful applications for estimating posterior
distributions on parameters: multinomial models and regression.
Multinomial models are naturally described using probability trees. Figure 9.6 is
an example of a relatively basic generic multinomial model.
Figure 9.6: Generic Tree Representation of a Multinomial Model (adapted from Batchelder & Riefer, 1999)
In this model, for the behaviors noted in Cell 1 to occur, two events need to happen: one with probability $\theta_a$ and another with probability $\theta_b$. Thus, the probability of a Cell 1 behavior is $\theta_a\theta_b$.
The blessing and the curse of multinomial models is that they are flexible enough
to handle a wide range of theories of latent factors. If there are, say, four possible
processes that could occur in the mind at a given stage of the process, one can
construct a multinomial model with four branches coming off of the same node. If
different paths can lead to the same behavior (for example, remembering a fact
can help you on a multiple choice test but guessing can get you there – less
reliably – too), one can construct a multinomial model where multiple paths end
in the same outcome. A multinomial model can take on all kinds of forms based
on the theory of the latent processes that give rise to observable phenomena like
behaviors and physiological changes: that’s the blessing. The curse is that we
have to come up with likelihood functions that fit the models, which can be a
chore.
For the model in Figure 9.6, the likelihood function is structurally similar to the
binomial likelihood function: we take the probabilities involved in getting to
each end of the paths in the model and multiply them by the number of ways we
can go down those paths. That is, the likelihood function is the product of a
combinatorial term and a kernel probability term.
Let's say we have an experiment with $N$ trials, where $n_1$, $n_2$, $n_3$, and $n_4$ are the observed frequencies in each of the four cells. The combinatorial term is:

$$\frac{N!}{n_1!\,n_2!\,n_3!\,n_4!}$$

and the kernel probability term is the product of the probabilities – $\theta_a$, $\theta_b$, $1 - \theta_a$, and $1 - \theta_b$ – raised to the power of the cells in which they play a part in getting to:

$$\theta_a^{n_1+n_2}\,\theta_b^{n_1+n_3}\,(1-\theta_a)^{n_3+n_4}\,(1-\theta_b)^{n_2+n_4}$$

Thus, our entire likelihood function for the model, where cells are represented by the letter $\phi_i$, is:

$$L(\theta_a, \theta_b) = p(D|H) = \frac{N!}{n_1!\,n_2!\,n_3!\,n_4!}\,\theta_a^{n_1+n_2}\,\theta_b^{n_1+n_3}\,(1-\theta_a)^{n_3+n_4}\,(1-\theta_b)^{n_2+n_4}$$

Now, let's take some observed data:
Observation n
Cell 1 28
Cell 2 12
Cell 3 42
Cell 4 18
And let's estimate posterior distributions for $\theta_a$ and $\theta_b$ using the MH algorithm. Note that in this algorithm, we have two starting/current parameter values and two proposed parameter values, corresponding to $\theta_a$ and $\theta_b$. At each iteration,
we plug values for both parameters into the likelihood function and either accept
both proposals or keep both current values. This tends to slow things down a bit,
but is more flexible than algorithms that change one parameter at a time.9 Given
that we have two parameters to estimate, for this analysis we will use 1,000,000
iterations.
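The loop below calls a function named likely() whose definition doesn't appear in this copy; a minimal version that implements the likelihood function above directly might look like the following (for a long chain you would typically work with log-likelihoods for numerical stability, but this form mirrors the equation):
likely<-function(n1, n2, n3, n4, theta.a, theta.b){
  N<-n1 + n2 + n3 + n4
  ## combinatorial term: ways to arrange the N observations across the four cells
  combinatorial<-factorial(N)/(factorial(n1)*factorial(n2)*factorial(n3)*factorial(n4))
  ## kernel probability term from the likelihood function above
  kernel<-theta.a^(n1 + n2)*theta.b^(n1 + n3)*(1 - theta.a)^(n3 + n4)*(1 - theta.b)^(n2 + n4)
  combinatorial*kernel
}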
iterations<-1000000
n1<-28
n2<-12
n3<-42
n4<-18
theta.a<-c(0.5, rep(NA, iterations-1))
theta.b<-c(0.5, rep(NA, iterations-1))
for (i in 2:iterations){
theta.a2<-runif(1)
theta.b2<-runif(1)
r<-likely(n1, n2, n3, n4, theta.a2, theta.b2)/likely(n1, n2, n3, n4, theta.a[i-1], theta.b[i-1])
u<-runif(1)
if (r>=1){
theta.a[i]<-theta.a2
theta.b[i]<-theta.b2
} else if (r>u){
theta.a[i]<-theta.a2
theta.b[i]<-theta.b2
} else {
theta.a[i]<-theta.a[i-1]
theta.b[i]<-theta.b[i-1]
}
}
Assuming a burn-in period of 3,000 iterations, here are the estimated posterior distributions of $\theta_a$ and $\theta_b$ given the observed data:

We can plug the posterior estimates of $\theta_a$ and $\theta_b$ back into the multinomial model to see if they result in predicted data similar to our observed data:

Cell     Observed n    Model probability              Predicted n
Cell 1   28            $\theta_a\theta_b$             27.997
Cell 2   12            $\theta_a(1-\theta_b)$         12.201
Cell 3   42            $(1-\theta_a)\theta_b$         41.651
Cell 4   18            $(1-\theta_a)(1-\theta_b)$     18.151
Bayesian regression using MCMC is complex in theory, but easily handled with
software. Using the BAS package, for example, we can use many of the same
commands that we use for classical regression to obtain Bayesian models.
First, let’s run a classical simple linear regression with the base R lm()
commands. Instead of the full model summary() command, we will limit our
output by using just the coef() command to get the model statistics and the
confint() command to get interval estimates for the model statistics.
coef(lm(y~x))
## (Intercept) x
## 0.2589419 3.8024346
confint(lm(y~x))
## 2.5 % 97.5 %
## (Intercept) -1.042028 1.559912
## x 3.127986 4.476884
Next, we will use the Bayesian counterpart bas.lm() from the BAS package,
with 500,000 MCMC iterations11:
library(BAS)
set.seed(77)
confint(coef(bas.lm(y~x,
MCMC.iterations = 500000)))
The results are very similar! The Bayesian interval estimate for the coefficient of
x (that is, the slope) is slightly narrower than the classical confidence interval,
but for the Bayesian interval, you’re allowed to say “there’s a 95% chance that
the parameter is between 3.1 and 4.4.” The intercept is slightly different, too, but
that has little to do with the differences between Bayesian and classical models
and much more to do with the fact that the BAS package uses an approach called
centering which forces the intercept to be the mean of the y variable – it’s a
choice that is also available in classical models and one that isn’t going to mean
a whole lot to us right now.
And that’s basically it. There are lots of different options one can apply
regarding different prior probabilities and Bayesian Model Averaging (which, as
the name implies, gives averaged results that are weighted by the posterior
probabilities of different models, but is only relevant for multiple regression),
but for basic applications, modern software makes regression a really easy entry
point to applying Bayesian principles to your data analysis.
1. At this point, I feel compelled to admit that I’m not very good at naming
variables.↩
5. Recall from Probability Theory that the Random Walk is also known as the
Drunkard's Walk: none of the steps that have gotten somebody who is
blackout drunk to where they are predict the next step they will take.↩
6. Your experiences may vary with regard to how much additional computing
time may be incurred by increasing the number of MCMC iterations. If a lot
of MCMC investigations are being run, the time can add up. In the case of a
single analysis of a dataset, it’s probably more like an excuse to get up from
your computer and have a coffee break while the program is running. ↩
7. A more complicated memory model might posit that there is also a chance
of partial storage, as in the phenomenon when you can remember the first
letter or sound of somebody’s name.↩
10. Not to get too into what I meant to be a brief illustration, but it might be
confusing that we want to maximize a function based on the errors when the
goal of regression modeling is to minimize the errors generally. The link
there is that the normal probability density is highest for values near the
mean of the distribution; and if the mean of the errors is 0, then minimizing
the errors maximizes the closeness of the errors to the mean – and thus the
peak – of the normal density.↩
11. Without getting too far into the mechanics of the BAS package, it uses a
subset of MH-generated samples to estimate statistics and intervals.↩
10 Assumptions of Parametric Tests
10.1 Probability Distributions and Parametric Tests
Imagine you’ve just done an experiment involving two separate groups of
participants – one group is assigned to a control condition and the other is
assigned to an experimental condition – and these are the observed data:
##
## Two Sample t-test
##
## data: Experimental and Control
## t = 3.6995, df = 58, p-value = 0.000482
## alternative hypothesis: true difference in means is not equal
to 0
## 95 percent confidence interval:
## 0.4801216 1.6122688
## sample estimates:
## mean of x mean of y
## 2.296369 1.250174
The t-test says that there is a significant difference between the Experimental and
Control groups: t(df = 58) = 3.7, p < 0.01. That’s a pretty cool result, and you
may feel free to call it a day after observing it.
Using statistical software to calculate statistics like t and p is a little like riding a
bicycle in that it takes some skill and it will get you to your destination fairly
efficiently, but doing so successfully requires some assumptions. For the bicycle,
it is assumed that there is air in the tires and the chain is on the gears and the seat
is present: if those assumptions are not met, you still could get there, but the trip
would be less than ideal. For the t-test – and for other classical parametric tests –
the assumptions are about the source(s) of the data.
The t-test evaluates sample means using the t-distribution. It is based largely on the Central Limit Theorem, which tells us that sample means are distributed as normals when n is sufficiently large…

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{N}\right)$$
but it doesn’t tell us how large is sufficiently large. n = 1, 000, 000 would
probably do it. The old rule of thumb is that n = 30 does it, but that is garbage.1
The only way to guarantee that the sample means – regardless of sample size –
will be distributed normally is to sample from a normal distribution. We can prove
that rather intuitively: imagine you had sample sizes of n = 1. The sample means
for an n of 1 from a normal distribution would naturally be a normal distribution:
it would be just like taking a normal distribution apart one piece at a time and
rebuilding it (see Figure 10.2).
Figure 10.2: Normal Parent Distribution and Sampling Distribution with n = 1
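A quick way to convince yourself that the n = 30 rule is no guarantee is to simulate a sampling distribution from a skewed parent distribution; the exponential parent and the number of repetitions here are arbitrary choices of mine:
set.seed(77)
sample.means<-replicate(5000, mean(rexp(30, rate=1)))  ## 5,000 sample means, each from n = 30 draws of a skewed distribution
hist(sample.means)           ## roughly bell-shaped, but still right-skewed
shapiro.test(sample.means)   ## with this many simulated means, the leftover skew is typically enough to reject normality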
The other feature that the use of the t-distribution to evaluate means depends on is
the variance: as noted in Probability Distributions, the t-distribution models the
distribution of sample means standardized by the sampling error. The standard
deviation of a t distribution is the standard error, and, like the immortals in
Highlander…
Figure 10.3: This reference is almost too old even for me.
Thus, we assume that the data in the groups were drawn from populations with
equal variances. This is known as either the homoscedasticity assumption or the
homogeneity of variance assumption: take your pick (I prefer homoscedasticity
because it’s more fun to say). Another way to view the importance of
homoscedasticity is to consider what we would make of the results shown in
Figure 10.1 if we knew that they came from the distributions shown in Figure 10.4:
Figure 10.4: Populations with Means of 1.25 and 2.23, Respectively, and Very
Different Variances
Were the populations to vary as in Figure 10.4, we might not be so sure about the
difference between groups: it would appear that the range of possibilities in
sampling from the Experimental population comfortably includes the entire
Control population and that the difference is really of variances rather than means
(which is not what we want to test with a t-test).
These assumptions undergird the successful use of the t-test. Like the bicycle –
remember the bicycle analogy? – we can go ahead and use the t-test (or whatever
other parametric test we are using) assuming that everything is structurally sound.
If it gets us where we’re going: then violation of our assumptions probably didn’t
hurt us too much. But, the responsible thing to do before riding your bicycle is to at
least check your tire pressure, and the responsible thing to do before running
classical parametric tests is to check your assumptions.
If your data are numbers, they are scale data. End of test.
The classical parametric tests assume that in experimental studies with different
conditions, individual observations – be they observations of human
participants, animal subjects, tissue samples, etc. – arise from random assignment
to each of those experimental conditions. The key to random assignment is that
individuals in a study should be on average the same with regard to the dependent
variables.
10.2.3 Normality
The normality assumption is that the data are sampled from a normal
distribution. This does not mean that the observed data themselves are normally
distributed!
That makes testing the normality assumption theoretically tricky. While data that
come from a normal distribution are not – and I can’t emphasize this enough –
necessarily themselves normally distributed, we test the normality assumption by
seeing whether the observed data could plausibly have come from a normal
distribution. We do so with a classical hypothesis test, with the null hypothesis
being that the data were sampled from a normal distribution, and the alternative
hypothesis being that the data were not sampled from a normal distribution. So, it’s
actually more accurate to say that we are testing the cumulative likelihood that the
observed data or more extremely un-normal data were sampled from a normal
distribution.
If the result is significant, then we have evidence that the data violate the normality
assumption, in which case, we have options as to how to proceed.
Tests of normality are all based around what would be expected if the data were
sampled from a normal distribution. The differences between different tests arise
from the different sets of expectations they use. For example, on this page there are
three tests of the normality assumption: the $\chi^2$ goodness-of-fit test, the Kolmogorov-Smirnov test, and the Shapiro-Wilk test. The $\chi^2$ test is based on how many data points fall into different places based on what we would expect from
a sample from a normal. The Kolmogorov-Smirnov test is based on how the
observed cumulative distribution compares to a normal cumulative distribution.
The Shapiro-Wilk test is based on correlations between observed data and data
that would be expected from a normal distribution with the same dimensions.
But the general idea of all these tests is the same: they compare the observed data
to what would be expected if they were sampled from a normal: if those things are
close, then we continue to assume our null hypothesis that the normality
assumption is met, and if those things are wildly different, then we reject that null
hypothesis.
10.2.3.1.1 The $\chi^2$ Goodness-of-fit Test
The $\chi^2$ goodness-of-fit test, as noted above, assesses where observed data points fall relative to each other. Let's unpack that concept. One thing we know about the
normal distribution is that 50% of the distribution is less than the mean and 50% of
the distribution is greater than the mean:
… we would expect roughly a quarter of our data set to fall into each quartile of a
normal distribution:
Figure 10.7: Four Parts of a Standard Normal Distribution Overlaid with a
Histogram of 10,000 Samples from a Standard Normal Distribution
If we instead took 10,000 samples from a $\chi^2$ distribution (just because it's known to be skewed – the fact that it's in the $\chi^2$ test section is just a nice coincidence), we would see that the samples do not line up with the normal distribution with the same mean and standard deviation:
Figure 10.8: Four Parts of a Normal Distribution with μ = 1 and σ = 2 Overlaid with a Histogram of 10,000 Samples from a $\chi^2$ Distribution with df = 1 (μ = 1, σ = 2)
Thus, the $\chi^2$ goodness-of-fit test assesses the difference between how many values are observed in and how many values are expected in each of several parts of a reference distribution. We divide the reference distribution – in this case, the normal distribution with a mean equal to the observed sample mean and a standard deviation equal to the observed sample standard deviation – into k equal parts, divided by quantiles (if k = 4, then the quantiles are quartiles, if k = 10, then the quantiles are deciles, etc.). As an example, consider the made-up data:
1.6 3.7 5.1 6.7
made-up data
2.1 4.5 5.6 7.3
2.2 4.6 5.7 7.8
2.3 4.8 6.3 8.1
2.8 4.9 6.6 9.0
The mean of the made-up observed data is 5.08 and the standard deviation of the made-up observed data is 2.16. If we decide that k = 4 for our χ² test, then we want to find the quartiles for a normal distribution with a mean of 5.08 and a standard deviation of 2.16.
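Those quartile boundaries can be found directly with qnorm() – a quick sketch using the sample mean and standard deviation from above:
qnorm(c(0.25, 0.50, 0.75), mean = 5.08, sd = 2.16)
## [1] 3.623102 5.080000 6.536898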
For k groups – which we usually call cells – we expect about n/k observed values in each cell. The χ² test is based on the difference between the observed frequency f_o in each cell and the expected frequency f_e in each cell. The statistic known as the observed chi-squared statistic (χ²_obs) is the sum across cells of the ratio of the square of the difference between f_o and f_e to f_e – the squared deviation from the expectation, scaled by the expectation. The larger the difference between the observation and the expectation, the larger the χ²_obs.
For the observed data, 20/4 = 5 values are expected in each quartile as defined by the normal distribution: 5 are expected to be less than 3.623102, 5 are expected to be between 3.623102 and 5.080000, 5 are expected to be between 5.080000 and 6.536898, and 5 are expected to be greater than 6.536898. The observed values are:
5 5 4 6
$$\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(5-5)^2}{5} + \frac{(5-5)^2}{5} + \frac{(4-5)^2}{5} + \frac{(6-5)^2}{5} = 0.4$$
The χ² test uses as its p-value the cumulative likelihood of χ²_obs given the df of the χ² distribution. The df is given by k − 1.⁴ Thus, we find the cumulative likelihood of χ²_obs – the area under the χ² curve to the right of χ²_obs – given df = k − 1.
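In R, that upper-tail probability can be computed directly – a minimal sketch using the χ²_obs of 0.4 and the df of 3 from above:
pchisq(0.4, df = 3, lower.tail = FALSE)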
## [1] 0.9402425
There is one point that hasn't been addressed – why k = 4, or why any particular k? There are two things to keep in mind. The first is that a general rule for the χ² test is that the expected frequency f_e in each cell should be at least 5: when expected frequencies are smaller than 5, the size of the observed χ² statistics can be overestimated. The second is that the larger the value of k – i.e., the more cells in the χ² test – the greater the power of the χ² test and the better the ability to accurately identify samples that were not drawn from normal distributions. Thus, the guidance for the size of k is to have few enough cells so that f_e ≥ 5, but otherwise as many cells as possible.
The Kolmogorov-Smirnov test compares the empirical cumulative distribution of the observed data to the cumulative distribution of a normal distribution. To get the empirical cumulative distribution for a dataset, we order the data from smallest to largest. At the smallest value in the ordered dataset, the empirical cumulative distribution is 1/n; at the second-smallest value it is 2/n; at the third-smallest value it is 3/n; and so on, until the largest value in the data, where the empirical cumulative distribution reaches n/n = 1. For example, here is the empirical cumulative curve of the sample data:
Figure 10.10: Empirical Cumulative Curve of the Made-up Data
data<-c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)
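The Kolmogorov-Smirnov test itself can be run with the ks.test() command – a minimal sketch, comparing the data to a normal distribution with the sample mean and standard deviation:
ks.test(data, "pnorm", mean(data), sd(data))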
As with the χ² test, the p-value is large by any standard, and thus we continue to assume that the data were sampled from a normal distribution.
The Shapiro-Wilk test is based on correlations between the observed data and datapoints that would be expected from a normal distribution with a mean equal to the mean of the observed data and a standard deviation equal to the standard deviation of the observed data.⁶ It is purpose-made for testing normality, and its implementation in R is the simplest of the three tests:
data<-c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)
shapiro.test(data)
##
## Shapiro-Wilk normality test
##
## data: data
## W = 0.96466, p-value = 0.6406
The Shapiro-Wilk test becomes tricky, though, when there are multiple groups with different means. For example, consider three simulated groups of data with very different means:
set.seed(77)
group1<-rnorm(100, 1, 1)
group2<-rnorm(100, 11, 1)
group3<-rnorm(100, 22, 1)
If we run the Shapiro-Wilk test on all of the pooled observed data, we reject the null hypothesis: the pooled data violate the normality assumption.
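Pooling the three groups into a single vector and testing that vector – a sketch of the likely call, given that the output below refers to an object named data:
data<-c(group1, group2, group3)
shapiro.test(data)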
##
## Shapiro-Wilk normality test
##
## data: data
## W = 0.86785, p-value = 2.349e-15
The solution to this problem is to not assess the observed data, but rather to assess
the residuals of the data. The residuals of any group of data are the differences
between each value in the group minus the group mean (x − x̄). Because the mean
is the balance point of the data, the mean of the residuals of any data set will be
equal to 0. When we calculate and then put together the residuals of the 3 groups
depicted in Figure 10.11:
then we have a single group of data with a mean of 0, as depicted in Figure 10.12.
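Calculating those residuals is just a matter of subtracting each group's mean – a minimal sketch, with the object name chosen to match the test output below:
residuals<-c(group1-mean(group1), group2-mean(group2), group3-mean(group3))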
And running a goodness-of-fit test on the residuals, we find that the assumption of
normality holds:
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99431, p-value = 0.3251
It should be noted that we need to run normality tests on the residuals – rather than
the observed data – when we are assessing the normality assumption before
running parametric tests with multiple groups. We don’t have to calculate
residuals when testing the normality assumption for a single data set but we could
without changing the results. Subtracting the mean of a dataset from each value in
that set only changes the mean of the dataset from whatever the mean is to zero
(unless the mean of the dataset is already exactly zero, in which rare case it would
remain the same). Changing the mean of a normal distribution does not change the
shape of that normal distribution, it just moves it up or down the x-axis. When we
change the mean of an observed dataset, we also change the mean of the comparison normal distribution for the purposes of testing normality, and both the shape of the data and the shape of the comparison distribution remain the same. So,
for a single set of data, since the residuals will have the same shape as the
observed data, we can run tests of normality on either the residuals or the
observed data and get the same result – it’s only for situations with multiple
groups of data that it becomes an issue.
Of the three goodness-of-fit tests, the χ² test is the most intuitive (admittedly a relative achievement). It is also extremely flexible: it can be used not just to assess goodness-of-fit to a normal distribution, but to any distribution of data, so long as expected values can be produced to compare with observed values. It is also the least powerful of the three tests when it comes to testing normality – regardless of how many cells are used – and as such is the most likely to miss samples that violate the assumption that the data are drawn from a normal distribution.
The Kolmogorov-Smirnov test, like the χ² test, is not limited to use with the normal distribution. It can be used for any distribution for which a cumulative theoretical distribution can be produced. It is also more powerful than the χ² test and therefore less likely to miss violations of normality. Put those facts together and the Kolmogorov-Smirnov test is the most generally useful goodness-of-fit test of the three.
The Shapiro-Wilk test is the most powerful of the three and the easiest to
implement in software. The only drawback to the Shapiro-Wilk test is that it is
specifically and exclusively designed to test normality: it is useless if you want to
assess the fit of data to any other distribution. Still, if you want to test normality,
and you have a computer to do it with, the Shapiro-Wilk test is the way to go.
A Q-Q (quantile-quantile) plot compares the quantiles of the observed data to the quantiles of a theoretical distribution: if the observed data come from that distribution, the points fall along the diagonal line, indicating that the quantiles are perfectly aligned. Departures from the diagonal can provide information about poor fits to the distribution being studied (the normal, for one, but a Q-Q plot can be drawn for any distribution one wants to model): systematic departures – like all of the points being above or below the diagonal for a certain region, or curvature in the plot – can give information about how the observed data depart from the selected model and/or clues to what probability distributions might better model the observed data.
In the case of the sample data, though, the observed data are well-modeled by a
normal distribution: there is good goodness-of-fit, and we continue to assume that
the data are sampled from a normal distribution.
data<-c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)
qqnorm(data)
qqline(data, col="#9452ff")
Figure 10.13: Quantile-Quantile Plot of the Goodness-of-Fit Between the Made-up
Data and a Normal Distribution
Parametric tests involving more than one group (including the independent-groups t-test and ANOVA) assume that the data in those groups are all sampled from distributions with equal variance. This assumption implies in an
experimental context that even though group means might be different in different
conditions, everybody was roughly equivalent before the experiment started; in a
comparison of two different populations, we assume that each population has
equal variance (but not necessarily equal means).
Hartley's F_max test is based on a comparison of sample variances, specifically: the ratio of the largest variance to the smallest variance.⁷ If
all observed sample variances are precisely equal, then all evidence would point
to homoscedasticity, and the ratio of the largest variance to the smallest variance
(or technically, of any pair of variances if they are all identical) is 1. Of course, it
is extraordinarily unlikely that any two sample variances would be exactly
identical – even if two samples came from the same distribution, they would
probably have at least slightly different variances.
Hartley's test examines the departure from a variance ratio of 1, with greater allowances for departures given combinations of small sample sizes and larger numbers of groups. The test statistic is the observed F_max, and it is compared to a critical F_max value given the df of the smallest group and the number of groups.⁸
For example, assume the following data from two groups are observed:
Placebo Drug
3.5 3.4
3.9 3.6
4.0 4.3
4.0 4.5
4.7 4.8
4.9 4.8
4.9 4.9
4.9 5.0
5.1 5.1
5.1 5.2
5.3 5.2
5.4 5.4
5.4 5.5
5.6 5.5
5.6 5.6
5.7 5.7
6.0 5.7
6.0 5.7
6.6 6.3
7.8 8.1
The variance of the placebo group is 0.98, and the df is n − 1 = 19. The variance of the drug group is 0.96, and the df is also n − 1 = 19. The observed F_max statistic – the ratio of the larger variance to the smaller – is:
$$F_{max(obs)} = \frac{0.98}{0.96} = 1.02$$
library(SuppDists)
qmaxFratio(0.05, 19, 2, lower.tail=FALSE)
## [1] 2.526451
The critical F_max is the largest value of F_max we would expect by chance given the df of the smallest group, the desired α level, and the number of groups k. Since the critical F_max for df = 19, k = 2, and α = 0.05 is 2.526451, and the observed F_max value – 1.02 – is less than the critical F_max, we continue to assume that the data were sampled from populations with equal variance.
One limitation of Hartley's test is that it assumes the data are sampled from normal distributions. If we're testing normality anyway, that's not a big deal! But, if you want to test homoscedasticity without being tied to assuming normality as well, might I recommend…
Levene’s test is based on the same concept as the Brown-Forsythe test, but can use
deviations either from the median or the mean of each group. The leveneTest()
command from the car (Companion to Applied Regression) package, by default,
uses the median. The bf.test() command from the onewaytests package is
related to the Brown-Forsythe test of homoscedasticity, but returns misleading
results when group medians differ. Therefore, use the leveneTest() command to
test homoscedasticity!
The Levene test command leveneTest() accepts data arranged in a long data
frame format. In a wide data frame, the data from different groups are arranged in
different columns; in a long data frame, group membership for each data point in a
column is indicated by a grouping variable in a different column. For example, the
following pair of tables show the same data in wide format (left) and in long
format (right):
To put the two-group example data into long format, and to run the Brown-Forsythe test on the variances of those data, takes just a few lines of R code:
library(car)
##
## Attaching package: 'car'
Placebo<-c(3.5, 3.9, 4.0, 4.0, 4.7, 4.9, 4.9, 4.9, 5.1, 5.1, 5.3, 5.4, 5.4, 5.6, 5.6, 5.7, 6.0, 6.0, 6.6, 7.8)
Drug<-c(3.4, 3.6, 4.3, 4.5, 4.8, 4.8, 4.9, 5.0, 5.1, 5.2, 5.2, 5.4, 5.5, 5.5, 5.6, 5.7, 5.7, 5.7, 6.3, 8.1)
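A minimal sketch of building the long data frame and running the test – the data-frame and column names here are my own choices, not ones given in the text:
homosced.data<-data.frame(scores=c(Placebo, Drug),
                          group=factor(rep(c("Placebo", "Drug"), each=20)))
leveneTest(scores~group, data=homosced.data)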
So what do we do if the data do violate one of the assumptions? We have a few options. First, we can collect more data. Although the Central Limit Theorem does not support prescribing any specific sample size to alleviate our assumption woes, it does imply that sample means based on larger sample sizes are more likely to be normally distributed, so the normality assumption becomes less important.
Third, and I think this is the best option, is to use a nonparametric test. We’ll talk
about lots of those.
Finally, as noted above: parametric tests tend to be robust against violations of the assumptions. A type-II error tends to be more likely than a type-I error, so if the result is significant, there's probably no harm done. The danger is more in not observing an effect that you would have observed were the data more in line with the classical assumptions.
1. I had learned that the n = 30 guideline came from the work of Gosset and
Fisher – and printed that in a book – but after re-researching there’s not a lot
of hard evidence that either Gosset or Fisher explicitly told anybody to
design experiments with a minimum n of 30. The only source I can find that
cited a possible source for the origin of the claim linked to either Gosset or
(noted dick) Fisher is the Stats with Cats blog.↩
5. Just in case you encounter the term Lilliefors test, the Lilliefors test isn’t
quite the same as the Kolmogorov-Smirnov test, but close enough that the
tests are often considered interchangeable.↩
8. The df in this case is the n per sample minus 1. When assessing a sample variance, the df is based on how the sample mean (which is part of the sample variance formula $s^2 = \frac{\sum(x - \bar{x})^2}{n - 1}$) is calculated. When determining cell membership as for the χ² test, if we know how many of n values are in k − 1 cells, we then know how many values are in the kth cell; when calculating a sample variance, if we know the sample mean and n − 1 of the values, then the last value is determined as well.↩
9. If you think it's funny that Cox and Box collaborated on papers, you're not alone: George Box and David Cox thought the same thing and that's why they decided to work together.↩
11 Differences Between Two Things
11.1 Classical Parametric Tests of the Differences Between Two
Things: t-tests
In the simplest terms I can think of, the t-test helps us analyze the difference between two things
that are measured with numbers. There are three main types of t-test –
1. The One-sample t-test: differences between a sample mean and a single numeric value
2. The Repeated-measures t-test: differences between two measurements of the same (or
similar) entities
3. The Independent-samples t-test: differences between the means of two separate groups of measurements
Each type answers the same basic question: what is the difference between the two things, and how does that difference compare to the difference we would expect from sampling error alone?
The first part of that question – what is the difference – is relatively straightforward. When we
are comparing two things, the most natural question to ask is how different they are. The
numeric difference between two things is half of the t-test.
The other half – the sampling error – is a little trickier. Please think briefly about a point made on the page on categorizing and summarizing information: if we could measure the entire population, we would hardly need statistical testing at all. Because that is – for all
intents and purposes – impossible, we instead base analyses on subsets of the population –
samples – and contextualize differences based on a combination of the variation in the
measurements and the size of the samples.
That combination of variation in measurements and sample size is captured in the sampling
error or, equivalently, the standard error (the terms are interchangeable). The standard error is
the standard deviation of sample means – it’s how we expect the means of our samples drawn
from the same population-level distribution to differ from each other – more on that to come
below. The t-statistic – regardless of which of the three types of t-test is being applied to the
data – is the ratio of the difference between two things and the expected differences between
them:
$$t = \frac{\text{a difference}}{\text{sampling error}}$$
The distinctions between the three main types of t-test come down to how we calculate the difference and how we calculate the sampling error.
The difference in the numerator of the t formula is always a matter of subtraction. To
understand the sampling error, we must go to the central limit theorem, first introduced in the
page on probability distributions.
The central limit theorem (CLT) describes the distribution of means taken from a distribution
(any distribution, although we will be focusing on normal distributions with regard to t-tests).
It tells us that sample means (x̄) are distributed as (∼) a normal distribution N with a mean equal to the mean of the distribution from which they were sampled (μ) and a variance equal to the variance of the population distribution (σ²) divided by the number of observations in each sample n:
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
Figure ?? illustrates an example of the CLT in action: it’s a histogram of the means of one
million samples of n = 49 each taken from a standard normal distribution. The mean of the
sample means is approximately 0 – matching the mean of the standard normal distribution. The
standard deviation of the sample means is approximately 1/7 – the ratio of the standard
deviation of the standard normal distribution (1) and the square root of the size of each sample
(√49 = 7).
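That histogram is easy to recreate – a minimal sketch of the simulation described above (object names are my own):
set.seed(1)
sample.means<-replicate(1e6, mean(rnorm(49)))   # one million means of samples of n = 49
mean(sample.means)   # approximately 0
sd(sample.means)     # approximately 1/7 ≈ 0.143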
As annotated in Figure ??, the majority of sample means are going to fall within 1 standard
error of the mean of the sample means. Relatively few sample means are going to fall 3 or
more standard errors from the mean. To help describe what happens when sample means are
drawn, please consider the two following examples that describe the process of using the
smallest possible sample size – 1 – and the process of using the largest possible sample size –
the size of the entire population, represented by approximately infinity – respectively.
If n = 1, then the distribution of the sample means – which, since n = 1, wouldn't really be means so much as they would be the values of the samples themselves – would have a mean equal to μ and a variance of σ²/1 = σ²: it would simply be the population distribution itself. For a normal distribution, the probability of sampling a value that is more than 1.645 standard deviations greater than the mean is approximately 5%:
pnorm(1.645, lower.tail=FALSE)
## [1] 0.04998491
the probability of sampling a value that is more than 1.645 standard deviations less than the
mean is also approximately 5%:
pnorm(-1.645, lower.tail=TRUE)
## [1] 0.04998491
and that the probability of sampling a value that is either more than 1.96 standard deviations greater than the mean or more than 1.96 standard deviations less than the mean is also approximately 5%:
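A sketch of the corresponding call, summing the two tails:
pnorm(-1.96, lower.tail=TRUE) + pnorm(1.96, lower.tail=FALSE)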
## [1] 0.04999579
In classical statistical tests, the null hypothesis implies that the sample being studied is an
ordinary sample from a given distribution defined by certain parameters. The alternative
hypothesis is that the sample being studied was taken from different distribution, such as a
distribution of the same type but defined by different parameters (e.g., if the null hypothesis
implies sampling from a standard normal distribution, the alternative hypothesis may be that the
sample comes from another normal distribution with a different mean). We reject the null
hypothesis if the cumulative likelihood of observing the observed data given that the null
hypothesis is true is extraordinarily small. So, if we knew the mean and variance of a
population and happened to observe a single value (or the mean of a single value, which is the
value itself) that was several standard deviations away from the mean of the population that we
assume under the null that we are sampling from, then we may conclude that we should reject
the null hypothesis that that observation came from the null-hypothesis distribution in favor of
the alternative hypothesis that the observation came from some other distribution.
Figure 11.1: Buddy (Will Ferrell) learning to reject the null hypothesis that he is sampled from
a population of elves is a major plot point in the 2003 holiday classic Elf
If n = ∞, then the distribution of the sample means would have a mean equal to μ and a variance of σ²/∞ = 0.¹ That means that if you could sample the entire population and take
the mean, the mean you take would always be exactly the population mean. Thus, there would
be no possible difference between the mean of the sample you take and the population mean –
because the mean you calculated would necessarily be the population mean – and thus there
would be no sampling error. In statistical terms, the end result would be a sampling error or
standard error of 0.
With a standard error of 0, each mean we calculate would be expected to be the population
mean. If we sampled an entire population and we did not calculate the expected population
mean, then one of two things are true: either the expectation about the mean is wrong, or, more
interestingly, we have sampled a different population than the one we expected.
For sample sizes between those extremes, the central limit theorem tells us that the distribution of the sample means – of which the mean of our observed data is assumed to be one in the null hypothesis – has a mean equal to the population mean (μ_x̄ = μ), a variance equal to the population variance divided by the size of the sample (σ²_x̄ = σ²/n), and therefore a standard deviation equal to the population standard deviation divided by the square root of the size of the sample (σ_x̄ = √(σ²/n) = σ/√n, which is the standard error).
The t-statistic (or just t, if you're into the whole brevity thing) measures how unusual the observed sample mean is relative to the hypothesized population mean. It does so by measuring how far away the observed sample mean is from the hypothesized population mean in terms of standard errors (see the illustration in Figure ??).
The t-distribution models the distances of sample means from the hypothesized population
mean in terms of standard errors as a function of the degrees of freedom (df ) used in the
calculation of those sample means (i.e., n − 1). Smaller (in terms of n) samples from normal
distributions are naturally more likely to be far away from the mean – there is greater sampling
error for smaller samples – and the t-distribution reflects that by being more kurtotic when df
is small. That is, relative to t-distributions with large df , the tails of t-distributions for small
df are thicker and thus there are greater areas under the curve associated with extreme values
of t (see Figure ??, which is reproduced from the probability distributions page).
Of course, it is possible to observe any mean value for a finite sample² taken from a normal distribution. But, at some point, an observed sample mean can be so far from the mean and
therefore the probability of observing a sample mean at least that far away is so unlikely that
we reject the null hypothesis that the observed sample mean came from the population
described in the null hypothesis. We reject the null hypothesis in favor of the alternative
hypothesis that the sample came from some other population. We use the t-statistic to help us
make that decision.
The observed t that will help us make our decision is given by the formula:
$$t_{obs} = \frac{\bar{x} - \mu}{se}$$
where x̄ is the observed sample mean, μ is the population mean posited in the null hypothesis, and se is the standard error of the sample mean s_obs/√n, which approximates the standard deviation of all sample means of size n taken from the hypothesized normal distribution σ/√n (because we don't really know what σ is in the population).
As noted above, μ is the mean of a hypothetical population, but in practice it can be any number of interest. For example, if we were interested in whether the mean of a sample were significantly not equal to zero (> 0, < 0, or ≠ 0), we could put some variation of μ = 0 (≤ 0, ≥ 0, or = 0) in the null hypothesis to simulate what would happen if the population mean were 0, even if we aren't really thinking about 0 in terms of a population mean.
For our examples, let’s pretend that we are makers of large kitchen appliances. Let’s start by
making freezers. First, we will have to learn how freezers work and how to build them.
Looks easy enough! Now we need to test our freezers. Let’s say we have built 10 freezers and
we need to know that our sample of freezers produces temperatures that are significantly less
than 0 °C.³ Here, in degrees Celsius, are our observed data:
Freezer   Temperature (°C)
1 -2.14
2 -0.80
3 -2.75
4 -2.58
5 -2.26
6 -2.46
7 -1.33
8 -2.85
9 -0.93
10 -2.01
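Entered into R (a quick sketch; the object name matches the output shown below, and the values are as given in the table):
one.sample.data<-c(-2.14, -0.80, -2.75, -2.58, -2.26, -2.46, -1.33, -2.85, -0.93, -2.01)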
Now we will use the six-step hypothesis testing procedure to test the scientific hypothesis that
the mean of the internal temperatures of the freezers we built is significantly less than 0.
Oh, sorry, Nick Miller. First we must do the responsible thing and check the assumptions of the t-test. Since we have only one set of data (meaning we don't have to worry about homoscedasticity), the only assumption we need to check is normality:
shapiro.test(one.sample.data)
##
## Shapiro-Wilk normality test
##
## data: one.sample.data
## W = 0.89341, p-value = 0.1852
For this freezer-temperature-measuring experiment, we are going to start by assuming that the
mean of our observed temperatures is sampled from a normal distribution with a mean of 0. We
don’t hypothesize a variance in this step: for now, the population variance is unknown (it will
be estimated by the sample variance in the process of doing the t-test calculations in step 5 of
the procedure).
Now, it will not do us any good – as freezer-makers – if the mean internal temperature of our freezers is greater than 0 °C. In this case, a point (or two-tailed) hypothesis will not do, because that would compel us to reject the null hypothesis if the mean were significantly less than or greater than 0 °C. Instead we will use a directional (or one-tailed) hypothesis, where our null hypothesis is that we have sampled our mean freezer temperatures from a normal distribution of freezer temperatures with a population mean that is greater than or equal to 0 and our alternative hypothesis is that we have sampled our mean freezer temperatures from a normal distribution of freezer temperatures with a population mean that is less than 0:
H0 : μ ≥ 0
H1 : μ < 0
Since this is the one-sample t-test section of the page, let’s go with “one-sample t-test.”
4. Identify a rule for deciding between the null and alternative hypotheses
If the observed t indicates that the cumulative likelihood of the observed t or more extreme unobserved t values is less than the type-I error rate (that is, if p ≤ α), we will reject H₀ in favor of H₁.
We can determine whether p ≤ α in two different ways. First, we can directly compare p – which here is the area under the t curve for t ≤ t_obs, because the null hypothesis is one-tailed in the "less than" direction – to α. We can accomplish that fairly easily using software and that is what we will do (sorry for the spoiler).
The second way we can determine whether p ≤ α is by using critical values of t. A critical
value table such as this one lists the values of t for which p is exactly α given the df and
whether the test is one-tailed or two-tailed. Any value of t with an absolute value greater than
the t listed in the table for the given α, df , and type of test necessarily has an associated p-
value that is less than α. The critical-value method is helpful if you (a) are in a stats class that
doesn’t let you use R on quizzes and tests, (b) are stranded on a desert island with nothing but
old stats books and for some reason need to conduct t-tests, and/or (c) live in 1955.
The mean of the observed sample data is -2.01, and the standard deviation is 0.74. We
incorporate the assumption that the observed standard deviation is our best guess for the
population standard deviation in the equation to find the standard error:
$$se = \frac{sd_{obs}}{\sqrt{n}}$$
although, honestly, if you don’t care much for the theoretical underpinning of the t formula, it
suffices to say the se is the sd divided by the square root of n.
Our null hypothesis indicated that the μ of the population was 0, so that's what goes in the numerator of the t formula. Please note that it makes no difference to the calculation at this point whether H₀ was stated as μ ≥ 0 or μ = 0: the value we plug in for μ is 0 either way.
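A quick sketch of the arithmetic, using the freezer data entered above:
se<-sd(one.sample.data)/sqrt(10)
t.obs<-(mean(one.sample.data) - 0)/se
t.obs   # approximately -8.59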
6. Make a decision
Now that we have an observed t, we must evaluate the cumulative likelihood of observing at least that t in the direction(s) indicated by the type of test (one-tailed or two-tailed). Because the alternative hypothesis was that the population from which the sample was drawn had a mean less than 0, the relevant p-value is the cumulative likelihood of observing t_obs or a more extreme (more negative) t. Because n = 10 (there were 10 freezers for which we measured the internal temperature), df = 9. We thus are looking for the lower-tail cumulative probability of t ≤ t_obs given that df = 9:
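A sketch of the corresponding lower-tail calculation, using the rounded t_obs of −8.59:
pt(-8.59, df = 9)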
## [1] 6.241159e-06
That is a tiny p-value! It is much smaller than the α rate that we stipulated (α = 0.05). We reject H₀.
OR: the critical t for α = 0.05 and df = 9 for a one-tailed test is 1.833. The absolute value of t_obs is |−8.59| = 8.59, which is greater than the critical t. We reject H₀.
OR: we could have skipped all of this and just used R from the jump:
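A sketch of that call (mu = 0 is the default, but it's included here for clarity):
t.test(one.sample.data, mu = 0, alternative = "less")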
##
## One Sample t-test
##
## data: one.sample.data
## t = -8.5851, df = 9, p-value = 6.27e-06
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf -1.580833
## sample estimates:
## mean of x
## -2.010017
but then we wouldn’t have learned as much.
The repeated-measures t-test is used when we have two measurements of the same things and
we want to see if the mean of the differences for each thing is statistically significant.
Mathematically, the repeated-measures t-test is the exact same thing as the one-sample t-test!
The only special thing about it is that we get the sample of data by subtracting, for each
observation, one measure from the other measure to get difference scores.
To generate another example, let’s go back to our imaginary careers as makers of large kitchen
appliances. This time, let's make an oven. To test our oven, we will make 10 cakes, measuring the temperature of each tin of batter in °C before it goes into the oven and then measuring the temperature of each of the (hopefully) baked sponges after 45 minutes in the oven.
Cake   Pre-bake (°C)   Post-bake (°C)   Difference (Post − Pre)
1 20.83 100.87 80.04
2 19.72 98.58 78.86
3 19.64 109.09 89.44
4 20.09 121.83 101.74
5 22.25 122.78 100.53
6 20.83 111.41 90.58
7 21.31 103.96 82.65
8 22.50 121.81 99.31
9 21.17 127.85 106.68
10 19.57 115.17 95.60
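Entered into R (a sketch; the object names match the output shown later, and the values are as given in the table):
prebake<-c(20.83, 19.72, 19.64, 20.09, 22.25, 20.83, 21.31, 22.50, 21.17, 19.57)
postbake<-c(100.87, 98.58, 109.09, 121.83, 122.78, 111.41, 103.96, 121.81, 127.85, 115.17)
difference<-postbake-prebake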
Once we have calculated the difference scores – which for convenience we will abbreviate with d – we have no more use for the original paired data.⁴ The calculations proceed using d precisely as they do for the observed data in the one-sample t-test.
This is as good a time as any to pause and note that what I have been calling the repeated-
measures t-test is often referred to as the paired-samples t-test. Either name is fine! What is
important to note is that the measures in this test do not have to refer to the same individual. A
paired sample could be identical twins. It could be pairs of animals that have been bred to be
exactly the same with regard to some variable of interest (e.g., murine models). It could also
be individuals that are matched on some demographic characteristic like age or level of
education attained. From a statistical methods point of view, paired samples from different
individuals are treated mathematically the same as are paired samples from the same
individuals. Whether individuals are appropriately matched is a research methods issue.
Back to the math: the symbols and abbreviations we use are going to be specific to the
repeated-measures t-test, but the formulas are going to be exactly the same. We calculate the
observed t for the repeated-measures test the same way as we calculated the observed t for the
one-samples test, but with assorted d’s in the formulas to remind us that we’re dealing with
difference scores.
$$t_{obs} = \frac{\bar{d} - \mu_d}{se_d}$$
The null assumption is that we are drawing the sample of difference scores from a population of difference scores with a mean equal to μ_d. The alternative hypothesis is that we are drawing the sample from a population of difference scores with some other mean, in the direction indicated by the scientific hypothesis.
Oh, thanks, Han Solo! We have to test the normality of the differences:
shapiro.test(difference)
##
## Shapiro-Wilk normality test
##
## data: difference
## W = 0.93616, p-value = 0.5111
We can continue to assume that the differences are sampled from a normal distribution.
11.1.3.1.0.1 Six-step Hypothesis Testing
Let's start with the scientific hypothesis that the oven will make the cakes warmer. Thus, we are going to assume a null that the oven makes the cakes no warmer or possibly less warm, with the alternative being that μ_d is any positive non-zero difference in the temperature of the cakes.
H0 : μd ≤ 0
H1 : μd > 0
Repeated-measures t-test.
4. Identify a rule for deciding between the null and alternative hypotheses
If p < α, which as noted for the one-sample t-test we can calculate with software or by consulting a critical-values table but really we're just going to use software, reject H₀ in favor of H₁.
The mean of the observed sample data is 92.54, and the standard deviation is 9.77. As for the
one-sample test, we incorporate the assumption that the observed standard deviation of the
differences is our best guess for the population standard deviation of the differences in the
equation to find the standard error:
$$se_d = \frac{sd_{d(obs)}}{\sqrt{n}}$$
Because the null hypothesis indicates that μ_d ≤ 0, the p-value is the cumulative likelihood of observing t_obs or a larger (more positive) t:
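A sketch of that upper-tail calculation, using a rounded observed t of about 29.97 with df = 9 (the exact t appears in the t.test() output below):
pt(29.97, df = 9, lower.tail = FALSE)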
## [1] 1.252983e-10
The observed p-value is less than α = 0.01, so we reject the null hypothesis in favor of the
alternative hypothesis: the population-level mean difference between pre-bake batter and post-
bake sponge warmth is greater than 0.
Of course, we can save ourselves some time by using R. Please note that when using the
t.test() command with two arrays (as we do with the repeated-measures test and will again
with the independent-samples test), we need to note whether the samples are paired or not
using the paired = TRUE/FALSE option.
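A sketch of the call that produces the output below:
t.test(postbake, prebake, paired = TRUE, alternative = "greater")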
##
## Paired t-test
##
## data: postbake and prebake
## t = 29.966, df = 9, p-value = 1.255e-10
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 86.88161 Inf
## sample estimates:
## mean of the differences
## 92.54283
Based on this analysis, our ovens are in good working order.
But that result only tells us that the mean increase in temperature was significantly greater
than 0. That is probably not good enough, even for imaginary makers of large kitchen
appliances.
Suppose we instead want to know whether our ovens raise the temperature of the cakes by more than 90 °C on average. Changing the null-hypothesis value to μ_d ≤ 90 would mean that we would be testing whether our ovens, on average, raise the temperature of our cakes by more than 90 °C. We'll skip all of the usual steps and just alter our R commands:
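A sketch of the altered call, adding mu = 90:
t.test(postbake, prebake, paired = TRUE, mu = 90, alternative = "greater")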
##
## Paired t-test
##
## data: postbake and prebake
## t = 0.82337, df = 9, p-value = 0.2158
## alternative hypothesis: true difference in means is greater than 90
## 95 percent confidence interval:
## 86.88161 Inf
## sample estimates:
## mean of the differences
## 92.54283
The result is not significant: we can't conclude that the ovens raise cake temperatures by more than 90 °C on average. The broader point is that a null-hypothesis value of 0 – the most common choice, and also the default in R if you leave out the mu= part of the t.test() command – is not always scientifically interesting.
The last type of t-test is the independent-groups (or independent-samples) t-test. The independent-groups test is used when we have two completely different samples. They don't even have to be the same size.⁵
When groups are independent, the assumption is that they are drawn from different populations
that have the same variance. The independent-samples t is a measure of the difference of the
means of the two groups. The distribution that the difference of the means is sampled from is
not one of a single population but of the combination of the two populations.
The rules of linear combinations of random variables tell us that the difference of two normal distributions is a normal with a mean equal to the difference of the means of the two distributions and a variance equal to the sum of the variances of the two distributions:
Figure 11.2: Linear (Subtractive) Combination of two Normal Distributions
And so the difference of the means is hypothesized to come from the combined distribution of
the differences of the means of the two populations.
The variance of the population distributions that each sample comes from are assumed to be the
same: that’s the homescedasticity assumption. When performing the one-sample and repeated-
measures t-tests, we used the sample variances of the single sample and the difference scores,
respectively, to estimate the populations from which those numbers came. But, with two
groups, we will most likely have two sample variances (we could have the exact same sample
variance in both groups, but that would be quite improbable). We can’t say that one sample
comes from a population with a variance approximated by the variance of that sample and
that the other sample comes from a population with a variance approximated by the variance of
the other sample: that would imply that the populations have different variances, and we can’t
have that, now, can we?
Instead, we treat the two sample variances together as estimators of the common variance
value of the two populations by calculating what is known as the pooled variance:
$$s^2_{pooled} = \frac{s^2_1(n_1 - 1) + s^2_2(n_2 - 1)}{n_1 + n_2 - 2}$$
The pooled variance, in practice, acts like a weighted average of the two sample variances: if the samples are of uneven size (n₁ ≠ n₂), by multiplying each sample variance by n − 1, the variance of the larger sample is weighted more heavily than the variance of the smaller sample.
In theory (in practice, too, I guess, but it’s a little harder to see), by multiplying the sample
variances by their respective n − 1, the pooled variance takes the numerator of the sample
variances – the sums of squares (SS) for each observation – adds them together, and creates a
new variance.
$$s^2_{pooled} = \frac{\sum(x_1 - \bar{x}_1)^2 + \sum(x_2 - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
The denominator of the pooled variance is the total degrees of freedom of the estimate: because there are two means involved in the calculation of the numerator – x̄₁ to calculate the first sum of squares and x̄₂ to calculate the second – the total df is the total number of observations minus 2.
Now, our null assumption is that the difference of the means comes from a distribution
generated by the combination of the two distributions from which the two means were
respectively sampled with a hypothesized mean of the difference of means. The variance of that distribution is unknown at the time that the null and alternative hypotheses are determined.⁶
When we do estimate that variance, it will be the pooled variance. And, when we apply the
central limit theorem, we will use a standard error that – based on the rules of linear
combinations – is a sum of the standard errors of each sampling procedure:
$$\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2,\ \frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}\right)$$
And therefore our formula for the observed t for the independent-samples t-test is:
$$t_{obs} = \frac{\bar{x}_1 - \bar{x}_2 - \Delta}{\sqrt{\frac{s^2_{pooled}}{n_1} + \frac{s^2_{pooled}}{n_2}}}$$
Please note: in practice, as with the repeated-measures t-test, researchers rarely use a non-zero value for the mean of the null distribution (μ_d in the case of the repeated-measures test; Δ in the case of the independent-samples test). Still, it's there if we need it.
The df that defines the t-distribution that our difference of sample means comes from is the sum of the degrees of freedom for each sample:
$$df = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$$
Now, let’s think up another example to work through the math. In this case, let’s say we have
expanded our kitchen-appliance-making operation to include small kitchen appliances, and we
have made two models of toaster: the Mark I and the Mark II. Imagine, please, that we want to
test if there is any difference in the time it takes each model to properly toast pieces of bread.
We test 10 toasters of each model (here n₁ = n₂, although that doesn't necessarily have to be true).
We are now ready to test the hypothesis that there is any difference in the mean toasting time
between the two models.
Oh, I almost forgot again! Thanks, Diana Ross! We have to test two assumptions for the
independent-samples t-test: normality and homoscedasticity.
Mark.I.residuals<-Mark.I-mean(Mark.I)
Mark.II.residuals<-Mark.II-mean(Mark.II)
shapiro.test(c(Mark.I.residuals, Mark.II.residuals))
##
## Shapiro-Wilk normality test
##
## data: c(Mark.I.residuals, Mark.II.residuals)
## W = 0.94817, p-value = 0.3402
Good on normality!
We are looking in this case for evidence that either the Mark I toaster or the Mark II toaster is
faster than the other. Logically, that means that we are interested in whether the difference
between the mean toasting times is significantly different from 0. Thus, we are going to assume
a null that indicates no difference between the population mean of Mark I toaster times and the
population mean of Mark II toaster times.
H0 : μ1 − μ2 = 0
H1 : μ1 − μ2 ≠ 0
The type-I error rate we used for the repeated-measures example – α = 0.01 – felt good. Let’s
use that again.
Independent-groups t-test.
4. Identify a rule for deciding between the null and alternative hypotheses
If p < α, which again we can calculate with software or by consulting a critical-values table but there's no need to get tables involved here in the year 2020, reject H₀ in favor of H₁.
The mean of the Mark I sample is 4.98, and the variance is 2.19. The mean of the Mark II
sample is 10.07, and the variance is 3.33.
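A quick sketch of the pooled-variance arithmetic using those rounded values (each group has n = 10):
s2.pooled<-(2.19*(10-1) + 3.33*(10-1))/(10 + 10 - 2)   # = 2.76
se.diff<-sqrt(s2.pooled/10 + s2.pooled/10)             # approximately 0.743
(4.98 - 10.07 - 0)/se.diff                             # approximately -6.85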
Because this is a two-tailed test, the p-value is the sum of the cumulative likelihood of t ≤ −|t_obs| or t ≥ |t_obs| – that is, the sum of the lower-tail probability that a t could be less than or equal to the negative version of t_obs and the upper-tail probability that a t could be greater than or equal to the positive version of t_obs:
## [1] 2.033668e-06
And all that matches what we could have done much more quickly and easily with the
t.test() command. Note in the following code that paired=FALSE – otherwise, R would run
a repeated-measures t-test – and that we have included the option var.equal=TRUE, which
indicates that homoscedasticity is assumed. Assuming homoscedasticity is not the default
option in R: more on that below.
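A sketch of that call:
t.test(Mark.I, Mark.II, paired = FALSE, var.equal = TRUE)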
##
## Two Sample t-test
##
## data: Mark.I and Mark.II
## t = -6.8552, df = 18, p-value = 2.053e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.654528 -3.532506
## sample estimates:
## mean of x mean of y
## 4.979966 10.073482
11.1.4.2 Welch’s t-test
The default option for the t.test() command when paired=FALSE is indicated is
var.equal=FALSE. That means that without instructing the software to assume
homoscedasticity, the default is to assume different variances. This default test is known as
Welch’s t-test. Welch’s test differs from the traditional, homoscedasticity-assuming t-test in
two ways:
1. The pooled variance is replaced by separate population variance estimates based on the
sample variances. The denominator for Welch’s t is therefore:
$$\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}$$
2. The degrees of freedom of the t-distribution are adjusted to compensate for the differences in variance. The degrees of freedom for the Welch's test are not n₁ + n₂ − 2, but rather:
$$df \approx \frac{\left(\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}\right)^2}{\frac{s^4_1}{n^2_1(n_1 - 1)} + \frac{s^4_2}{n^2_2(n_2 - 1)}}$$
Really, what you need to know there is that Welch's test uses a different df than the traditional independent-samples test.
Repeating, then, the analysis of the toasters with var.equal=TRUE removed, the result of the
Welch test is:
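A sketch of the call, this time leaving var.equal at its default of FALSE:
t.test(Mark.I, Mark.II, paired = FALSE)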
##
## Welch Two Sample t-test
##
## data: Mark.I and Mark.II
## t = -6.8552, df = 17.27, p-value = 2.566e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.659274 -3.527760
## sample estimates:
## mean of x mean of y
## 4.979966 10.073482
Note that for these data, where homoscedasticity was observed, the t_obs is the same. The df is a non-integer value, but close to the value of df = 18 for the traditional independent-samples test, and the p-value is essentially the same.
The advantage of Welch’s test is that it accounts for possible violations of homoscedasticity. I
don’t really see a downside except that you may have to explain it a tiny bit in a write-up of
your results.
11.1.4.2.1 Notes on the t-test
1. Although I chose one or the other for each of the above examples, any type of t-test can be
either a one-tailed test or a two-tailed test. The direction of the hypothesis depends only
on the nature of the question one is trying to answer, not on the structure of the data.
2. Some advice for one-tailed tests: always be sure to keep track of your signs. For the one-
tailed test, that is relatively easy: keep in mind whether you are testing whether the sample
mean is supposed to be greater than the null value or less than the null value. For the
repeated-measures and independent-groups tests, be careful which values you are
subtracting from which. Which measurement you subtract from which in the repeated-
measures test and which mean you subtract from which in the independent-groups test is
arbitrary. However, it can be shocking if, for example, you expect scores to increase from
one measurement to another and they appear to decrease, but only because you subtracted
what was supposed to be bigger from what was supposed to be smaller (or vice versa).
3. Related to note 2: assuming that the proper subtractions have been made, if you have a
directional hypothesis and the result is in the wrong direction, it cannot be statistically
significant. If, for example, the null hypothesis is μ ≤ 0 and the t value is negative, then
one cannot reject the null hypothesis no matter how big the magnitude of t. An
experiment testing a new drug with a directional hypothesis that the drug will make things
better is not successful if the drug makes things waaaaaaaay worse.
Let’s run a linear regression on our toaster data where the predicted variable (y) is toasting
time and the predictor variable (x) is toaster model. Note the t-value on the “modelMark.II”
line:
summary(lm(times~model, data=toaster.long))
##
## Call:
## lm(formula = times ~ model, data = toaster.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7588 -0.9213 -0.3457 1.3621 2.6393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9800 0.5254 9.479 2.02e-08 ***
## modelMark.II 5.0935 0.7430 6.855 2.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.661 on 18 degrees of freedom
## Multiple R-squared: 0.7231, Adjusted R-squared: 0.7077
## F-statistic: 46.99 on 1 and 18 DF, p-value: 2.053e-06
That t-value is the same as the t that we got from the t-test (but with the sign reversed), and the
p-value is the same as well. The t-test is a special case of regression: we’ll come back to that
later.
Nonparametric Tests evaluate the pattern of observed results rather than the numeric
descriptors (e.g., summary statistics like the mean and the standard deviation) of the results.
Where a parametric test might evaluate the cumulative likelihood (given the null hypothesis, of
course) that people in a study improved on average on a measure following treatment, a
nonparametric test might take the same data and evaluate the cumulative likelihood that x out of
n people improved on the same measure following treatment.
Three examples of nonparametric tests were covered in correlation and regression: the
correlation of ranks ρ, the concordance-based correlation τ , and the categorical correlation γ.
In each of those tests, what is evaluated is not the values of paired observations relative to the mean and standard deviation of the variables but rather the relative patterns of the paired observations.
11.1.7.1 The χ² Test of Statistical Independence
Statistical Independence refers to a state where the patterns of the observed data do not
depend on the number of possibilities for the data to be arranged. That means that there is no
relationship between the number of categories that a set of datapoints can be classified into and
the probability that the datapoints will be classified in a particular way. By contrast, statistical
dependence is a state where there is a relationship between the possible data structure and the
observed data structure.
The χ² test of statistical independence takes statistical dependence as its null hypothesis and statistical independence as its alternative hypothesis. It is essentially the same test as the χ² goodness-of-fit test, but used for a wider variety of applied statistical inference (that is, beyond evaluating goodness-of-fit). χ² tests of statistical independence can be categorized by the number of factors being analyzed in a given test: in this section, we will talk about the one-factor case – one-way χ² tests – and the two-factor case – two-way χ² tests.
In the one-way χ² test, statistical independence is determined on the basis of the number of possible options. Imagine, for example, that 100 hungry people open a third-party food-delivery website and see two restaurant options. There are two choices, neither of which is accompanied by any type of description. We would
expect an approximately equal number of the 100 hungry people to pick each option. Given that
there is no compelling reason to choose one over the other, the choice responses of the people
are likely to be statistically dependent on the number of choices. We would expect approximately n/2 = 50 people to order from Restaurant A and approximately n/2 = 50 people to order from Restaurant B: statistical dependence means we can guess based solely on the possible options.
Now, let’s say our 100 hypothetical hungry people open the same third-party food-delivery
website and instead see these options:
Given that the titles of each restaurant are properly descriptive and not ironic, this set of
options would suggest that people’s choices will not depend solely on the number of options.
The choice responses of the 100 Grubhub customers would likely follow a non-random
pattern. Their choices would likely be statistically independent of the number of possible
options.
As in the goodness-of-fit test, the χ² test of statistical independence uses an observed χ² test statistic:
$$\chi^2_{obs} = \sum_{1}^{k} \frac{(f_o - f_e)^2}{f_e}$$
where k is the number of categories or cells that the data can fall into. For the one-way test, the expected frequencies for each cell are the total number of observations divided by the number of cells:
$$f_e = \frac{n}{k}$$
The expected frequencies do not have to be integers! In fact, unless the number of observations
is a perfect multiple of the number of cells, they will not be.
The degrees of freedom (df, sometimes abbreviated with the Greek letter ν [“nu”]) for the one-way χ² test is the number of cells k minus 1:
$$df = k - 1$$
Generally speaking, the degrees of freedom for a set of frequencies (as we have in the data that can be analyzed with the χ² test) are the number of cells whose count can change while maintaining the same marginal frequency. For a one-way, two-cell table of data with A
observations in the first cell and B observations in the second cell, the marginal frequency is
A + B:
thus, we can think of the marginal frequencies as totals on the margins of tables comprised of
cells. If we know A, and the marginal frequency A + B is fixed, then we know B by
subtraction. The observed frequency of A can change, and then (given fixed A + B) we would
know B; B could change, and then we would know A. We cannot freely change both A and B
while keeping A + B constant; thus, there is 1 degree of freedom for the two-cell case.
The df of a χ² distribution determines both the mean (df) and the variance (2df). Thus, the df are all we need to know to calculate the area under the χ² curve at or above the observed χ² statistic. The χ² distribution is a one-tailed distribution and the χ² test is a one-tailed test (the alternative hypothesis of statistical independence is a binary thing – there is no such thing as either negative or positive statistical independence), so that upper-tail probability is all we need to know.
As the χ² test is a classical inferential procedure (albeit one that makes an inference on the pattern of the observed data rather than on parameter values), it follows the usual hypothesis-testing framework. The null hypothesis is always that there is statistical dependence and the alternative hypothesis is always that there is statistical independence. Whatever specific form dependence and independence take, respectively, depends (no pun intended) on the situation being analyzed. The α-rate is set a priori, the test statistic is a χ² value with k − 1 degrees of freedom, and the null hypothesis will be rejected if p ≤ α.
To make the calculations, it is often convenient to keep track of the expected and observed
frequencies in a table resembling this one:
f_e   f_e
f_o   f_o
We then determine whether the cumulative likelihood of the observed χ² value or larger values – larger values being more extreme, since perfect statistical dependence looks like χ² ≈ 0 – is less than or equal to the predetermined α rate.
Let’s work through an example using the fictional choices of 100 hypothetical people between
the made-up restaurants Dan’s Delicious Dishes and Homicidal Harry’s House of Literal
Poison. Suppose 77 people choose to order from Dan’s and 23 people choose to order from
Homicidal Harry’s. Under the null hypothesis of statistical dependence, we would expect an
equal frequency in each cell: 50 for Dan’s and 50 for Homicidal Harry’s:
        Dan’s     Harry’s
f_e     50        50
f_o     77        23
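The observed χ² here is (77 − 50)²/50 + (23 − 50)²/50 = 14.58 + 14.58 = 29.16, and the upper-tail probability for that value with df = 1 can be found with a quick sketch:
pchisq(29.16, df = 1, lower.tail = FALSE)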
## [1] 6.66409e-08
which is smaller than an α of 0.05, or an α of 0.01, or really any α that we would choose.
Thus, we reject the null hypothesis of statistical dependence in favor of the alternative of
statistical independence. In terms of the example: we reject the null hypothesis that people are
equally likely to order from Dan’s Delicious Dishes or from Homicidal Harry’s House of
Literal Poison in favor of the alternative that there is a statistically significant difference in
people’s choices.
chisq.test(c(77, 23))
##
## Chi-squared test for given probabilities
##
## data: c(77, 23)
## X-squared = 29.16, df = 1, p-value = 6.664e-08
Wait a minute … the two-cell one-way χ² test seems an awful lot like a binomial probability problem. What if we treated these data as binomial with s as the frequency of one cell and f as the frequency of the other with a null hypothesis of π = 0.5?
And it is very impressive that you might be thinking that, because the answer is: p(s ≥ s_obs|π = 0.5, N), where s is the greater of the two observed frequencies (or p(s ≤ s_obs|π = 0.5, N) where s is the lesser of the two observed frequencies). In fact, the binomial test is the more accurate of the two methods – the χ² is more of an approximation.
To demonstrate, we can use the data from the above example regarding fine dining choices:
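One way to get that binomial tail probability – the likelihood of 77 or more of 100 people choosing the more popular option if the true probability were 0.5 – is a quick sketch:
pbinom(76, size = 100, prob = 0.5, lower.tail = FALSE)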
## [1] 2.75679e-08
The difference between the p-value from the binomial and from the χ² is about 4 × 10⁻⁸. That's real small!
The χ² test assesses categories of data, but that doesn't necessarily mean that the data have to start out as categories: continuous data can be binned into categories and analyzed the same way. Imagine, for example, a class of 12 students in which 2 scored below 90 on an exam (B's) and 10 scored 90 or above (A's), and we want to know whether that breakdown was significantly different from a 50-50 split between A's and B's. We could do a one-sample t-test based on the null hypothesis that μ = 90. Or, we could use a χ² test where the expected frequencies reflect an equal number of students scoring above and below 90:
        B’s      A’s
f_e     6        6
f_o     2        10
$$\chi^2_{obs} = \frac{(2-6)^2}{6} + \frac{(10-6)^2}{6} = 5.33$$
The p-value – given that df = 1 – would be 0.021, which would be considered significant at
the α = 0.05 level.
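A quick sketch of that calculation in R:
pchisq(5.33, df = 1, lower.tail = FALSE)   # approximately 0.021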
Often in statistical analysis we are interested in examining multiple factors to see if there is a
relationship between things, like exposure and mutation, attitudes and actions, study and recall.
The two-way version of the χ² test of statistical independence analyzes patterns in category membership for two different factors. To illustrate: imagine we have a survey comprising two binary-choice responses: people can answer 0 or 1 to Question 1 and they can answer 2 or 3 to Question 2. We can organize their responses into a 2 × 2 (rows × columns) table known as a contingency table, where the responses to each question – the marginal frequencies of responses – are broken down contingent upon their answers to the other question.⁸
Question 2
2 3 Margins
Question 1 0 A B A + B
Question 1 1 C D C + D
Margins A + C B + D n
Statistical independence in the one-way χ² test was determined as a function of the number of possible options. Statistical independence in the two-way χ² test suggests that the two factors are independent of each other, that is, that the number of possibilities for one factor is unrelated to the categorization of the other factor.
Just as the degrees of freedom for the one-way χ² test were the number of cells in which the frequency was allowed to vary while keeping the margin total the same, the degrees of freedom for the two-way χ² test are the number of cells that are free to vary in frequency while keeping both sets of margins – the margins for each factor – the same. Thus, we have a set of df for the rows of a contingency table and a set of df for the columns of a contingency table, and the total df is the product of the two:
$$df = (\text{rows} - 1)(\text{columns} - 1)$$
Unlike in the one-way test, the expected frequencies across cells in the two-way χ² test do not need to be equal: in most cases, they aren't. It is possible – and not at all uncommon – for the different response levels of each factor to have different frequencies, which the two-way test accounts for. The expected frequencies instead are proportionally equal given the marginal frequencies. In the arrangement in the table above, for example, we do not expect A, B, C, and D to be equal to each other, but we do expect A and B to be proportional to A + C and B + D, respectively; for A and C to be proportional to A + B and C + D, respectively; etc.
Thus, the expected frequency for each cell in a 2-way contingency table is:
For example, let's say that we asked 140 people a 2-question survey: "Is a hot dog a sandwich?" (yes or no) and "How many holes does a straw have?" (1 or 2). The observed responses were:

                        1 hole       2 holes      Margins
Hot dog: yes            40           10           50
Hot dog: no             20           70           90
Margins                 60           80           140

and the expected frequencies – each row margin times each column margin, divided by n = 140 – are:

                        1 hole       2 holes
Hot dog: yes            21.43        28.57
Hot dog: no             38.57        51.43
Note, for example, that the expected number of people to say “no” to the hot dog question and
“1” to the straw question is greater than the expected number of people to say “yes” to the hot
dog question and “1” to the straw question, even though the opposite was observed. That is
because more people overall said "no" to the hot dog question, so proportionally we expect a greater number of responses associated with either answer to the other question among those people who (correctly) said that a hot dog is not a sandwich.
$$\chi^2_{obs}(1) = \frac{(40-21.43)^2}{21.43} + \frac{(10-28.57)^2}{28.57} + \frac{(20-38.57)^2}{38.57} + \frac{(70-51.43)^2}{51.43} = 43.81$$
The associated p-value is:
## [1] 6.530801e-16
which is smaller than any reasonable α-rate, so we reject the null hypothesis that there is no
relationship between people’s responses to the hot dog question and the straw question.
To perform the one-way χ² test in R, we used the command chisq.test() with a vector of values inside the parentheses. The two-way test has an added dimension, so instead of a vector, we enter a matrix:
## [,1] [,2]
## [1,] 40 10
## [2,] 20 70
By default, chisq.test() applies the Yates continuity correction to 2 × 2 tables. Personally, I think you can skip it, although it doesn't seem to do too much harm. We can turn off that default with the option correct=FALSE.
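The call itself isn't printed above; judging from the data line in the output below, it presumably looks like this:

chisq.test(matrix(c(40, 20, 10, 70), nrow = 2), correct = FALSE)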
##
## Pearson's Chi-squared test
##
## data: matrix(c(40, 20, 10, 70), nrow = 2)
## X-squared = 43.815, df = 1, p-value = 3.61e-11
Having had a blast with the one-way χ² test and the time of one's life with the two-way χ² test, one might reasonably wonder about χ² tests with three or more factors. Those exist, but they come with complications:

1. Adding dimensions increases the number of cells exponentially: a three-way test involves x × y × z cells, a four-way test involves x × y × z × q cells, etc. In turn, that means that larger and larger samples are needed to keep the expected frequency in each cell acceptably high.

2. Interpretation gets murky when a multiway test indicates statistical independence of all of the factors but not some subsets of the factors, or vice versa.

So, the official recommendation here is to use factorial analyses other than the χ² test should more than two factors apply to a scientific hypothesis.
As noted in our previous encounter with the χ² statistic, the only requirement for the χ² test is that $f_e \geq 5$. If the structure of the observed data is such that you would use a one-way χ² test except for the problem of $f_e < 5$, we can instead treat the data as binomial, testing the cumulative likelihood of $s \geq s_{obs}$ given that π = 0.5. If the structure indicates a two-way χ² test but the expected frequencies are too small, we can use the Exact Test.
The Exact Test – technically known as Fisher's Exact Test but since he was a historically shitty person I see no problem dropping his name from the title – is an alternative to the χ² test. The Exact Test returns the cumulative likelihood of a given pattern of data given that the marginal totals remain constant. To illustrate how the Exact Test returns probabilities, please consider the following labels for a 2 × 2 contingency table:
                    Factor 2                  Margins
Factor 1            A            B            A + B
Factor 1            C            D            C + D
Margins             A + C        B + D        n
First, let’s consider the number of possible combinations for the row margins. Given that there
are n observations, the number of combinations of observations that could put A + B of those
observations in the top row – and therefore C + D observations in the second row – is given
by:
$$_nC_r = \frac{n!}{(A+B)!\,(n-(A+B))!} = \frac{n!}{(A+B)!\,(C+D)!}$$
Next, let's consider the arrangement of the row observations into column observations. The number of combinations of observations that lead to A observations in the first column is A + C things combined A at a time – $_{A+C}C_A$ – and the number of combinations that lead to B observations in the second column is B + D things combined B at a time – $_{B+D}C_B$. Together, those combinations lead to the observed A (and thus C because A + C is constant) and the observed B (and thus D because B + D is constant), and the formula:

$$_{A+C}C_A \times {}_{B+D}C_B = \frac{(A+C)!\,(B+D)!}{A!\,B!\,C!\,D!}$$

describes the number of possible ways to get the observed arrangement of A, B, C, and D.
Because, as noted above, there are $\frac{n!}{(A+B)!\,(C+D)!}$ total possible arrangements of the data, the probability of the observed arrangement is the number of ways to get the observed arrangement divided by the total possible number of arrangements:

$$p(\text{configuration}) = \frac{\frac{(A+C)!\,(B+D)!}{A!\,B!\,C!\,D!}}{\frac{n!}{(A+B)!\,(C+D)!}} = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{n!\,A!\,B!\,C!\,D!}$$
The p-value for the Exact Test is the cumulative likelihood of all configurations that are as
extreme or more extreme than the observed configuration.9
The extremity of configurations is determined by the magnitude of the difference between each
pair of observations that constitute a marginal frequency: that is: the difference between A and
B , the difference between C and D, the difference between A and C , and the difference
between B and D. The most extreme cases occur when the members of one group are split
entirely into one of the cross-tabulated groups, for example: if A + B = A because all
members of the A + B group are in A and 0 are in B. Less extreme cases occur when the
members of one group are indifferent to the cross-tabulated groups, for example: if
A + B ≈ 2A ≈ 2B because A ≈ B.
Consider the following three examples: two with patterns of data suggesting no relationship
between the two factors and another with a pattern of data suggesting a significant relationship
between the two factors.
In this example, an approximately equal number of responses are given to both questions: 7
people say a hot dog is a sandwich, 8 say it is not; 8 people say a straw has 1 hole, 7 say it has
2. Further, the cross-tabulation of the two answers shows that people who give either answer to one question have no apparent tendency to give a particular answer to the other question: the people who say a hot dog is a sandwich are about even-odds to say a straw has 1 hole or 2;
people who say a hot dog is not a sandwich are exactly even-odds to say a straw has 1 or 2
holes (and vice versa).
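The probability of the observed configuration can be computed directly from the formula above. A minimal sketch (the function name is mine; the cell counts 4, 3, 4, 4 are taken from the matrix shown further below):

p.configuration <- function(a, b, c, d) {
  # probability of one specific 2x2 configuration, given fixed margins
  n <- a + b + c + d
  exp(lfactorial(a + b) + lfactorial(c + d) + lfactorial(a + c) + lfactorial(b + d) -
        lfactorial(n) - lfactorial(a) - lfactorial(b) - lfactorial(c) - lfactorial(d))
}
p.configuration(4, 3, 4, 4)   # about 0.381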
Given that p = 0.381 is greater than any reasonable α-rate, calculating just the probability of the observed pattern is enough to know that the null hypothesis will not be rejected and we will continue to assume no relationship between the responses to the two questions. However, since this is more-or-less a textbook, and because there is no better place than a textbook to do things by the book, we will examine the patterns that are more extreme than the observed data given that the margins remain constant.
Note that the margins in the above table are the same as in the first table: A + B = 7, C + D = 8, A + C = 8, and B + D = 7. However, the cell counts that make up those margins are a little more lopsided: A and D are a bit larger than B and C. The probability of this pattern is:
Finally, the most extreme pattern possible given that the margins are constant is:
(A good sign that you have the most extreme pattern is that there is a 0 in at least one of the
cells.)
The sum of those probabilities is the p-value for a directional (one-tailed) hypothesis that there will be A or more
observations in the A cell. That sort of test is helpful if we have a hypothesis about the odds
ratio associated with observations being in the A cell, but isn’t super-relevant to the kinds of
problems we are investigating here. More pertinent is the two-tailed test (of a point
hypothesis) that there is any relationship between the two factors. To get that, we would repeat
the above procedure the other way (switching values such that the B and C cells get bigger)
and add all of the probabilities.
But, that's a little too much work, even going by the book. Modern technology provides us with an easier solution. Using R, we can arrange the observed data into a matrix:
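The assignment that builds that matrix isn't shown; presumably it is something like:

exact.example.1 <- matrix(c(4, 4, 3, 4), nrow = 2)
exact.example.1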
## [,1] [,2]
## [1,] 4 3
## [2,] 4 4
and then run an exact test with the base command fisher.test(). To check our math from
above, we can use the option alternative = "greater":
fisher.test(exact.example.1, alternative="greater")
##
## Fisher's Exact Test for Count Data
##
## data: exact.example.1
## p-value = 0.5952
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.1602859 Inf
## sample estimates:
## odds ratio
## 1.307924
To test the two-tailed hypothesis, we can either use the option alternative = "two.sided" or – since two.sided is the default – just leave it out:
fisher.test(exact.example.1)
##
## Fisher's Exact Test for Count Data
##
## data: exact.example.1
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1164956 15.9072636
## sample estimates:
## odds ratio
## 1.307924
Either way, the p-value is greater than any α-rate we might want to use, so we continue to
assume the null hypothesis that there is no relationship between the factors, in this case: that
responses on the two questions are independent.
This table represents the most extreme possible pattern of responses: everybody who said a hot dog was a sandwich also said that a straw has one hole, and nobody who said a hot dog was a sandwich said that a straw has two holes. There is no more extreme pattern possible on the other side of the responses: if A equalled 0 and B equalled 7, then C would have to be 7 and D would have to be 1 to keep the margins the same.
Thus, for this pattern, the one-tailed p-value for the observed data and the two-tailed p-value
are both the probability of the observed pattern:
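Again, the matrix assignment isn't shown; presumably:

exact.example.2 <- matrix(c(7, 0, 0, 8), nrow = 2)
exact.example.2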
## [,1] [,2]
## [1,] 7 0
## [2,] 0 8
fisher.test(exact.example.2)
##
## Fisher's Exact Test for Count Data
##
## data: exact.example.2
## p-value = 0.0001554
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 5.83681 Inf
## sample estimates:
## odds ratio
## Inf
Example 3: A Pattern That Looks Extreme But Does Not Represent a Significant
Relationship
Finally, let’s look at a pattern that seems to be as extreme as it possibly can be:
In this observed pattern of responses, there is a big difference between A and C and between
B and D. Crucially, though, there is hardly any difference between A and B or between C and
D. Even though there are no moves we can make that maintain the same marginal numbers, the
pattern here is one where the responses for one question don’t depend at all on the responses
for the other question: in this set of data, people are split on the straw question but nobody
regardless of their answer to the straw question thinks that a hot dog is a sandwich10. The p-
value for the Exact Test accounts for the lack of dependency between the responses:
##
## Fisher's Exact Test for Count Data
##
## data: exact.example.3
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0 Inf
## sample estimates:
## odds ratio
## 0
The core concept of the median test is: are there more values greater than the overall median of the data in one sample than in the other sample? The idea is that if one sample of data has values that tend to be bigger than values in another sample – with the overall median observation as the guide to what is bigger and what is smaller – then there is a difference between the two samples.
The median test arranges the observed data into two categories: less than the median and
greater than the median. Values that are equal to the median can be part of either category –
it’s largely a matter of preference – so those category designations, more precisely, can either
be less than or equal to the median/greater than the median or less than the median/greater
than or equal to the median (the difference will only be noticeable in rare, borderline
situations). When those binary categories are crossed with the membership of values in one of
two samples, the result is a 2 × 2 contingency table: precisely the kind of thing we analyze with a χ² test of statistical independence or – if $f_e$ is too small for the χ² test – the Exact Test.
To illustrate the median test, we will use two examples: one an example of data that do not
differ with respect to the median value between groups and another where the values do differ
between groups with respect to the overall median.
Imagine, if you will, the following data are observed for the dependent variable in an
experiment with a control condition and an experimental condition:
Median = 57

                             Control Condition               Experimental Condition
Below Median (< 57)          2, 12, 18, 23, 31, 35           4, 15, 16, 28, 44, 49
At or Above Median (≥ 57)    57, 63, 63, 66, 66, 69, 75      64, 67, 77, 84, 84, 85, 98, 100, 102
A plot of the above data with annotations for the overall median value (57, in this case) illustrates the similarity of the two groups:
Figure 11.3: Control Group and Experimental Group Data Similar to Each Other With Respect
to the Overall Median of 57
Tallying up how many values are below the median and how many values are greater than or
equal to the median in each group results in the following contingency table:
          Control     Experimental
< 57      6           6
≥ 57      7           9
This arrangement lends itself nicely to a χ² test of statistical independence with df = 1:

$$\chi^2_{obs}(1) = \frac{(7-7.43)^2}{7.43} + \frac{(9-8.57)^2}{8.57} + \frac{(6-5.57)^2}{5.57} + \frac{(6-6.43)^2}{6.43} = 0.1077,\ p = 0.743$$
The results of the χ² test lead to the continued assumption of independence between category (less than / greater than or equal to the median) and condition (control/experimental), which we would interpret as there being no effect of being in the control vs. the experimental group with regard to the dependent variable.
In the second example, there are more observed values in the control condition that are less
than the median than there are values greater than or equal to the median and more observed
values in the experimental condition that are greater than or equal to the median than there are
values less than the median:
Median = 57

                             Control Condition                         Experimental Condition
Below Median (< 57)          2, 4, 12, 15, 16, 18, 23, 28, 31, 35      44, 49
At or Above Median (≥ 57)    66, 69, 75                                57, 63, 63, 64, 66, 67, 77, 84, 84, 85, 98, 100, 102
Counting up how many values in each group are either less than, or greater than or equal to, the overall median gives the following contingency table:

          Control     Experimental
< 57      10          2
≥ 57      3           13
$$\chi^2_{obs}(1) = \frac{(3-7.43)^2}{7.43} + \frac{(13-8.57)^2}{8.57} + \frac{(10-5.57)^2}{5.57} + \frac{(2-6.43)^2}{6.43} = 11.50,\ p < .001$$
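As a check on that hand calculation, here is a sketch of the same test run in R (entering the contingency table as a matrix and turning off the continuity correction):

chisq.test(matrix(c(10, 3, 2, 13), nrow = 2), correct = FALSE)   # X-squared of about 11.5, p < .001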
The null hypothesis for the Wilcoxon-Mann-Whitney U test is that if we make an observation X from population A and an observation Y from population B, it is equally probable that X is greater than Y and that Y is greater than X:

$$H_0: p(X > Y) = p(Y > X)$$

and thus, the alternative hypothesis is that the probability that X is greater than Y is not equal to the probability that Y is greater than X (or vice versa: the designation of X and Y is arbitrary):

$$H_1: p(X > Y) \neq p(Y > X)$$
Given two conditions A and B (in real life they will have real-life names like "control condition" or "experimental condition"):

1. For each value in A, count how many values in B are smaller. Call this count $U_A$.
2. For each value in B, count how many values in A are smaller. Call this count $U_B$.
3. The test statistic U is the smaller of $U_A$ and $U_B$.
4. Test the significance of U using tables for small n, normal approximation for large n.
Note: the Wilcoxon-Mann-Whitney test ignores ties. If there are tied data, the observed p-value
will be imprecise (still usable, just imprecise).
For example, imagine the following data – which suggest negligible differences between the two groups – are observed:

Control Condition        Experimental Condition
12, 22, 38, 42, 50       14, 25, 40, 44, 48
We can arrange the data in overall rank order and mark whether each datapoint comes from the
Control (C) condition or the Experimental (E) condition:
Configuration of Observed Data

C     E     C     E     C     E     C     E     E     C
12    14    22    25    38    40    42    44    48    50
To calculate $U_C$ (for the control condition), we take the total number of values from the experimental condition that are smaller than each value of the control condition. There are 0 experimental-condition values smaller than the smallest value in the control condition (12), 1 that is smaller than the second-smallest value (22), 2 that are smaller than the third-smallest value (38), 3 that are smaller than 42, and 5 that are smaller than 50, for a total of $U_C = 11$.

To calculate $U_E$ (for the experimental condition), we take the total number of values from the control condition that are smaller than each value of the experimental condition. There is 1 control-condition value smaller than the smallest experimental-condition value (14), 2 that are smaller than 25, 3 that are smaller than 40, 4 that are smaller than 44, and also 4 that are smaller than 48, for a total of $U_E = 14$.
We can also use software to calculate both the U statistic and the p-value. (There is also a paired-samples version of the Wilcoxon test called the Wilcoxon Signed Rank Test; as with the t.test() command, we get the paired-samples version with the option paired=TRUE.)
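The call that produced the output below isn't shown; a sketch, with the data entered as vectors, is:

control <- c(12, 22, 38, 42, 50)
experimental <- c(14, 25, 40, 44, 48)
wilcox.test(control, experimental)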
##
## Wilcoxon rank sum exact test
##
## data: control and experimental
## W = 11, p-value = 0.8413
## alternative hypothesis: true location shift is not equal to 0
Oh, and R calls U “W.” Pretty sure that’s for consistency with the output for the Wilcoxon
Signed Rank Test.
The following data provide an example of a pattern that does suggest a difference between
groups:
Control Condition        Experimental Condition
12, 14, 22, 25, 40       38, 42, 44, 48, 50
Configuration of Observed Data

C     C     C     C     E     C     E     E     E     E
12    14    22    25    38    40    42    44    48    50
For these data, there is only one value of the experimental condition that is smaller than any of the control-condition values (38 is smaller than 40), and 24 control-condition values that are smaller than experimental-condition values, so the U statistic is equal to $U_C = 1$:
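As before, the call isn't shown; presumably:

control <- c(12, 14, 22, 25, 40)
experimental <- c(38, 42, 44, 48, 50)
wilcox.test(control, experimental)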
##
## Wilcoxon rank sum exact test
##
## data: control and experimental
## W = 1, p-value = 0.01587
## alternative hypothesis: true location shift is not equal to 0
If either sample has n > 20, then the following formulas can be used to calculate a z-score that can then be used to get a p-value based on areas under the normal curve:

$$\mu_U = \frac{n_1 n_2}{2}$$

$$\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

$$z = \frac{U - \mu_U}{\sigma_U}$$
The randomization test is the most powerful nonparametric test for analyzing two independent
samples of scale data.13 The central idea of the randomization test is: of all the possible
patterns of the data, how unusual is the observed pattern?
For the first example dataset above (the one suggesting negligible differences between groups), there are 126 possible patterns that are as likely or less likely than the observed one.
$$\text{Number of possible permutations} = \frac{n!}{r!\,(n-r)!} = \frac{10!}{5!\,(10-5)!} = 252$$

$$p = \frac{126}{252} = .5,\ n.s.$$
This is the second-least likely possible pattern of the observed data (the least likely would be
if 38 and 40 switched places).
$$p = \frac{2}{252} = .0079$$
To find the p-value for the randomization test, follow this algorithm:
1. Calculate the possible number of patterns in the observed data using the combinatorial
formula14
2. Assign positive signs to one group of the data and negative signs to the other
3. Find the sum of the signed data. Call this sum D.
4. Switch the signs on the data and sum again to find all possible patterns that lead to greater
absolute values of D
5. Take the count of patterns that lead to equal or greater absolute values of D
6. Take the count in step 5, add 1 for the observed pattern, and divide by the possible number of patterns calculated in step 1 to find the p-value
D = −12 − 14 − 22 − 25 − 40 + 38 + 42 + 44 + 48 + 50 = 109
Are there more extreme patterns? Only one: switch the signs of 38 (from the experimental
condition) and 40 (from the control condition)
D = −12 − 14 − 22 − 25 − 38 + 40 + 42 + 44 + 48 + 50 = 113
In this example, there are two possible combinations that are equal to or more extreme than the
observed pattern.
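For completeness, here is a brute-force sketch of this independent-samples randomization test in R (the object names and the enumeration-by-combn() approach are mine):

values <- c(12, 14, 22, 25, 40, 38, 42, 44, 48, 50)
d.obs <- sum(c(38, 42, 44, 48, 50)) - sum(c(12, 14, 22, 25, 40))   # the observed D = 109
groupings <- combn(10, 5)   # all 252 ways to choose which 5 values are "experimental"
diffs <- apply(groupings, 2, function(idx) sum(values[idx]) - sum(values[-idx]))
mean(diffs >= d.obs)             # one-tailed p = 2/252
mean(abs(diffs) >= abs(d.obs))   # two-tailed p = 4/252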
McNemar's test is a repeated-measures test for categorical (or categorized continuous) data. Just as the repeated-measures t-test measures the difference between two measures of the same entities (participants, animal subjects, etc.), McNemar's test measures categorical change between two measurements of the same entities. Given two possible categorical states – for example: healthy and unwell, in favor and opposed, passing and failing – of which each entity can exist in either at the times of the two different measurements, McNemar's test is a way to analyze the differences in the two measures as a function of how much change happened between the two measurements.
To perform the test, the data are arranged thusly (with generic labels for the two possible states and the two measurements):

                           Measure 2
                      State 2       State 1
Measure 1   State 1   A             B
Measure 1   State 2   C             D
In this arrangement, A and D represent changes in the state: A represents the count of entities
that went from State 1 in the first measurement to State 2 in the second, and D represents the
count that went from State 2 in the first measurement to State 1 in the second.
$$\chi^2_{obs}(1) = \frac{(A - D)^2}{A + D}$$
The larger the difference between A and D – normalized by the sum A + D – the bigger the
difference between the two measurements. And, by extension, larger differences between A
and D indicate how important whatever came between the two measurements (a treatment, an
intervention, an exposure, etc.) was.
If A + D is small (the usual guideline is A + D ≤ 20), the p-value can instead be calculated with the binomial:

$$p(s \geq \max(A,D)\,|\,\pi = 0.5,\ N = A + D) = \sum \frac{N!}{A!\,D!}\,\pi^{\max(A,D)}(1-\pi)^{\min(A,D)}$$

To be clear: given that π = 0.5 for the McNemar test and the binomial distribution is therefore symmetrical, the equation:

$$p(s \leq \min(A,D)\,|\,\pi = 0.5,\ N = A + D) = \sum \frac{N!}{A!\,D!}\,\pi^{\min(A,D)}(1-\pi)^{\max(A,D)}$$

gives the same result.
For example, please imagine students in a driver’s education class are given a pre-test on the
rules of right-of-way before taking the class and a post-test on the same content after taking the
class. In the first set of example data, there is little evidence of difference indicated between
the two conditions by the observed data (n = 54):
                     Post-test
                 Fail       Pass
Pre-test Pass    15         8
Pre-test Fail    16         15
In this example, 15 students pass the pre-test and fail the post-test, and another 15 students fail the pre-test and pass the post-test. There are 24 students whose performance remains in the same category for both the pre- and post-test (8 pass both; 16 fail both) – those data provide no evidence either way as to the efficacy of the class.
The McNemar χ² statistic – which we can use instead of the binomial because 15 + 15 > 20 – is:

$$\text{McNemar } \chi^2(1) = \frac{(15 - 15)^2}{15 + 15} = 0$$
Which we don’t need to bother finding the p-value for: it’s not in any way possibly significant.
In the second example data set, there is evidence that the class does something:
                     Post-test
                 Fail       Pass
Pre-test Pass    25         8
Pre-test Fail    16         5
In these data, 25 students pass the pre-test but fail the post-test, and 5 fail the pre-test but pass the post-test. The McNemar χ² statistic for this set of data is:

$$\text{McNemar } \chi^2(1) = \frac{(25 - 5)^2}{25 + 5} = \frac{400}{30} = 13.3$$
Measured in terms of a χ² distribution with 1 degree of freedom – this might have been intuitable from the 2 × 2 structure of the data, but the McNemar is always a df = 1 situation – the observed p-value is:
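The p-value below presumably comes from the χ² distribution function:

pchisq(13.3, df = 1, lower.tail = FALSE)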
## [1] 0.0002654061
and so we would reject the null hypothesis had we set α = 0.05, or α = 0.01, or even α = 0.001. Based on these data, there is a significant effect of the class.
Reviewing these particular data, we could further interpret the results as indicating that the
class makes students significantly worse at knowing the rules of the road. Which in turn
suggests that this driver’s ed class was taught in the Commonwealth of Massachusetts.
Conducting the McNemar test in R is nearly identical to conducting the χ² test in R. The one thing to keep in mind is that the mcnemar.test() command expects the cells to be compared to be on the reverse diagonal:

$$\begin{bmatrix} B & A \\ D & C \end{bmatrix}$$

but it's not much trouble to just switch the columns. Also, you may want to – as with the χ² test – turn off the continuity correction with the option correct=FALSE:
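Judging from the data line in the output below, the call presumably looks like this:

mcnemar.test(matrix(c(8, 5, 25, 16), nrow = 2), correct = FALSE)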
##
## McNemar's Chi-squared test
##
## data: matrix(c(8, 5, 25, 16), nrow = 2)
## McNemar's chi-squared = 13.333, df = 1, p-value = 0.0002607
11.1.9.2 Sign (Binomial) Test
The Sign test, like McNemar’s test, is a measure of categorical change over 2 repeated
measures. In the case of the sign test, the categories are, specifically: positive changes and
negative changes. It takes the sign of the observed changes in the dependent variable – positive
or negative; zeroes are ignored – and treats them as binomial events with π = 0.5. In other
words, the sign test treats whether the difference between conditions is positive or negative as
a coin flip. For that reason, the sign test is often referred to as the binomial test, which is fine,
but I think there are so many things (including several on this page alone) that can be
statistically analyzed with the binomial likelihood function that that name can be a little
confusing.
The p-value for a binomial test is the cumulative binomial probability of the number of positive values or negative values observed given the total number of trials that didn't end up as ties and π = 0.5. If there is a one-tailed hypothesis with an alternative that indicates that more of the changes will be positive, then that cumulative probability is the probability of getting at least as many positive changes out of the non-zero differences; if there is a one-tailed hypothesis with an alternative that indicates that more of the changes will be negative, then the cumulative probability is that of getting at least as many negative changes out of the non-zero differences. If the hypothesis is two-tailed, then the p-value is the sum of the probability of getting the smaller of the number of positive changes and the number of negative changes or fewer plus the probability of getting n minus that number or greater (since the sign test is a binomial with π = 0.5, that's more easily computed by taking the cumulative probability of the smaller of the positive count and the negative count or fewer and multiplying it by two).
The following is an example data set in which positive changes are roughly as frequent as
negative changes: a data set for which we would expect our sign test analysis to come up with
a non-significant result.
Observation    Before    After    Difference    Sign of Difference
2              8         5        -3            −
3              0         2        2             +
4              0         -1       -1            −
5              1         1        0             =
6              3         -5       -8            −
7              3         7        4             +
8              3         362      359           +
9              5         1        -4            −
10             6         0        -6            −
11             5         10       5             +
Let's use a one-tailed test with a null hypothesis that the number of positive changes will be greater than or equal to the number of negative changes (π ≥ 0.5) and an alternative hypothesis that the number of positive changes will be less than the number of negative changes (π < 0.5).
Of the n = 11 observations, one had a difference score of 0. We throw that out. Of the rest,
there are 5 positive differences and 5 negative differences. Therefore, we calculate the
cumulative binomial probability of 5 or fewer successes in 10 trials with π = 0.5:
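That cumulative probability can be computed with pbinom() (a quick sketch):

pbinom(5, size = 10, prob = 0.5)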
## [1] 0.6230469
and find that the changes are not significant. The software solution in R is fairly
straightforward:
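Presumably the call is binom.test() with the number of positive changes, the number of non-zero differences, and a one-tailed alternative:

binom.test(5, 10, p = 0.5, alternative = "less")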
##
## Exact binomial test
##
## data: 5 and 10
## number of successes = 5, number of trials = 10, p-value = 0.623
## alternative hypothesis: true probability of success is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.7775589
## sample estimates:
## probability of success
## 0.5
For a counterexample, let’s use the following example data where there are many more
negative changes than positive changes:
Observation    Before    After    Difference    Sign of Difference
2              8         5        -3            −
3              0         -2       -2            −
4              0         -1       -1            −
5              1         1        0             =
6              3         -5       -8            −
7              3         -7       -10           −
8              3         -362     -365          −
9              5         -1       -6            −
10             6         0        -6            −
11             5         -10      -15           −
In this case, there is still 1 observation with no change between measurements: we’ll toss that
and we are left with 10 observations with changes: 9 negative and 1 positive. This time, we
will use a 2-tailed test. The sign test p-value is therefore (keeping in mind that for an upper-tail
pbinom, we enter s − 1 for s):
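The computation itself isn't shown above; a sketch of it, following the description in the text, is:

pbinom(1, size = 10, prob = 0.5) + pbinom(9 - 1, size = 10, prob = 0.5, lower.tail = FALSE)
# = 22/1024, or about 0.0215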
Using binom.test() with the option alternative="two.sided" gives us the same result:
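The call, presumably, is:

binom.test(1, 10, p = 0.5, alternative = "two.sided")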
##
## Exact binomial test
##
## data: 1 and 10
## number of successes = 1, number of trials = 10, p-value = 0.02148
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.002528579 0.445016117
## sample estimates:
## probability of success
## 0.1
The Wilcoxon Signed-Rank Test, unlike the sign test, takes into account the magnitude of the ranks of the differences. That is, a small negative difference means less in the calculation of the test statistic than a large negative difference, and likewise a small positive difference means less than a large positive difference. Thus, significant values of the Wilcoxon test statistic W come from combinations of large and frequent positive shifts or combinations of large and frequent negative shifts.
To calculate the Wilcoxon W statistic, we begin by taking differences (as we do with all of the tests with two repeated measures). We note the sign of the difference – positive or negative – and, as in the sign test, throw out the ties. We also rank the magnitude of the differences from smallest (1) to largest (n). If there are ties in the differences, each tied score gets the average rank of the untied rank above and the untied rank below the tie cluster. For example, if a set of differences is {10, 20, 20, 30}, 10 is the smallest value and gets rank 1; 30 is the largest value and gets rank 4; and the two 20's get the average rank (1 + 4)/2 = 2.5. Thus, the ranks would be {1, 2.5, 2.5, 4}.
We then combine the ranks and the signs to get the signed-ranks. For example, if the observed
differences were {10, 20, 20, 30}, then the signed ranks would be {1, 2.5, 2.5, 4}. If the
observed differences were {−10, −20, −20, −30}, then the signed ranks would be
{−1, −2.5, −2.5, −4}: the ranks are based on magnitude (or, absolute value), so even though
−30 is the least of the four numbers, −10 is the smallest in terms of magnitude.
Figure 11.4: Magnitude
For the one-tailed test, W is equal to the sum of the positive ranks. For the two-tailed test, the W statistic is the greater of the sum of the magnitudes of the positive ranks and the sum of the magnitudes of the negative ranks. If we denote the sum of the positive ranks as $T^+$ and the sum of the negative ranks as $T^-$, then we can write the W formulas like this:

$$\text{One-tailed: } W = T^+$$

$$\text{Two-tailed: } W = \max(T^+, T^-)$$
To demonstrate the Wilcoxon signed-rank test in action, we can re-use the examples from the
sign test. First, the dataset that indicates no significant effect:
Observation    Before    After    Difference    Sign of Difference    Rank    Signed Rank
2              8         5        3             +                     4.0     +4
3              0         2        -2            −                     2.5     −2.5
4              0         -1       1             +                     1.0     +1
5              1         1        0             =                     0.0     0
6              3         -5       8             +                     9.0     +9
7              3         7        -4            −                     5.5     −5.5
9              5         1        4             +                     5.5     +5.5
10             6         0        6             +                     8.0     +8
11             5         10       -5            −                     7.0     −7
The sum of the positive signed-rank magnitudes is:

$$T^+ = 4 + 1 + 9 + 5.5 + 8 = 27.5$$
If n > 30, there is a normal approximation that will produce a z-score; the area under the normal curve beyond that z gives the approximate p-value for the W statistic:

$$z = \frac{W - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24} - \frac{1}{2}\sum_{j=1}^{g} t_j(t_j - 1)(t_j + 1)}}$$

where g is the number of tie clusters in the ranks and $t_j$ is the number of ties in each cluster j. For our purposes, we can just use the wilcox.test() command with the paired=TRUE option:
##
## Wilcoxon signed rank test
##
## data: Before and After
## V = 27.5, p-value = 1
## alternative hypothesis: true location shift is not equal to 0
Now let’s apply the same procedure to a dataset that indicates evidence of a difference
between the Before and After measures:
Observation    Before    After    Difference    Sign of Difference    Rank    Signed Rank
2              8         5        3             +                     4.0     +4
3              0         -2       2             +                     2.5     +2.5
4              0         -1       1             +                     1.0     +1
5              1         1        0             =                     0.0     0
6              3         -5       8             +                     7.0     +7
7              3         -7       10            +                     8.0     +8
9              5         -1       6             +                     5.5     +5.5
10             6         0        6             +                     5.5     +5.5
11             5         -10      15            +                     9.0     +9
The sum of the negative signed-rank magnitudes (there is only one) is:

$$T^- = 2.5$$

The sum of the positive signed-rank magnitudes is:

$$T^+ = 4 + 2.5 + 1 + 7 + 8 + 10 + 5.5 + 5.5 + 9 = 52.5$$

and thus:

$$\text{One-tailed } W = T^+ = 52.5$$
To test the significance of W , we can just take the easier and more sensible route straight to the
software solution:
##
## Wilcoxon signed rank test with continuity correction
##
## data: Before and After
## V = 52.5, p-value = 0.0124
## alternative hypothesis: true location shift is not equal to 0
The repeated-measures randomization test is also similar in principle to the Wilcoxon Signed-
Rank Test, in that it examines the relative sizes of the negative and the positive changes, but
different in that it deals with the observed magnitudes of the data themselves rather than their
ranks.
Again, we assume that the observed magnitudes of the differences are given, but that the sign on each difference can vary. That means that the number of possible permutations of the data is $2^n$. The randomization test is an analysis of how many patterns are as extreme as or more extreme than the observed pattern; the p-value is the number of such patterns divided by the $2^n$ possible patterns.
Participant<-1:7
Condition1<-c(13,42,9,5,6,8,18)
Condition2<-c(5,36,2,0,9,7,9)
Difference<-Condition1-Condition2
A relatively easy way to judge the extremity of the pattern of the data is to take the sum of the
differences and to compare it to the sums of other possible patterns. For this example, the sum
of the differences, $d_{obs}$, is:

$$\sum d_{obs} = 8 + 6 + 7 + 5 + (-3) + 1 + 9 = 33$$
Any more extreme pattern will have a greater sum of differences. If, in this example,
Participant 5’s difference was 3 instead of −3 and Participant 6’s difference was −1 instead
of 1, the difference would be:
$$\sum d = 8 + 6 + 7 + 5 + 3 + (-1) + 9 = 37$$
which indicates that that hypothetical pattern is more extreme than the observed pattern.
There is one more extreme possible pattern: the one where all of the observed differences are
positive:
∑ d = 8 + 6 + 7 + 5 + 3 + 1 + 9 = 39
Thus, there are three patterns that are either the observed pattern or more extreme patterns. Given that there are 7 non-tied observations in the set, the number of possible patterns is $2^7 = 128$. Therefore our observed p-value is:

$$p_{obs} = \frac{3}{128} = 0.023$$
More specifically, that is the p-value for a one-tailed test wherein we expect most of the differences to be positive in the alternative hypothesis. To get the two-tailed p-value, we simply multiply by two (the idea being that the three most extreme patterns in the other direction are equally extreme). In this case, the two-tailed p-value would be 6/128 = 0.047.
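Here is a brute-force sketch of the repeated-measures randomization test in R (the object names are mine; the differences come from the example above):

Difference <- c(8, 6, 7, 5, -3, 1, 9)
d.obs <- sum(Difference)                                         # 33
signs <- expand.grid(rep(list(c(-1, 1)), length(Difference)))    # all 2^7 sign patterns
sums <- apply(signs, 1, function(s) sum(s * abs(Difference)))
mean(sums >= d.obs)             # one-tailed p = 3/128
mean(abs(sums) >= abs(d.obs))   # two-tailed p = 6/128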
(using the t-distribution). Most commonly, confidence intervals are estimated for means.
Because the t-distribution represents the distribution of sample means (i.e., over repeated
samples, sample means are distributed in the shape of a t-distribution), it is naturally used to
calculate the confidence intervals associated with t-tests: intervals about sample means, mean
differences, and the differences between sample means.
$$(1 - \alpha)\%\ CI = \bar{x} \pm t_{\alpha/2}\,se$$

where $t_{\alpha/2}$ is the value of the t-distribution for the relevant df that puts $\alpha/2$ in the upper tail of the area under the t-curve (and thus $-t_{\alpha/2}$ puts $\alpha/2$ in the lower tail of the curve).
In the above equation, $\bar{x}$ stands for whatever mean value is relevant: for a confidence interval on a mean difference, we substitute $\bar{d}$ for $\bar{x}$; and for a difference between means, we substitute $\bar{x}_1 - \bar{x}_2$ for $\bar{x}$. Likewise, the se can be $se_{\bar{x}}$ for a single group, $se_d$ for a set of difference scores, and $se_p$ to represent the pooled standard error for a difference between independent-group means.

The mean of x is $\bar{x} = 62.6$ and the standard error of x is $sd_x/\sqrt{n} = 5.32/2.24 = 2.38$.
Given that n = 5, df = 4. The t value that puts α/2 = 0.025 in the upper tail of the t-distribution with df = 4 is $t_{\alpha/2} = 2.78$. Thus, the 95% confidence interval on the sample mean is:

$$95\%\ CI = 62.6 \pm (2.78)(2.38)$$
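The critical t value and the interval can be checked in R (a sketch using the rounded summary values above):

qt(0.975, df = 4)                  # about 2.78
62.6 + c(-1, 1) * 2.78 * 2.38      # roughly 55.98 to 69.22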
As is usually the case, finding the confidence interval is easier with software. By default, the
t.test() command in R will return a 95% confidence interval on whatever mean (sample
mean, mean difference, or difference between means) is associated with the t-test. In this case,
the one-sample t-test produces a 95% confidence interval on the sample mean:
t.test(x)
##
## One Sample t-test
##
## data: x
## t = 26.313, df = 4, p-value = 1.24e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 55.99463 69.20537
## sample estimates:
## mean of x
## 62.6
If one is interested in a confidence interval of a different width – say, a 99% confidence
interval – one needs only to specify that with the conf.level option within the t.test()
command:
t.test(x, conf.level=0.99)
##
## One Sample t-test
##
## data: x
## t = 26.313, df = 4, p-value = 1.24e-05
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 51.64651 73.55349
## sample estimates:
## mean of x
## 62.6
Confidence intervals are good for evaluating the relative precision of two estimates. From the
relative width of two 1 − α confidence intervals, we can infer that the narrower interval is
based on data with some combination of lesser variance in the data and greater n.
Confidence intervals can also be used to make inferences. For example, if a 1 − α confidence
interval does not include 0, we can say that the mean described by that confidence interval is
significantly different from 0 at the α level. If two confidence intervals do not overlap, we may
say that the means described by those confidence intervals are significantly different from each
other. This feature results from the similarity between the construction of a 1 − α confidence
interval and a two-tailed t-test with a false alarm rate of α: because the very same t-
distributions are used in both, the confidence interval is functionally equivalent to a two-tailed
t-test.
The use of confidence intervals to make inferences is especially helpful when confidence
intervals are generated for statistics for which the population probability distribution is
unknown. If we have sufficient data and meet all of the required assumptions to construct
confidence intervals in the manner described above using t-statistics, there is probably nothing
stopping us from just performing t-tests on the data without having to rely on confidence
intervals and with what they do or do not overlap. But, if we are dealing with statistics that are
new (and thus untested with regard to their parent distributions) or are generated by repeated-
sampling procedures such as bootstrapping, then we can use those confidence intervals to make
inferences where t-tests may be unavailable to us and/or inappropriate for the data.
The effect size associated with a statistical test is a measure of how big the observed effect – a sample-level association (e.g., for a correlation) or a sample-level difference (e.g., for a t-test) – is. We have already encountered one pair of effect-size statistics in the context of correlation and regression: r (more specifically, the magnitude of r) and R². The product-moment correlation r is the slope of the standardized regression equation: the closer the changes in $z_y$ are to corresponding changes in $z_x$, the stronger the association. The R² statistic represents the proportion of variance associated with the model rather than associated with the error.15 The magnitude of the r statistic was associated with guidelines for whether the correlation was weak (0.1 ≤ r < 0.3), moderate (0.3 ≤ r < 0.5), or strong (r ≥ 0.5): in Figure 11.5, scatterplots of data with weak, moderate, and strong correlations are shown. As described by Cohen (2013)16, weak effects are present but difficult to see, moderate effects are noticeable, and strong effects are obvious.
In the context of t-tests, effect sizes refer to differences. The effect size associated with a t-test is the difference between the observed mean statistic ($\bar{x}$ for a one-sample test, $\bar{d}$ for a repeated-measures test, $\bar{x}_1 - \bar{x}_2$ for an independent-groups test) and the value specified in the null hypothesis. However, because the size of an observed difference can vary widely depending on what is being measured and in what units, the raw effect size can be misleading to the point of uselessness. The much more commonly used measure of effect size for the difference between two things is the standardized effect size: the difference divided by the standard deviation of the data. The standardized effect size is actually so much more commonly used that when scientists speak of effect size for differences between two things, they mean the standardized effect size (the "standardized" is understood).
In frequentist statistics, the standardized effect size goes by the name of Cohen's d. Cohen's d addresses the issue of the importance of sample size to statistical significance by removing n from the t equation, using standard deviations in the denominator instead of standard errors. Otherwise, the calculation of d is closely aligned with the formulae for the t-statistics for each flavor of t-test. Just as each type of t-test is calculated differently but produces a t that can be evaluated using the same t-distribution, the standardized effect size calculations for each t-test produce a d that can be evaluated independently of the type of test.

For the one-sample t-test:

$$d = \frac{\bar{x} - \mu_{H_0}}{sd_x}$$

For the repeated-measures t-test:

$$d = \frac{\bar{d} - \mu_d}{sd_d}$$

For the independent-groups t-test:

$$d = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{sd_p}$$

where the pooled standard deviation $sd_p$ is the square root of the pooled variance.
Note: although d can take on an unbounded range of values, d is usually reported as positive,
so the d we report is more accurately |d|.
Conventionally, d ≈ 0.2 is considered a small effect, d ≈ 0.5 a medium effect, and d ≥ 0.8 a large effect. As Cohen himself warned, though:

"there is a certain risk inherent in offering conventional operational definitions for those terms for use in power analysis in as diverse a field of inquiry as behavioral science"

so please interpret effect-size statistics with caution.
On this page, we have already calculated the values we need to produce d statistics for
examples of a one-sample t-test, a repeated-measures t-test, and an independent-groups t-test:
all we need are the differences and the standard deviations from each set of calculations.
Freezer    Temperature (°C)
1          -2.14
2          -0.80
3          -2.75
4          -2.58
5          -2.26
6          -2.46
7          -1.33
8          -2.85
9          -0.93
10         -2.01
The mean temperature is -2.01 °C, and the standard deviation of the temperatures is 0.74. In this example, the null hypothesis was μ ≥ 0 °C, so the numerator of the d statistic is −2.01 − 0:

$$d = \frac{\bar{x} - \mu}{sd_x} = \frac{-2.01 - 0}{0.74} = -2.71$$
For these data, d = −2.71, which we would report as d = 2.71. Because d ≥ 0.8, these data represent a large effect.
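A quick sketch of the same calculation in R, entering the temperatures from the table above as a vector:

freezer.data <- c(-2.14, -0.80, -2.75, -2.58, -2.26, -2.46, -1.33, -2.85, -0.93, -2.01)
(mean(freezer.data) - 0) / sd(freezer.data)   # about -2.71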
For our repeated-measures t-test example, we used the following data about ovens and cakes:
Cake    Pre-bake (°C)    Post-bake (°C)    Difference (Post − Pre)
1       20.83            100.87            80.04
2       19.72            98.58             78.86
3       19.64            109.09            89.44
4       20.09            121.83            101.74
5       22.25            122.78            100.53
6       20.83            111.41            90.58
7       21.31            103.96            82.65
8       22.50            121.81            99.31
9       21.17            127.85            106.68
10      19.57            115.17            95.60
As when calculating the t-statistic, to calculate d we are only interested in the difference scores. The mean difference score is 92.54 and the standard deviation of the difference scores is 9.77. We previously tested these data with both $H_0: \mu_d \leq 0\ ^{\circ}C$ and $H_0: \mu_d \leq 90\ ^{\circ}C$; the effect size, like t's and p-values, is calculated relative to the value specified in the null hypothesis. Relative to $\mu_d = 0$:

$$d = \frac{\bar{d} - \mu_d}{sd_d} = \frac{92.54 - 0}{9.77} = 9.48$$
Finally, for the independent-groups t-test example, we examined the following data:
The numerator of d for independent groups is the difference between the observed means $\bar{x}_1 - \bar{x}_2$ and the hypothesized population-level difference $\mu_1 - \mu_2$. For these data, the null hypothesis was that $\mu_1 - \mu_2 = 0$, and the observed difference between the means was −5.09. The denominator is the pooled standard deviation, which is the square root of the pooled variance; for these data, the pooled standard deviation is 1.66. Thus:

$$d = \frac{-5.09}{1.66} = -3.07$$
The main difference between a t statistic and its corresponding d statistic is the sample size.
That’s by design: effect size measurement is meant to remove the impact of sample size from
assessing the magnitude of differences. Thus, it is possible to have things like statistically
significant effects with very small (< 0.2) sizes. Another possible outcome of that separation
is that the test statistic and the effect size statistic can be distinct to the point of disagreement: a
small, medium, or even large d value can be computed for data where there is no significant
effect. If there is no significant effect, then there is no effect size to measure: please don’t
report an effect size if the null hypothesis hasn’t been rejected.
Figure 11.6: Love watching you dunk on Littlefinger, but for our purposes that’s not very
helpful, Cersei
There are two types of errors we can make in this framework. The Type-I error, also known as the α error, is a false alarm: it happens when there is no effect – similarity or difference – present in the population-level data but we reject $H_0$ regardless. The Type-II error, also known as the β error, is a miss: it happens when there is an effect – a similarity or a difference – present in the population-level data but we continue to assume $H_0$ regardless.
The rate at which we make the α error is more-or-less the α rate.19 When we set the α rate –
most frequently it is set at α = 0.05, but more stringent rates like α = 0.01 or α = 0.001 are
also used – we are implicitly agreeing to a decision-making algorithm that will lead to false
alarms α% of the time given the assumptions of the statistical test we are using.
For example, if we run a one-sample t-test with df = 20, a two-tailed alternative hypothesis, and α = 0.05, we are saying that our null hypothesis is that our observed sample is drawn from a normal distribution with μ = 0. If the mean of the population from which we are sampling is, in fact, 0, and we were to run the same experiment some large, infinity-approaching number of times, we would expect the resulting t-statistics to be distributed thus:
With α = 0.05 and a two-tailed test, the rejection regions for the t-distribution with df = 20
are defined as anything less than t = −2.09 and anything greater than t = 2.09. If an observed
t is located in either of those regions, $H_0$ will be rejected. However, if the null hypothesis is true, then exactly 5% of t-statistics calculated based on draws of n = 21 (that's df + 1) from
a normal distribution with a mean of 0 will naturally live in those rejection regions, as we can
see by superimposing line segments representing the boundaries of the rejection regions over
the null t-distribution in Figure 11.8:
We can simulate the process of taking a large number of samples from a normal distribution with μ = 0 with the rnorm() command in R. In addition to μ = 0, we can stipulate any value for σ² because it will all cancel out with the calculation of the t-statistic. Let's take 1,000,000 samples of size n = 21 and tabulate where the resulting t-statistics fall relative to the rejection regions:

k <- 1000000               # number of simulated samples
n <- 21                    # sample size (df = 20)
t.statistics <- rep(NA, k)
for (i in 1:k){
  sample <- rnorm(n, mean = 0, sd = 1)
  t <- mean(sample)/(sd(sample)/sqrt(n))
  t.statistics[i] <- t
}
t.table <- cut(t.statistics, c(-Inf, qt(0.025, df = 20), qt(0.975, df = 20), Inf))
table(t.table)
## t.table
## (-Inf,-2.09] (-2.09,2.09] (2.09, Inf]
## 24907 950070 25023
That is all to say that the rates at which α-errors are made are basically fixed by the α-rate.
The rates at which β-errors are made are not fixed, and are only partly associated with α-
rates. In the example illustrated in Figure 11.8, any observed t-statistic that falls between the rejection regions will lead to continued assumption of $H_0$. That would be true even if the sample were drawn from a different distribution than the one specified in the null, that is, if the null were false. That is the β error: when the null is false – as in, the sample comes from a different population than the one specified in the null – but we continue to assume $H_0$ anyway. The complement of the β error – the rate at which the null is false and $H_0$ is rejected in favor of $H_1$ – is therefore 1 − β.
Power is the rate at which the null hypothesis is rejected given that the alternative is true.
To illustrate, please imagine that $H_0$ defines a null distribution with μ = 0 and σ² = 1 but that the observed samples instead come from a different distribution – an alternative distribution – with μ = 2 and σ² = 1. If this were the case, the central limit theorem tells us that the alternative sampling distribution will be a normal distribution with a mean of 2 and a variance of 1/n. We can represent the set of t-statistics we would expect from this distribution with a
of 1/n. We can represent the set of t-statistics we would expect from this distribution with a
regular t-distribution shifted two units to the right: an alternative t-distribution with t̄ = 2
and df = n − 1. Figure 11.9 presents both the null and alternative parent distributions (i.e., the
normals) and the null and alternative sampling distributions (i.e., the t’s).
Figure 11.9: Null and Alternative Distributions of N (top) and t (bottom)
In this particular situation, with $H_0: \mu = 0$ and $H_1: \mu \neq 0$, with df = 20, and the t-values that define the rejection regions being -2.09 and 2.09, the proportion of the alternative t-distribution that lives between the rejection regions is given by pt(2-2.09, df=20) = 0.46. That means that we would expect approximately 46% of the t-statistics generated by sampling from the alternative distribution to lead to continuing to assume $H_0$; thus, we expect to commit the β-error about 46% of the time. We would also expect to correctly reject $H_0$ approximately 54% of the time: the power is therefore 0.54 (power is usually expressed as a decimal, although expressing it as a percentage wouldn't be wrong). See Figure 11.10 for an illustration.
Figure 11.10: Visualization of β and Power for a t-test Where $H_0: \mu = 0$ and the True Population μ = 2
So the power for this particular situation is 0.54: there is a real population-level effect (in this case, a population mean that is not equal to 0) and a 54% chance that $H_0$ will be rejected.
The minimum limit to power, given the stipulated α-rate, is α itself. That minimum would occur if the alternative distribution completely overlapped the null distribution: the only place from which a sample could be drawn that would lead to rejecting $H_0$ would be in the rejection regions.
Figure 11.12: Settle down, Palpatine. Power has a lower limit of α and an upper limit of 1.
Power analyses for t-tests can be conducted by solving simultaneous equations: given a null
distribution with rejection regions defined by α and the effect size d, we can calculate the area
under the alternative distribution curve – the power – as a function of n. If you’d rather not –
and I wouldn’t blame you one bit – there are several software packages that can give
convenient power analyses with a minimum effort. For example, using the pwr package, we can
use the following code to find out how many participants we would expect to need to have
power of 0.8 given an effect size d = 0.5 (for a medium-sized effect) and an α-rate of 0.05 for
a two-tailed hypothesis on a one-sample t-test:
library(pwr)
pwr.t.test(n=NULL, d=0.5, sig.level=0.05, power=0.8, type="one.sample", alternative="two.sided")
##
## One-sample t test power calculation
##
## n = 33.36713
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
Using the same pwr package, we can also plot a power curve indicating the relationship
between n and power by wrapping the pwr.t.test() command in plot():
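A sketch of that, re-using the same power-analysis settings:

plot(pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.8, type = "one.sample", alternative = "two.sided"))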
A related practice is the post hoc power analysis: calculating, after a study has been run, what the power should have been, assuming that an alternative distribution can be estimated from the observed data. That is a bad practice, for at least a couple of reasons. First: it
assumes that the alternative distribution is something that can be estimated rather than
stipulated, putting aside that the point of power analysis is to stipulate various reasonable
conditions for an experiment. Second: citing the probability of an event after an event has
occurred isn’t terribly meaningful – it’s like losing a bet but afterwards saying that you should
have won because you had a 51% chance of winning. So, like reporting an effect size for a
non-significant effect: don’t do it.
In the null-hypothesis testing framework, failing to reject the null hypothesis can happen for one
of two reasons:
1. There is a population-level effect but the scientific algorithm didn't detect it, in other words: a type-II error.

2. There is no population-level effect, and the continued assumption of the null hypothesis is correct.

A major flaw of frequentist null-hypothesis testing is that there is no way of knowing whether reason 1 (a type-II error) or reason 2 (a correct assumption of $H_0$) is behind any particular null result.
Thus, there is perfectly-warranted skepticism regarding null results: statistical analyses that show neither differences nor similarities, which in the null-hypothesis testing framework take the form of failure to reject $H_0$. Such results go unsubmitted and unpublished at rates that we
can’t possibly know precisely because they are neither submitted nor published (this is known
as the file-drawer effect). The problem is that sometimes null results are really important. A
famous example is the finding that the speed of light is unaffected by the rotation of the earth. In
psychological science, null results in cognitive testing between people of different racial,
ethnic, and/or gender identification are both scientifically and socially important; if we only
publish the relative few studies that find small differences (which themselves may be type-I
errors), those findings receive inordinate attention among what should be a sea of contradictory
findings.
In classical statistics, confidence intervals likely provide the best of the available options to support null results – confidence intervals that include 0 imply that 0 will be sampled in the majority of repeated tests. Bayesian methods include more natural and effective ways of
producing support for the null: in the Bayesian framework, results including zero-points can be
treated as would be any other posterior distribution. One can create Bayesian credible
interval estimates that include values like 0 and can calculate Bayes Factors indicating support
for posterior models that include null results. So: point for Bayesians there.
We can do the same thing for mean data. For a relatively (that word is doing a lot of work
here) simple example, we can use the MH algorithm to produce a posterior distribution on the
mean freezer temperature from the one-sample t-test example above. (Note: this code also
calculates the posterior distribution on the variance, but I included that more as a way to let the
variance…vary).
# Setup: the iteration count and starting values below are assumptions (the
# original setup lines are not shown); freezer.data holds the temperatures
# from the freezer table earlier on this page
library(ggplot2)
library(ggthemes)    # for theme_tufte()
freezer.data <- c(-2.14, -0.80, -2.75, -2.58, -2.26, -2.46, -1.33, -2.85, -0.93, -2.01)
iterations <- 100000
means <- rep(NA, iterations)
vars <- rep(NA, iterations)
means[1] <- mean(freezer.data)   # assumed starting value
vars[1] <- var(freezer.data)     # assumed starting value

for (i in 2:iterations){
  mean.prop <- runif(1, -5, 1)
  vars.prop <- runif(1, 0.00001, 2)
  u <- runif(1)
  r <- prod(dnorm(freezer.data, mean.prop, sqrt(vars.prop)))/
    prod(dnorm(freezer.data, means[i-1], sqrt(vars[i-1])))
  if (r >= 1){
    means[i] <- mean.prop
    vars[i] <- vars.prop
  } else if (r > u){
    means[i] <- mean.prop
    vars[i] <- vars.prop
  } else {
    means[i] <- means[i-1]
    vars[i] <- vars[i-1]
  }
}
ggplot(data.frame(means), aes(means))+
geom_histogram(binwidth=0.02)+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)
This posterior distribution indicates that the probability that the mean freezer temperature is less than 0 is 1, and that the 95% highest-density interval of the mean is $p(-2.59 \leq \bar{x} \leq -1.44) = 0.95$.
But, Bayesian analyses requiring more complexity can be a problem, especially if we are
interested in calculating the Bayes factor. When we calculated the Bayes factor comparing two
binomial models that differed only in the specification of the π parameter, the change from the
prior odds to the posterior odds was given by the ratio of two binomial likelihood formulae:
$$B.F. = \frac{p(D|H_1)}{p(D|H_2)} = \frac{\frac{20!}{16!\,4!}(0.8)^{16}(0.2)^{4}}{\frac{20!}{16!\,4!}(0.5)^{16}(0.5)^{4}} = \frac{(0.8)^{16}(0.2)^{4}}{(0.5)^{16}(0.5)^{4}}$$
For more complex models than the binomial – even the normal distribution works out to be a much more complex model than the binomial model – the Bayes factor is calculated by integrating competing models over pre-defined parameter spaces, that is, areas where we might reasonably expect the posterior parameters to be. Given a comparison between a model 1 and a model 0, that looks basically like this:

$$B.F. = \frac{\int p(D|\theta_1, M_1)\,p(\theta_1|M_1)\,d\theta_1}{\int p(D|\theta_0, M_0)\,p(\theta_0|M_0)\,d\theta_0}$$
Several R packages have recently been developed to make that process much, much easier. In
this section, I would like to highlight the BayesFactor package20, which does a phenomenal
job of putting complex Bayesian analyses into familiar and easy-to-use formats.
The key insight that makes the Bayesian t-test both computationally tractable and more familiar to users of the classical version comes from Gönen et al (2005), who formulated t-test data in a way similar to the regression-model form of the t-test. Taking for example an independent-groups t-test, the observed data points y in each group are distributed as normals with a mean determined by the grand mean of all the data μ plus or minus half the (unstandardized) effect size σδ, and a variance equal to the variance of the data:

$$\text{Group 1: } y_{1i} \sim N\!\left(\mu + \frac{\sigma\delta}{2},\ \sigma^2\right)$$

$$\text{Group 2: } y_{2j} \sim N\!\left(\mu - \frac{\sigma\delta}{2},\ \sigma^2\right)$$
This insight means that the null model (not to be confused with a null hypothesis because that's the other kind of stats – this is just a model that says there is no difference between the groups) $M_0$ is one where there is no effect size (δ = 0), and that can be compared to another model $M_1$ where there is an effect size (δ ≠ 0). Moreover, a prior distribution can be stipulated for δ and, based on the data, a posterior distribution can be derived based on that prior distribution and the likelihood of the data given the prior.21
For the prior distribution, the Cauchy distribution – also awesomely known as The Witch of
Agnesi – is considered ideal for its shape and its flexibility. The Cauchy distribution has a
location parameter ($x_0$) and a scale parameter (formally $\gamma$, but the BayesFactor output calls it
$r$ so we'll comply with that), but for our purposes the location parameter will be fixed at 0.
When the scale parameter r = 1, the resulting Cauchy distribution is the t-distribution with
df = 1. The ttestBF() command in the BayesFactor package allows any positive value for
r to be entered in the options: smaller values indicate prior probabilities that favor smaller
effect sizes; larger values indicate prior probabilities favoring larger effect sizes. Figure 11.14
illustrates the Cauchy distribution with the three categorical options for width of the prior
distribution in the ttestBF() command: the default “medium” which corresponds to
r = 1/√ 2 ≈ 0.71, “wide” which corresponds to r = 1, and “ultrawide” which corresponds
to r = √2 ≈ 1.41.
Figure 11.14: Cauchy distributions with scale parameter r = 1/√2 (for Medium Priors), 1 (for
Wide Priors), and √2 (for Ultra-Wide Priors)
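To get a concrete feel for those three settings, here is a minimal R sketch (not from the chapter's own code; the delta values are arbitrary choices for illustration) that evaluates the three Cauchy priors with dcauchy():

# Density of the Cauchy prior on delta under the three ttestBF() width options.
# The location is 0 in every case; only the scale (r) changes.
delta <- c(0, 0.2, 0.5, 1, 2)
medium    <- dcauchy(delta, location = 0, scale = sqrt(2)/2)  # r ~ 0.71
wide      <- dcauchy(delta, location = 0, scale = 1)          # r = 1
ultrawide <- dcauchy(delta, location = 0, scale = sqrt(2))    # r ~ 1.41
round(rbind(medium, wide, ultrawide), 3)

Comparing the rows shows the point made above: smaller scales concentrate prior mass near δ = 0, while larger scales push relatively more mass toward big effects.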
The main output of the ttestBF() command is the Bayes factor between model 1 – the model
determined by the maximum combination of the prior hypothesis and likelihood function – and
model 2 – the model that assumes no effect size. To interpret Bayes factors, the following sets
of guidelines have been proposed – both are good – reposted here from the page on classical
and Bayesian inference:
The ttestBF() output also includes the value of r that was used to define the Cauchy
distribution for the prior and a margin-of-error estimate for the Bayes factor calculation (which
will be ~0 unless there are very few data points in the analysis).
To start, let’s conduct a Bayesian t-test on the example data we used for the one-sample t-test:
library(BayesFactor)
##
## Attaching package: 'Matrix'
## ************
## Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact
Richard Morey (richarddmorey@gmail.com).
##
## Type BFManual() to open the manual.
## ************
ttestBF(c(freezer.data))
The result of the Bayes factor analysis, which uses a Cauchy prior on δ with a scale parameter
of r = 1/√2 (which is the default for ttestBF() and the value one would use if expecting a
medium-sized effect), indicates that the posterior odds in favor of the alternative model
increase the prior odds by a factor of 1,702; that is, the posterior model is about 1,702 times as
likely as the null model. That result agrees with the significant difference from 0 we got with
the frequentist t-test – the alternative is more likely than the null.
We can also wrap the ttestBF() results in the posterior() command – specifying in the
options the number of MCMC iterations we want to use – to produce posterior estimates of the
population mean, the population variance, the effect size (delta), and a parameter called g that
you do not need to worry about. To get summary statistics for those variables, we then wrap
posterior() in the summary() command. While the Bayes factor is probably the most
important output, the summary statistics on the posterior distributions help fill out the reporting
of our results.
summary(posterior(ttestBF(freezer.data), iterations=5000))
##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu -1.9338 0.2877 0.004068 0.004438
## sig2 0.7994 0.5505 0.007785 0.011190
## delta -2.4209 0.7219 0.010209 0.013686
## g 27.1980 192.7326 2.725651 3.067133
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu -2.4948 -2.1182 -1.9351 -1.7574 -1.345
## sig2 0.2773 0.4768 0.6607 0.9597 2.129
## delta -3.9121 -2.9096 -2.3910 -1.9103 -1.099
## g 0.5086 2.0731 4.4705 11.5516 130.982
plot(posterior(ttestBF(freezer.data), iterations=5000))
As shown with the examples above, ttestBF() uses the same syntax as t.test(). It is
therefore straightforward to translate the commands for the classical repeated-measures t-test
into the Bayesian repeated-measures t-test (with the posterior() options in there for
completeness).
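A minimal sketch of what that translation could look like – the vector names before.scores and after.scores are hypothetical placeholders, not the chapter's actual data:

# Bayesian repeated-measures t-test: x and y are the same cases measured twice
rm.BF <- ttestBF(x = before.scores, y = after.scores, paired = TRUE)
rm.BF                                           # Bayes factor vs. the null model
summary(posterior(rm.BF, iterations = 5000))    # posterior estimates of mu, sig2, delta, g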
##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu -92.315 3.749 5.301e-02 5.301e-02
## sig2 140.810 90.368 1.278e+00 1.787e+00
## delta -8.642 2.175 3.076e-02 4.056e-02
## g 1469.930 82453.962 1.166e+03 1.166e+03
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu -99.783 -94.67 -92.342 -90.049 -84.81
## sig2 49.556 85.03 116.378 169.251 376.78
## delta -13.184 -10.05 -8.533 -7.083 -4.66
## g 6.995 25.28 53.540 133.545 1736.64
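The same logic applies to the independent-groups case. A minimal sketch of how an object like ind.groups.BF could be constructed (group.1 and group.2 are hypothetical stand-ins for the two group vectors; ttestBF() treats separate, unpaired x and y arguments as independent groups):

# Bayesian independent-groups t-test on two unpaired samples
ind.groups.BF <- ttestBF(x = group.1, y = group.2, paired = FALSE)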
ind.groups.BF
summary(posterior(ind.groups.BF, iterations=5000))
##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 7.517 4.092e-01 5.786e-03 5.786e-03
## beta (x - y) -4.842 8.401e-01 1.188e-02 1.319e-02
## sig2 3.303 1.319e+00 1.865e-02 2.275e-02
## delta -2.803 6.935e-01 9.808e-03 1.229e-02
## g 346.610 1.788e+04 2.528e+02 2.528e+02
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 6.6793 7.251 7.514 7.783 8.308
## beta (x - y) -6.5054 -5.390 -4.851 -4.319 -3.114
## sig2 1.6687 2.412 3.013 3.869 6.604
## delta -4.1766 -3.269 -2.799 -2.326 -1.461
## g 0.7926 2.727 6.012 15.278 171.300
plot(posterior(ind.groups.BF, iterations=5000))
1. I think it would be more appropriate to say that the limit of the variance of the sample
means as n goes to infinity is 0 and that the proper mathematical way to say it is not
$\sigma^2/\infty = 0$ but rather $\lim_{n \to \infty} \frac{\sigma^2}{n} = 0$↩
2. The qualifier “finite” is used to exclude a sample of the entire population, because in that
(practically impossible) case the only possible sample mean is the population mean↩
4. Don’t delete your raw data, though! We won’t need them for the rest of the calculations,
but that doesn’t mean they might not come in handy for a different analysis.↩
5. Although if there is unequal n between the two groups, it is more likely that the data
violate the homoscedasticity assumption↩
7. In this section, we will focus only on one-way tests with two possible categories simply
because this is a page about differences between two things. It is an incredibly arbitrary
distinction and there probably is too little difference between χ² tests with two categories
and χ² tests with three or more categories to justify discussing them separately. But, the
distinction makes a lot more sense for the other tests being discussed in this chapter, so
we will revisit χ² tests again in Differences Between Three or More Things.↩
8. It is also common to say that such a table represents a crosstabulation (or crosstab) of the
responses.↩
9. The probabilities in the exact test follow a hypergeometric distribution, which gives the
probability of drawing s successes out of N trials without replacement.↩
10. If you’re curious, I very much agree: a hot dog is not a sandwich. While we’re here: I
understand how a straw has just one hole in a topological sense, in the sense in which we
use a straw it has two.↩
11. Frank Wilcoxon invented it, Mann and Whitney worked out the important details↩
12. The linked critical value table also includes critical values for a one-tailed test, which
involves multiplying the expected p-value by two. However, that procedure doesn’t really
square with the method of calculation of the U statistic, so I am not convinced by non-
two-tailed U tests.↩
13. Unfortunately, in addition to being the most powerful, it’s the biggest pain in the ass.↩
14. Seeing as the randomization test is also known as the permutation test, it is odd that we
use the combination formula instead of the permutation formula. We could use the
permutation formula to calculate the number of permutations of the data, but we would
end up with the same result, so we use the computationally easier of the two
approaches.↩
15. The R² type of effect size – proportions of variance explained by the effect – will be
discussed in the chapter on differences between three or more things.↩
16. Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic
press.↩
17. Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic
press.↩
18. There may no more magnificent collection of incredibly large effect sizes than in the
pages of statistics textbooks. Please don’t expect to regularly see effect sizes this large in
the wild.↩
19. The qualifier more-or-less is there to remind us that α won’t be super-precise if the
assumptions of parametric tests are violated, particularly with small sample sizes.↩
20. Richard D. Morey and Jeffrey N. Rouder (2018). BayesFactor: Computation of Bayes
Factors for Common Designs. R package version 0.9.12-4.2.
https://CRAN.R-project.org/package=BayesFactor↩
21. The base rate p(D), as is often the case, cancels out↩
12 Differences Between Three or More Things
12.1 Differences Between Things and Differences Within Things
Tufts University (“Tufts”) is an institution of higher learning in the Commonwealth of Massachusetts in
the United States of America. Tufts has three campuses: a main campus which covers parts of each of
the neighboring cities of Medford, MA and Somerville, MA; a medical center located in Boston, MA;
and a veterinary school located in Grafton, MA. Here’s a handy map:
Imagine that somebody asked you – given your access to the map pictured above – to describe the
distances between Tufts buildings. What if they asked you if they could comfortably walk or use a
wheelchair to get between any two of the buildings? Your answer likely would be that it depends: if
one wanted to travel between two buildings that are each on the Medford/Somerville campus, or each
on the Boston campus, or each on the Grafton campus, then those distances are small and easily
covered without motorized vehicles. If one wanted to travel between campuses, then, yeah, one would
need a car or at least a bicycle and probably a bottle of water, too.
We can conceive of the distances between the nine locations marked on the map as if they were just
nine unrelated places: some happen to be closer to each other, some happen to be more distant from
each other. Or, we can conceive of the distances as a function of campus membership: the distances are
small within each campus and the distances are relatively large between campuses. In that conception,
there are two sources of variance in the distances: within campus variance and between campus
variance, with the former much shorter than the latter. If one were trying to locate a Tufts building,
getting the campus correct is much more important than getting the location within each campus correct:
if you had an appointment at a building on the Medford/Somerville campus and you went to the wrong
Medford/Somerville campus building, there might be time to recover and find the right building; if you
went to, say, the Grafton campus, you should probably call, explain, apologize, and try to reschedule.
The example of campus buildings (hopefully) illustrates the basic logic of the Analysis of Variance,
more frequently known by the acronym ANOVA. We use ANOVA when there are three or more groups
being compared at a time: we could compare pairs of groups using t-tests, but this would A. inflate the
type-I error rate (the more tests you run, the more chances of false alarms),1 and B. fail to give a clear
analysis of the omnibus effect of a factor along which all of the groups differ.
As noted in the introduction, ANOVA is based on comparisons of sources of variance. For example,
the structure of the simplest ANOVA design – the one-way, between-groups ANOVA – creates a
comparison between the variance in the data that comes from between-groups differences and the
variance in the data that comes from within-groups differences: if differences in the data that are
attributable to the group membership of each datapoint are generally more important than the
differences that just occur naturally (and are observable within each group). The logic of ANOVA is
represented visually in Figure 12.1 and Figure 12.2. Figure 12.1 is a representation of what happens
when the populations of interest are either the same or close to it: there the variance between data
points in different groups is the same as the variance between data points within each group, and
the is no variance* added by group membership.
In Figure 12.2, group membership makes a substantial contribution to the overall variance of the data:
the variance within each group is the same as in the situation depicted in Figure 12.1, but the
differences between the groups slide the populations up and down the x-axis, spreading out the data as
they go.
Figure 12.2: Visual Representation of ANOVA Logic: Significant Difference Between Populations
The overall variance in the observed data can be decomposed into the sum of the two sources of
variance, between and within.² If the null hypothesis is true and there is no contribution of between-
groups differences to the overall variance – which is what is meant by $\sigma^2_{between} = 0$ – then all of the
variance in the data is attributable to within-groups differences: $\sigma^2_{between} + \sigma^2_{within} = \sigma^2_{within}$. If the
null hypothesis is false and there is any contribution of between-groups differences to the overall
variance – $\sigma^2_{between} > 0$ – then there will be variance beyond what is attributable to within-groups
differences: $\sigma^2_{between} + \sigma^2_{within} > \sigma^2_{within}$.

More precisely, the population-level comparison that ANOVA makes is between
$n\sigma^2_{between} + \sigma^2_{within}$ and $\sigma^2_{within}$: the influence of $\sigma^2_{between}$ multiplies with every observation that it
impacts (e.g., the influence of between-groups differences causes more variation if there are n = 30
observations per group than if there are n = 3 observations per group). The way that ANOVA makes
this comparison is by evaluating ratios of between-groups variances and within-groups variances: on
the population-level, if the ratio is equal to 1, then the two things are the same, if the ratio is greater
than one, then the between-groups variance is at least a little bit bigger than 0. Of course, as we’ve
seen before with inference, if we estimate that ratio using sample data, any little difference isn’t going
to impress us the way a hypothetical population-level little difference would: we’re likely going to
need a sample statistic that is comfortably greater than 1.
When we estimate the between-groups plus within-groups variance and within-groups variance using
sample data, we calculate what is known as the F-ratio, also known as the F-statistic, or just plain F.³
The cumulative likelihood of observed F-ratios is evaluated in terms of the F-distribution. When
we calculate an F-ratio, we are calculating the ratio between two variances: between variance and
within variance. As we know from the chapter on probability distributions, variances are modeled by
χ² distributions: χ² distributions comprise the sums of squares of normally-distributed values, and a
variance is essentially such a sum for a given set of variables (dividing by the constant n − 1 doesn't
disqualify it). Developed to model the ratio of variances, the F-distribution is literally the ratio of two
χ² distributions, and it is defined by two df parameters: the df of the numerator χ² distribution and the
df of the denominator χ² distribution by which it is divided to get an F value.
Figure 12.3: Example Derivation of the F-statistic From a Numerator χ² Distribution with df = 3 and
a Denominator χ² Distribution with df = 11
At the population level, the F-ratio is:

$$F = \frac{n\sigma^2_{between} + \sigma^2_{within}}{\sigma^2_{within}}$$

If the sample data indicate no difference between the observed variance caused by a combination of
between-groups plus within-groups variance and the variance observed within-groups, then $F \approx 1$ and
we will continue to assume $H_0$. Otherwise, $F \gg 1$, and we will reject $H_0$ in favor of the $H_1$ that
there is some contribution of the differences between groups to the overall variance in the data.
Because of the assumptions involved in constructing the F -distribution – that at the population-level
the residuals are normally distributed and the within-groups variance is the same across groups – the
assumptions of normality and homoscedasticity apply to all flavors of ANOVA. For some of those
flavors, there will be additional assumptions that we will discuss when we get to them, but normality
and homoscedasticity are universally assumed for classical ANOVAs.
Conceptually, the ANOVA logic applies to all flavors of ANOVA. The tricky part, of course, is how to
calculate the F -ratio, and that will differ across flavors of ANOVA. There are also a number of other
features of ANOVA-related analysis that we will discuss, including estimating effect sizes, making
more specific between-group comparisons post hoc, a couple of nonparametric alternatives, and the
Bayesian counterpart to the classical ANOVA (don’t get too excited: it’s just a “Bayesian ANOVA”).
The flavors of ANOVA are more frequently (but less colorfully) described as different ANOVA models
or ANOVA designs. We will start our ANOVA journey with the one-way between-groups ANOVA,
describing it and its related features before proceeding to more complex models.
Figure 12.4: The features of this ANOVA model include Bluetooth capability, a dishwasher safe
detachable stainless steel skirt and disks, cooking notifications, and a 2-year warranty.
the between-groups difference is the difference caused by the j different levels of the between-groups
factor α. The last element is the within-groups variance, which falls into the more general categories
of error or residuals. It is the naturally-occurring variation within each condition. To continue the
example of the plants and the sunlight: not every plant that is exposed to the same sunlight for the same
duration will have leaves that are exactly the same width. In this model, the within-groups influence is
symbolized by $\epsilon_{ij}$, indicating that the error ($\epsilon$) exists within each individual observation ($i$) and at each
level of the factor ($j$).
In a between-groups design, each observation comes from an independent group, that is, each
observation comes from one and only one level of the treatment.
Could we account for at least some of the within-groups variation by identifying individual
differences? Sure, and that’s the point of within-groups designs, but it’s not possible unless we observe
each plant under different sunlight Conditions, which is not part of a between-groups design. For more
on when we want to (or have to) use between-groups designs instead of within-groups designs, please
refer to the chapter on differences between two things.
Thus, each value $y$ of the dependent variable is said to be determined by the sum of the grand mean $\mu$,
the mean of the factor-level $\alpha_j$, and the individual contribution of error $\epsilon_{ij}$; and because $y$ is influenced
by both the level of the factor ($j$) and the individual observation ($i$) within that level, it is written $y_{ij}$:

$$y_{ij} = \mu + \alpha_j + \epsilon_{ij}$$
If you think that equation vaguely resembles a regression model, that’s because it is:
ANOVA is a special case of multiple regression where there is equal n per factor-level.
ANOVA is regression-based much as the t-test can be expressed as a regression analysis. In fact, while
we are making connections, the t-test is a special case of ANOVA. If, for example, you enter data with
two levels of a between-groups factor, you will get precisely the same results – with an F statistic
instead of a t statistic – as you would for an independent-groups t-test.5
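That equivalence is easy to check yourself. A minimal sketch with hypothetical two-group data (names and seed are arbitrary) showing that a two-level one-way ANOVA and a pooled-variance independent-groups t-test agree, with F equal to t squared:

# Hypothetical data: two groups of 10 observations each
set.seed(1)
g1 <- rnorm(10, mean = 0, sd = 1)
g2 <- rnorm(10, mean = 1, sd = 1)
demo.df <- data.frame(DV = c(g1, g2),
                      Group = rep(c("g1", "g2"), each = 10))

t.out <- t.test(g1, g2, var.equal = TRUE)       # pooled-variance t-test
a.out <- summary(aov(DV ~ Group, data = demo.df))

unname(t.out$statistic^2)       # t squared ...
a.out[[1]][["F value"]][1]      # ... matches the ANOVA F for the two-level factor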
The results of ANOVA can be represented by an ANOVA table. The standard ANOVA table includes
the following elements:
1. Source (or Source of Variance, or SV): the leftmost column represents the decomposition of the
elements that influence values of the dependent variable (except the grand mean μ). For the one-
way between-groups ANOVA, the listed sources are typically between groups (or just between)
and within groups (or just within), with an entry for total variance often included. To increase the
generality of the table – which will be helpful because we are about to talk about a whole bunch
of ANOVA tables – we can call the between variance A to indicate that it is the variance
associated with factor A and we can call the within variance the error.
2. Degrees of Freedom (df ): the degrees of freedom associated with each source. These are the
degrees of freedom that are associated with the variance calculation (i.e., the variance
denominator) for each source of variance.
3. Sums of Squares ($SS$): The numerator of the calculation for the variance ($\sum(x - \bar{x})^2$) associated
with each source.
4. Mean Squares (M S ): The estimate of the variance for each source. It is equal to the ratio of SS
and df that are respectively associated with each source. Note: M S are not calculated for the
total because there is no reason whatsoever to care about the total M S .
5. F: The F-ratio. The F-ratio is the test statistic for the ANOVA. For the one-way between-groups
design (and also for the one-way within-groups design, but we'll get to that later), there is only one
F-ratio, and it is the ratio of the between-groups MS ($MS_A$) to the within-groups MS ($MS_e$).
In addition to the five columns described above, there is a sixth column that is extremely optional to
include:
6. Expected Mean Squares ($EMS$): As noted above, the population-level F-ratio is
$\frac{n\sigma^2_{between} + \sigma^2_{within}}{\sigma^2_{within}}$, which can somewhat more simply be annotated $\frac{n\sigma^2_\alpha + \sigma^2_\epsilon}{\sigma^2_\epsilon}$. In other
words, it is the ratio of the sum of between-groups variance – which is multiplied by the number
of observations per group – plus the within-groups variance, divided by the within-groups
variance. The $EMS$ for each group lists the terms that are represented by each source. For
between-groups variance – or factor $\alpha$ – the $EMS$ is the numerator of the population-level
F-ratio, $n\sigma^2_\alpha + \sigma^2_\epsilon$. For the within-groups variance – or the error ($\epsilon$) – it is $\sigma^2_\epsilon$. The $EMS$ are
helpful for two things: keeping track of how to calculate the F-ratio for a particular factor in a
particular ANOVA model (admittedly, this isn't really an issue for the one-way between-groups
ANOVA since there is only one way to calculate an F-statistic and it's relatively easy to remember
it), and calculating the effect-size statistic $\omega^2$. Both of those things can be accomplished without
explicitly knowing the $EMS$ for each source of variance, which is why the $EMS$ column of the
ANOVA table is eminently skippable (but, if you do need to include it – presumably at the behest
of some evil stats professor – please keep in mind that there are no numbers to be calculated for
$EMS$: they are strictly algebraic formulae).
Here is the ANOVA table – with EM S – for the one-way between-groups ANOVA model with
formula guides for how to calculate the values that go in each cell:
| SV | df | SS | MS | F | EMS |
|---|---|---|---|---|---|
| A | $j - 1$ | $n\sum(y_{\bullet j} - y_{\bullet\bullet})^2$ | $SS_A/df_A$ | $MS_A/MS_e$ | $n\sigma^2_\alpha + \sigma^2_\epsilon$ |
| Error | $j(n - 1)$ | $\sum(y_{ij} - y_{\bullet j})^2$ | $SS_e/df_e$ | | $\sigma^2_\epsilon$ |
| Total | $jn - 1$ | $\sum(y_{ij} - y_{\bullet\bullet})^2$ | | | |
In the table above, the bullet (∙) replaces the subscript i, j, or both to indicate that the y values are
averaged across each level of the subscript being replaced. $y_{\bullet j}$ indicates the average of all of the y
values in each j averaged across all i observations, and $y_{\bullet\bullet}$ is the grand mean of all of the y values
averaged across all i observations in each of j conditions. Thus, $SS_A$ is equal to the sum of the
squared differences between each group mean $y_{\bullet j}$ and the grand mean $y_{\bullet\bullet}$ times the number of
observations in each group n, $SS_e$ is equal to the sum of the sums of the squared differences between
each observation $y_{ij}$ and the mean $y_{\bullet j}$ of the group to which it belongs, and $SS_{total}$ is equal to the sum
of the squared differences between each observation $y_{ij}$ and the grand mean $y_{\bullet\bullet}$.
In the context of ANOVA, n always refers to the number of observations per group. If one needs to
refer to all of the observations (e.g. all the participants in an experiment that uses ANOVA to analyze
the data), N is preferred to (hopefully) avoid confusion. In the classic ANOVA design, n is always the
same between all groups: if n is not the same between groups, then technically the analysis is not an
ANOVA. That’s not a dealbreaker for statistical analysis, though: if n is unequal between groups, the
data can be analyzed in a regression context with the general linear model and the procedure will
return output that is pretty-much-identical to the ANOVA table. The handling of unequal n is beyond the
scope of this course, but I assure you, it’s not that bad, and you’ll probably get to it next semester.
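For the curious, here is a minimal sketch of that regression-context approach with hypothetical unequal-n data (the group sizes, means, and seed are invented for illustration); anova(lm(...)) returns the familiar df, SS, MS, and F columns:

# Hypothetical unequal-n data: 5, 7, and 6 observations per group
set.seed(10)
DV.unequal <- c(rnorm(5, 0), rnorm(7, 1), rnorm(6, 2))
grp <- factor(rep(c("a", "b", "c"), times = c(5, 7, 6)))

anova(lm(DV.unequal ~ grp))   # ANOVA-style table from the general linear model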
12.2.1.3 Example
Please imagine that the following data are observed in a between-groups design:
12.2.1.3.0.1 Calculating df
For the example data, $df_A = j - 1 = 3$, $df_e = j(n - 1) = 16$, and $df_{total} = jn - 1 = 19$. Please note
that $df_A + df_e = df_{total}$, and that $df_{total} = N - 1$ (or the total number of observations minus 1). Both of
those facts will be good ways to check your df math as models become more complex.
12.2.1.3.0.2 Calculating $SS_A$
SS A is the sum of squared deviations between the level means and the grand mean multiplied by the
number of observations in each level. Another way of saying that is that for each observation, we take
the mean of the level to which that observation belongs, subtract the grand mean, square the difference,
and sum all of those squared differences. Because n is the same across groups by the ANOVA
definition, it makes no difference whether we multiply the squared differences between the group means and the
grand mean by n or add up the squared difference for the group mean of every single observation; I
happen to find the latter more visually convincing for pedagogical purposes, as in the table below:
| Condition 1 | $(y_{\bullet 1} - y_{\bullet\bullet})^2$ | Condition 2 | $(y_{\bullet 2} - y_{\bullet\bullet})^2$ | Condition 3 | $(y_{\bullet 3} - y_{\bullet\bullet})^2$ | Condition 4 | $(y_{\bullet 4} - y_{\bullet\bullet})^2$ |
|---|---|---|---|---|---|---|---|
| -3.10 | 31.49 | 7.28 | 1.17 | 0.12 | 0.04 | 8.18 | 18.89 |
| 0.18 | 31.49 | 3.06 | 1.17 | 5.51 | 0.04 | 9.05 | 18.89 |
| -0.72 | 31.49 | 4.74 | 1.17 | 5.72 | 0.04 | 11.21 | 18.89 |
| 0.09 | 31.49 | 5.29 | 1.17 | 5.93 | 0.04 | 7.31 | 18.89 |
| -1.66 | 31.49 | 7.88 | 1.17 | 6.56 | 0.04 | 8.83 | 18.89 |
The sum of the values in the squared-deviation columns is 257.94: that is the value of $SS_A$.
12.2.1.3.0.3 Calculating $SS_e$
The $SS_e$ for this model – or, the within-groups variance – is the sum across levels of factor A of the
sums of the squared differences between each $y_{ij}$ value of the dependent variable and the mean $y_{\bullet j}$ of
the group to which it belongs:
[Table: each observation in Conditions 1 through 4 alongside its squared deviation $(y_{ij} - y_{\bullet j})^2$ from its own condition mean; the sum of those squared deviations is $SS_e$.]
To calculate $SS_{total}$, we sum the squared differences between each observed value $y_{ij}$ and the grand
mean $y_{\bullet\bullet}$:

[Table: each observation in Conditions 1 through 4 alongside its squared deviation $(y_{ij} - y_{\bullet\bullet})^2$ from the grand mean.]
The sum of those squared deviations is 316.76, which is $SS_{total}$. It is also equal to $SS_A + SS_e$,
which is notable for (at least) two reasons:
2. The total sums of squares of the dependent variable is completely broken down by the sums of
squares associated with the between-groups factor and the sums of squares associated with
within-groups variation, with no overlap between the two. The fact that the total variation is
exactly equal to the sum of the variation contributed by the two sources demonstrates the principle
of orthogonality: where two or more factors do not account for the same variance because they
are uncorrelated with each other.
Orthogonality is an important statistical concept that shows up in multiple applications, including one
of the post hoc tests to be discussed below and, critically, all ANOVA models. In every ANOVA model,
the variance contributed to the dependent variable will be broken down orthogonally between each of
the sources, meaning that $SS_{total}$ will always be precisely equal to the sum of the $SS$ for all sources of
variance.
As noted in the ANOVA table above, the mean squares for each source of variance is the ratio of the
sums of squares associated with that source divided by the degrees of freedom for that source. For the
example data:
$$MS_A = \frac{SS_A}{df_A} = \frac{257.94}{3} = 85.98$$

$$MS_e = \frac{SS_e}{df_e} = \frac{58.82}{16} = 3.68$$
The observed F-ratio is the ratio of $MS_A$ to $MS_e$, and is annotated with the degrees of freedom of the
numerator and the denominator:

$$F(3, 16) = \frac{MS_A}{MS_e} = \frac{85.98}{3.68} = 23.39$$
| SV | df | SS | MS | F | EMS |
|---|---|---|---|---|---|
| A | 3 | 257.94 | 85.98 | 23.39 | $n\sigma^2_\alpha + \sigma^2_\epsilon$ |
| Error | 16 | 58.82 | 3.68 | | $\sigma^2_\epsilon$ |
| Total | 19 | 316.76 | | | |
The F-ratio is evaluated in the context of the F-distribution associated with the observed $df_{num}$ and
$df_{denom}$. For our example data, the F-ratio is 23.39: the cumulative likelihood of observing an F-
statistic of 23.39 or greater given $df_{num} = 3$ and $df_{denom} = 16$ – the area under the F-distribution to
the right of 23.39 – is:
## [1] 4.30986e-06
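That tail probability can be obtained with R's F-distribution function pf(); the exact call used for the value above isn't shown, but it would look something like:

# Area under the F(3, 16) distribution to the right of the observed F-ratio
pf(23.39, df1 = 3, df2 = 16, lower.tail = FALSE)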
Since $4.3 \times 10^{-6}$ is smaller than any reasonable value of α, we will reject the null hypothesis that
$\sigma^2_\alpha = 0$ in favor of the alternative hypothesis that $\sigma^2_\alpha > 0$: there is a significant effect of the between-
groups factor.
12.2.1.3.0.8 ANOVA in R
There are several packages to calculate ANOVA results in R, but for most purposes, the base command
aov() will do. The aov() command returns an object that can then be summarized with the summary()
command wrapper to give the elements of the traditional ANOVA table.
aov() expects a formula and data structure similar to what we used for linear regression models and
for tests of homoscedasticity. Thus, it is best to put the data into long format. Using the example data
from above:
one.way.between.groups.example.df<-data.frame(DV, Factor.A)
summary(aov(DV~Factor.A,
data=one.way.between.groups.example.df))
The effect-size statistics for ANOVA models are based on the same principle as $R^2$: they are measures
of the proportion of the overall variance in the data that is explained by the factor(s) in the model (as
opposed to the error). In fact, one ANOVA-based effect-size statistic – $\eta^2$ – is precisely equal to $R^2$
for the one-way between-groups model (remember: ANOVA is a special case of multiple regression).
Since they are proportions, the ANOVA-based effect-size statistics range from 0 to 1, with larger
values indicating increased predictive power for the factor(s) in the model relative to the error.
We here will discuss two ANOVA-based effect-size statistics: $\eta^2$ and $\omega^2$.⁶ Estimates of $\eta^2$ and of $\omega^2$
tend to be extremely close: if you calculate both statistics for the same data, the two will rarely
disagree about the general size of an effect; far more often they will simply be separated by a couple
percent, or tenths of a percent, or hundredths of a percent.
Of $\eta^2$ and $\omega^2$, $\eta^2$ is vastly easier to calculate, and as noted just above, not far off from estimates of $\omega^2$
for the same data. $\eta^2$ is also more widely used and understood. The only disadvantage is that $\eta^2$ is a
biased estimator of effect size: it's consistently too big (one can test that with Monte Carlo simulations
and/or synthetic data – using things like the rnorm() command in R, as sketched below – to check the
relationship between the magnitude of a statistical estimate and what it should be given that the
underlying distribution characteristics are known). That bias is usually more pronounced for smaller
sample sizes: if you google "eta squared disadvantage," the first few results will be about the
estimation bias of $\eta^2$ and the problems it has with relatively small n. Again, it's not a gamechanging
error, but if you are pitching your results to an audience of ANOVA-based effect-size purists, they may
be looking for you to use the unbiased estimator $\omega^2$.
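Here is a minimal sketch of such a simulation (the group structure, seed, and number of replicates are arbitrary assumptions): four groups of 5 drawn from the same population, so the true effect is exactly 0, yet the average estimated eta-squared comes out clearly above 0.

# Monte Carlo check of eta-squared's upward bias under a true null effect
set.seed(42)
eta.sq <- replicate(2000, {
  DV <- rnorm(20)                        # all observations from one population
  Factor.A <- factor(rep(1:4, each = 5)) # arbitrary 4-group labels
  tab <- summary(aov(DV ~ Factor.A))[[1]]
  tab[["Sum Sq"]][1] / sum(tab[["Sum Sq"]])   # SS_A / SS_total
})
mean(eta.sq)   # noticeably greater than the true value of 0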
In addition to being unbiased, the other advantage that $\omega^2$ has is that it is equivalent to the statistic
known as the intraclass correlation (ICC), which is usually considered a measure of the psychometric
construct reliability. ANOVA-based effect size statistics – including both $\eta^2$ and $\omega^2$ – can also be
interpreted as measuring reliability of a measure: if a factor or factors being measured make a big
difference in the scores, then assessing dependent variables based on those factors should return
reliable results over time. In the world of psychometrics – which I can tell you from lived experience
is exactly as exciting and glamorous as it sounds – some analysts report ICC statistics and some report
$\omega^2$ statistics and a surprisingly large proportion don't know that they are talking about the same thing.
12.2.2.1 Calculating $\eta^2$
For the one-way between-groups ANOVA model, $\eta^2$ is a ratio of sums of squares: the sums of squares
associated with the factor divided by the total sums of squares:

$$\eta^2 = \frac{SS_A}{SS_{total}}$$

For the example data:

$$\eta^2 = \frac{SS_A}{SS_{total}} = \frac{257.94}{316.76} = 0.81$$
which is a large effect (see the guidelines for interpreting ANOVA-based effect sizes below).
12.2.2.2 Calculating $\omega^2$
The $\omega^2$ statistic is an estimate of the population-level variance that is explained by the factor(s) in a
model relative to the population-level error.⁷ The theoretical population-level variances are known as
population variance components: for the one-way between-groups ANOVA model, the two
components are the variance from the independent-factor variable $\sigma^2_\alpha$ and the variance associated with
error $\sigma^2_\epsilon$.

As noted above, the between-groups variance in the one-way model is classified as $n\sigma^2_\alpha + \sigma^2_\epsilon$, and is
estimated in the sample data by $MS_A$; the within-groups variance is classified as $\sigma^2_\epsilon$, and is estimated
in the sample data by $MS_e$: these are the terms listed for factor A and for the error, respectively, in the
$EMS$ column of the ANOVA table. At the population level:

$$\omega^2 = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\epsilon}$$

The error variance component is estimated directly by the error mean squares:

$$\hat{\sigma}^2_\epsilon = MS_e$$

If we subtract $MS_e$ (being the estimate of $\sigma^2_\epsilon$) from $MS_A$ and divide by $n$, we get the population variance component
$\sigma^2_\alpha$:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}$$
Please note that a simpler and probably more meaningful way to conceive of $\sigma^2_\alpha$ is that it is literally
the variance of the condition means. For the one-way between-groups ANOVA model, it is more
straightforward to simply calculate the variance of the condition means than to use the above formula,
but it gets moderately more tricky to apply that concept to more complex designs.

We're not quite done with $\hat{\sigma}^2_\alpha$ yet: there is one more consideration.
The value of $\hat{\sigma}^2_\alpha$ as calculated above has for its denominator $df_A$, which is $j - 1$. Please recall that the
formula for a sample variance – $s^2 = \frac{\sum(x - \bar{x})^2}{n - 1}$ – also has an $n - 1$ term in the denominator. In the case
of a sample variance, that’s an indicator that the variance refers to a sample that is much smaller than
the population from which it comes. In the case of a population variance component, that indicates that
the number of levels j of the independent-variable factor A is a sample of levels much smaller than
the total number of possible levels of the factor. When the number of levels of the independent
variable represent just some of the possible levels of the independent variable, that IV is known as a
random-effects factor (the word “random” may be misleading: it doesn’t mean that factors were
chosen out of a hat or via a random number generator, just that there are many more possible levels that
could have been tested). For example, a health sciences researcher might be interested in the effect of
different U.S. hospitals on health outcomes. They may not be able to include every hospital in the
United States in the study, but rather 3 or 4 hospitals that are willing to participate in the study: in that
case, hospital would be a random effect with those 3 or 4 levels (and the within-groups variable
would be the patients in the hospitals).
A fixed-effect factor is one in which all (or, technically, very nearly all) possible levels of the
independent variables are examined. To recycle the example of studying the effect of hospitals on
patient outcomes, we might imagine that instead of hospitals in the United States, the independent
variable of interest were hospitals in one particular city, in which case 3 or 4 levels might represent
all possible levels of the variable. Please note that it is not the absolute quantity of levels that
determines whether an effect is random or fixed but the relative quantity of observed levels to the total
possible set of levels of interest.
When factor A is fixed, we have to replace the denominator $j - 1$ with $j$ (to indicate that the variance
is a measure of all of the levels of interest of the IV). We can do that pretty easily by multiplying the
population variance component $\hat{\sigma}^2_\alpha$ by $j - 1$ – which is equal to $df_A$ – and then dividing the result by $j$
– which is equal to $df_A + 1$. We can accomplish that more simply by multiplying the estimate of $\sigma^2_\alpha$ by
the term $\frac{df_A}{df_A + 1}$.

If factor A is random, then:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}$$

$\sigma^2_\alpha$, as noted above, is the variance of the means of the levels of factor A. If factor A is random, then
$\sigma^2_\alpha$ is, specifically, the sample variance of the means of the levels of factor A about the grand mean $y_{\bullet\bullet}$:

$$\hat{\sigma}^2_\alpha = \frac{\sum(y_{\bullet j} - y_{\bullet\bullet})^2}{j - 1}$$

If factor A is fixed, then:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}\left[\frac{df_A}{df_A + 1}\right]$$

which is equivalent to the population variance of the means of the levels of factor A about the grand
mean $y_{\bullet\bullet}$:

$$\hat{\sigma}^2_\alpha = \frac{\sum(y_{\bullet j} - y_{\bullet\bullet})^2}{j}$$
One more note on population variance components before we proceed to calculating $\omega^2$ for the sample
data: because the numerator of $\hat{\sigma}^2_x$ (using $x$ as a generic indicator of a factor that can be $\alpha$, $\beta$, $\gamma$, etc.
for more complex models) includes a subtraction term (in the above equation, it's $MS_A - MS_e$), it is
possible to end up with a negative estimate for $\hat{\sigma}^2_x$ if the F-ratio for that factor is not significant. This
really isn't an issue for the one-way between-groups ANOVA design because if $F_A$ is not significant
then there is no effect to report and, as noted in the chapter on differences between two things, we do
not report effect sizes when effects are not significant. But, it can happen for population variance
components in factorial designs. If that is the case, we set all negative population variance components
to 0.
For the example data, treating factor A as a random effect:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n} = \frac{85.98 - 3.68}{5} = 16.46$$

and:

$$\omega^2 = \frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha + \hat{\sigma}^2_\epsilon} = \frac{16.46}{16.46 + 3.68} = 0.82$$

Treating factor A as a fixed effect:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}\left[\frac{df_A}{df_A + 1}\right] = \left(\frac{85.98 - 3.68}{5}\right)\left(\frac{3}{4}\right) = 12.35$$

and:

$$\omega^2 = \frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha + \hat{\sigma}^2_\epsilon} = \frac{12.35}{12.35 + 3.68} = 0.77$$
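As a quick check on the arithmetic, here is a minimal R sketch (assuming the long-format data frame one.way.between.groups.example.df built above and n = 5 observations per group) that pulls the mean squares out of the aov() table and computes the random-effects version of ω²:

aov.tab <- summary(aov(DV ~ Factor.A,
                       data = one.way.between.groups.example.df))[[1]]
MS.A <- aov.tab[["Mean Sq"]][1]   # between-groups mean squares
MS.e <- aov.tab[["Mean Sq"]][2]   # within-groups (error) mean squares
n <- 5                            # observations per group in the example data

sigma2.alpha <- (MS.A - MS.e) / n      # population variance component for factor A
sigma2.alpha / (sigma2.alpha + MS.e)   # omega-squared, random-effects version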
As is the case for most classical effect-size guidelines, Cohen (2013) provided interpretations for $\eta^2$
and $\omega^2$. In his book Statistical Power Analysis for the Behavioral Sciences – which is the source for
all of Cohen's effect-size guidelines – Cohen claims that $\eta^2$ and $\omega^2$ are measures of the same thing
(which is sort of true but also sort of not) and thus should have the same guidelines for what effects are
small, medium, and large (which more-or-less works out because $\eta^2$ and $\omega^2$ are usually so similar in
magnitude):
[Table: Cohen's guidelines for interpreting $\eta^2$ or $\omega^2$ values as small, medium, or large effects.]
Thus, the $\eta^2$ value and the $\omega^2$ values assuming either random or fixed effects for the sample data all fall
in the large-effect range.
Roughly translated, post hoc is Latin for after this. Post hoc tests are conducted after a significant
effect of a factor has been established to learn more about the nature of an effect, i.e.: which levels of
the factor are associated with higher or lower scores?
Post hoc tests can help tell the story of data beyond just a significant F -ratio in several ways. There is
a test that compares experimental conditions to a single control condition (the Dunnett common control
procedure). There are tests that compare the average(s) of one set of conditions to the average(s) of
another set of conditions (including orthogonal contrasts and Scheffé contrasts). There are also post
hoc tests that compare the differences between each possible pair of condition means (including Dunn-
Bonferroni tests, Tukey’s HSD, and the Hayter-Fisher test). The choice of post hoc test depends on the
story you are trying to tell with the data.
12.2.3.1 Pairwise Comparisons
The use of Dunn-Bonferroni tests, also known as applying the Dunn-Bonferroni correction, is a
catch-all term for any kind of statistical procedure where multiple hypotheses are conducted
simultaneously and the type-I error rate is reduced for each test of hypotheses in order to avoid
inflating the overall type-I error rate. For example, a researcher testing eight different simple
regression models using eight independent variables to separately predict the dependent variable – that
is, running eight different regression commands – might compare the p-values associated with each of the
eight models to an α that is 1/8th the size of the overall desired α-rate (otherwise, she would risk
committing eight times the type-I errors she would with a single model).
In the specific context of post hoc tests, the Dunn-Bonferroni test is an independent-samples t-test for
each of the condition means. The only difference is that one takes the α-rate of one’s preference – say,
α = 0.05, and divides it by the number of possible multiple comparisons that can be performed on the
condition means:
$$\alpha_{Dunn\text{-}Bonferroni} = \frac{\alpha_{overall}}{\#\text{ of pairwise comparisons}}$$
(it may be tempting to divide the overall α by the number of comparisons performed instead of the
number possible, but then one could just run one comparison with regular α and thus defeat the
purpose)
To use the example data from above, let’s imagine that we are interested in applying the Dunn-
Bonferroni correction to a post hoc comparison of the mean of Condition 1 and Condition 2. To do so,
we simply follow the same procedure as for the independent-samples t-test and adjust the α-rate.
Assuming we start with α = 0.05, since there are 4 groups in the data, there are ${}_4C_2 = 6$ possible
pairwise comparisons to correct for:
##
## Welch Two Sample t-test
##
## data: Condition1 and Condition2
## t = -6.2688, df = 7.1613, p-value = 0.0003798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.204768 -4.179232
## sample estimates:
## mean of x mean of y
## -1.042 5.650
The observed p-value of 0.0003798 < 0.0125, so we can say that there is a significant difference
between the groups after applying the Dunn-Bonferroni correction.
Please also note that if you need to for any reason, you could also multiply the observed p-value by the
number of possible pairwise comparisons and compare the result to the original α. It works out to be
the same thing, a fact that is exploited by the base R command pairwise.t.test(), in which you can
use the option p.adjust="bonferroni"⁹ to conduct pairwise Dunn-Bonferroni tests on all of the conditions
in a set simultaneously:
DV<-c(Condition1, Condition2, Condition3, Condition4)
Factor.A<-c(rep("Condition 1", length(Condition1)),
rep("Condition 2", length(Condition2)),
rep("Condition 3", length(Condition3)),
rep("Condition 4", length(Condition4)))
##
## Pairwise comparisons using t tests with non-pooled SD
##
## data: DV and Factor.A
##
## Condition 1 Condition 2 Condition 3
## Condition 2 0.0023 - -
## Condition 3 0.0276 1.0000 -
## Condition 4 2.3e-05 0.1125 0.1222
##
## P value adjustment method: bonferroni
Tukey’s Honestly Significant Difference (HSD) Test is a way of determining whether the difference
between two condition means is (honestly) significant. The test statistic for Tukey’s HSD test is the
studentized range statistic q, where for two means $y_{\bullet 1}$ and $y_{\bullet 2}$:

$$q = \frac{y_{\bullet 1} - y_{\bullet 2}}{\sqrt{\frac{MS_e}{n}}}$$
^[In this chapter, we are dealing exclusively with classical ANOVA models with equal n per group.
The q statistic for the HSD test can also handle cases of unequal n for post hoc tests of results from
generalized linear models by using the denominator:
$$\sqrt{\frac{MS_e}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
which is known as the Kramer correction. If using the Kramer correction, the more appropriate name
for the test is not the Tukey test but rather the Tukey-Kramer procedure.]
The critical value of q depends on the number of conditions and on $df_e$: those values can be found in
published tables of the studentized range statistic.
For any given pair of means, the observed value of q is given by the difference between those means
divided by $\sqrt{MS_e/n}$, where $MS_e$ comes from the ANOVA table and n is the number in each group:
if the observed q exceeds the critical q, then the difference between those means is honestly
significantly different.
When performing the Tukey HSD test by hand, it also can be helpful to start with the critical q and
calculate a critical HSD value such that any observed difference between means that exceeds the
critical HSD is itself honestly significant:
$$HSD_{crit} = q_{crit}\sqrt{\frac{MS_e}{n}}$$
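If you would rather not consult a printed table, R's qtukey() function supplies the critical q directly; a minimal sketch using the example data's values (4 groups, $df_e = 16$, $MS_e = 3.68$, n = 5):

q.crit <- qtukey(0.95, nmeans = 4, df = 16)   # critical studentized range at alpha = .05
HSD.crit <- q.crit * sqrt(3.68 / 5)           # smallest honestly significant mean difference
q.crit
HSD.crit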
But, it is unlikely that you will be calculating HSDs by hand. The base R command TukeyHSD() wraps
an aov() object – in the same way that we can wrap summary() around an aov() object to get a full-
ish ANOVA table – and returns results about the differences between means with confidence intervals
and p-values:
TukeyHSD(aov(DV~Factor.A,
data=one.way.between.groups.example.df))
According to the results here, there are honestly significant differences (assuming α = 0.05) between
Condition 2 and Condition 1, between Condition 3 and Condition 1, between Condition 4 and
Condition 1, and between Condition 4 and Condition 3. All other differences between level means are
not honestly significant.
The Hayter-Fisher test is identical to the Tukey HSD, but with a smaller critical q value, which makes
this test more powerful than the Tukey HSD test. The critical value of q used for the Hayter-Fisher test
is the value listed in tables for k − 1 groups given the same $df_e$ (in the table linked above, it is the
critical q listed one column immediately to the left of the value one would use for Tukey's HSD).
The condition attached to being able to use a smaller critical q value is that the Hayter-Fisher test
can only be used following a significant F-test. However, since you shouldn't be reporting results of post
hoc tests following non-significant F -tests anyway, that’s not really an issue. The issue with the
Hayter-Fisher test is that fewer people have heard about it: probably not a dealbreaker, but something
you might have to write in a response to Reviewer 2 if you do use it.
12.2.3.2 Contrasts
A contrast is a comparison of multiple elements. In post hoc testing, just as a pairwise comparison
will return a test statistic like q in the Tukey HSD test or the Dunn-Bonferroni-corrected t, a contrast is
a single statistic – which can be tested for statistical significance – that helps us evaluate a comparison
that we have constructed. For example, please imagine that we have six experimental conditions and
we are interested in comparing the average of the means of the first three with the average of the means
of the second three. The resulting contrast would be a value equal to the differences of the averages
defined by the contrast. The two methods of contrasting discussed here – orthogonal contrasts and
Scheffé contrasts – differ mainly in how the statistical significance of the contrast values are
evaluated.
As noted above, orthogonality is the condition of being uncorrelated. Take for example the x- and y-
axes in the Cartesian Coordinate System: the two axes are at right angles to each other, and the x value
of a point in the plane tells us nothing about the y value of the point and vice versa. We need both
values to accurately locate a point. The earlier mention of orthogonality in this chapter referred to the
orthogonality of between-groups variance and within-groups variance: the sum of the two is precisely
equal to the total variance observed in the dependent variable, indicating that the two sources of
variance do not overlap, and each explain separate variability in the data.
Orthogonal contrasts employ this principle by making comparisons between condition means that
don’t compare the same things more than once. Pairwise comparisons like the ones made in Tukey’s
HSD test, the Hayter-Fisher test, and the Dunn-Bonferroni test are not orthogonal because the same
condition means appear in more than one of the comparisons: if there are three condition means in a
dataset and thus three possible pairwise comparisons, then each condition mean will appear in two of
the comparisons.^[For this example, if we call the three condition means "C₁," "C₂," and "C₃," the
three possible comparisons are:

1. C₁ vs. C₂
2. C₁ vs. C₃
3. C₂ vs. C₃

C₁ appears in comparisons (1) and (2), C₂ appears in comparisons (1) and (3), and C₃ appears in
comparisons (2) and (3).]
By contrast (ha, ha), orthogonal contrasts are constructed in such a way that each condition mean is
represented once and in a balanced way. This is a cleaner way of making comparisons and is a more
powerful way of making comparisons because it does not invite the extra statistical noise associated
with accounting for within-groups variance appearing multiple times.
To construct orthogonal contrasts, we choose positive and negative coefficients that are going to be
multiplied by the condition means. We multiply each condition mean by its respective coefficient and
take the sum of all of those terms to get the value of a contrast Ψ .
For each set of contrasts to be orthogonal, the sum of the products of all coefficients associated with
each condition mean must be 0.
For example, in a 4-group design like the one in the example data, the coefficients for one contrast $\Psi_1$
could be:

$$c_{1j} = \left[\frac{1}{2}, \frac{1}{2}, -\frac{1}{2}, -\frac{1}{2}\right]$$
That set of contrasts represents the average of the first two condition means (the sum of the first two
condition means divided by two) minus the average of the second two condition means (the sum of the
second two condition means divided by two times negative 1).
A second contrast, $\Psi_2$, could have the coefficients:

$$c_{2j} = [0, 0, 1, -1]$$

which would represent the difference between the condition 3 mean and the condition 4 mean, ignoring
the means of conditions 1 and 2. $\Psi_1$ and $\Psi_2$ are orthogonal to each other because:

$$\left(\frac{1}{2}\right)(0) + \left(\frac{1}{2}\right)(0) + \left(-\frac{1}{2}\right)(1) + \left(-\frac{1}{2}\right)(-1) = 0$$
A third contrast, $\Psi_3$, could have the coefficients:

$$c_{3j} = [1, -1, 0, 0]$$

which would represent the difference between the condition 1 mean and the condition 2 mean, ignoring
the means of conditions 3 and 4. $\Psi_1$ and $\Psi_3$ are orthogonal to each other because:

$$\left(\frac{1}{2}\right)(1) + \left(\frac{1}{2}\right)(-1) + \left(-\frac{1}{2}\right)(0) + \left(-\frac{1}{2}\right)(0) = 0$$
Thus, $\{\Psi_1, \Psi_2, \Psi_3\}$ is a set of orthogonal contrasts.
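A minimal R sketch of checking orthogonality and computing the contrast values, assuming the four condition vectors Condition1 through Condition4 used elsewhere in this chapter:

# Contrast coefficients from the text
c1 <- c( 1/2,  1/2, -1/2, -1/2)
c2 <- c( 0,    0,    1,   -1  )
c3 <- c( 1,   -1,    0,    0  )

# Orthogonality checks: each pairwise sum of coefficient products should be 0
sum(c1 * c2); sum(c1 * c3); sum(c2 * c3)

# Contrast values: coefficients multiplied by the condition means and summed
cond.means <- c(mean(Condition1), mean(Condition2),
                mean(Condition3), mean(Condition4))
sum(c1 * cond.means)   # Psi 1
sum(c2 * cond.means)   # Psi 2
sum(c3 * cond.means)   # Psi 3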
The significance of $\Psi$ is tested using an F-statistic. The numerator of this F-statistic is $\Psi^2$: contrasts
always have one degree of freedom, the sums of squares is the square of the contrast, and the mean
squares (which is required for an F-ratio numerator) is equal to the sums of squares whenever df = 1.
The denominator of the F-statistic for testing $\Psi$ has df equal to $df_e$ from the ANOVA table and its
value is given by $MS_e \sum \frac{c_j^2}{n_j}$, where $MS_e$ comes from the ANOVA table, $c_j$ is the contrast coefficient for
condition mean $j$, and $n_j$ is the number of observations in condition $j$ (for a classic ANOVA model, $n$
is the same for all groups so the subscript $j$ isn't all that relevant):

$$F_{obs} = \frac{\Psi^2}{MS_e \sum \frac{c_j^2}{n_j}}$$
For $\Psi_1$ in the example data:

$$F_{obs} = \frac{\Psi_1^2}{MS_e \sum \frac{c_j^2}{n_j}} = \frac{(-4.54)^2}{3.68\left[\frac{(1/2)^2}{5} + \frac{(1/2)^2}{5} + \frac{(-1/2)^2}{5} + \frac{(-1/2)^2}{5}\right]} = \frac{20.61}{0.74} = 27.85$$
## [1] 7.514161e-05
Unless our α-rate is something absurdly small, we can say that $\Psi_1$ – the contrast of the average of the
first two condition means to the average of the last two condition means – is statistically significant.
For $\Psi_2$:

$$F_{obs} = \frac{(-4.15)^2}{3.68\left[\frac{0^2}{5} + \frac{0^2}{5} + \frac{1^2}{5} + \frac{(-1)^2}{5}\right]} = \frac{17.22}{1.47} = 11.71, \quad p = 0.003$$
And for $\Psi_3$:

$$F_{obs} = \frac{(-6.69)^2}{3.68\left[\frac{1^2}{5} + \frac{(-1)^2}{5} + \frac{0^2}{5} + \frac{0^2}{5}\right]} = \frac{44.76}{1.47} = 30.45, \quad p = 0.00005$$
The Ψ value for any given Scheffé contrast is found in the same way as for orthogonal contrasts, except
that any coefficients can be chosen without concern for the coefficients of other contrasts. Say we were
interested, for our sample data, in comparing the mean of condition 2 with the mean of the other three
condition means. The Scheffé Ψ would therefore be:
$$\Psi_{Scheffé} = \frac{1}{3}(y_{\bullet 1}) + (-1)(y_{\bullet 2}) + \frac{1}{3}(y_{\bullet 3}) + \frac{1}{3}(y_{\bullet 4}) = \frac{-1.042}{3} - 5.65 + \frac{4.768}{3} + \frac{8.916}{3} = -1.44$$
The F-statistic formula for the Scheffé is likewise the same as the F-statistic formula for orthogonal
contrasts:

$$F_{Scheffé} = \frac{\Psi^2}{MS_e \sum \frac{c_j^2}{n_j}}$$
The more stringent significance criterion for the Scheffé contrast is given by multiplying the critical F
value that would be used for a similar orthogonal contrast – the F with $df_{num} = 1$ and $df_{denom} = df_e$ – by
the number of conditions minus one:

$$F_{crit}^{Scheffé} = (j - 1)\,F_{crit}(df_{num} = 1, df_{denom} = df_e)$$
So, to test our example Scheffé contrast – whose observed F works out to 2.11 – we compare it to a
critical value equal to (j − 1) times the critical F for α = 0.05 with $df_{num} = 1$ and $df_{denom} = 16$:
## [1] 13.482
The observed F of 2.11 is less than the critical value of 13.482, so this particular Scheffé contrast is not
significant at the α = 0.05 level.
Critical values of the Dunnett test can be found in tables. There is also an easy-to-use DunnettTest()
command in the DescTools package. To produce the results of the Dunnett Test, enter the data array for
each factor level as a list(): the DunnettTest() command will take the first array in the list to be
the control condition. For example, if we treat Condition 1 of our sample data as the control condition:
library(DescTools)
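The call that produces the output below isn't shown; given the description above, it was presumably along the lines of:

# Dunnett test with Condition1 (the first array in the list) as the control
DunnettTest(list(Condition1, Condition2, Condition3, Condition4))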
##
## Dunnett's test for comparing several treatments with a control :
## 95% family-wise confidence level
##
## $`1`
## diff lwr.ci upr.ci pval
## 2-1 6.692 3.548031 9.835969 0.00015 ***
## 3-1 5.810 2.666031 8.953969 0.00057 ***
## 4-1 9.958 6.814031 13.101969 9.6e-07 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The basic concept behind power analysis for ANOVA is the same as it was for classical t-tests: how
many observations does an experiment require to expect to be able to reject a null hypothesis over a
long-term rate defined by our desired power level? For the t-test, that calculation was relatively
simple: given a t-distribution representing sample means drawn from a parent distribution as defined
by the null hypothesis and a measure of alternative-hypothesis expectations, one can calculate the
proportion of expected sample means that would fall in the rejection region of that null-defined t-
distribution. With ANOVA, however, there are more parent distributions reflecting the different levels
of each factor, and those parent distributions may be related to each other in complex ways
(particularly in repeated-measures and factorial ANOVA designs).
There are multiple software packages designed to help with the potentially complex task of power
analysis for ANOVA designs. For the between-groups one-way ANOVA, we can use the base R
command power.anova.test(). The power.anova.test() command takes as its options the number
of groups, the within variance (given by M Se), the between variance (given by σ : the variance of the
2
α
group means), n, α, and the desired power. To run the power analysis, leave exactly one of those
options out and power.anova.test() will estimate the missing value. For example, imagine that we
wanted to run a study that produces results similar to what we see in the example data. There are 4
groups, α = 0.05, and we desire a power of 0.9. We can estimate the within-groups variance and
between-groups variance for the next study based on rough estimate of what we would expect (we
could use the precise values from our example data but we don’t want to get too specific because that’s
just one study: to be conservative, we can overestimate the within-groups variance and underestimate
the between-groups variance), and the output will recommend our n per group:
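A call consistent with the values shown in the output below would be something like (n is deliberately left out so that power.anova.test() solves for it):

power.anova.test(groups = 4, between.var = 8, within.var = 8,
                 sig.level = 0.05, power = 0.9)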
##
## Balanced one-way analysis of variance power calculation
##
## groups = 4
## n = 5.809905
## between.var = 8
## within.var = 8
## sig.level = 0.05
## power = 0.9
##
## NOTE: n is number in each group
If we expect our between-groups variance to be 8 (about half of what we observed in our sample data)
and our within-groups variance to be about 8 as well (about twice what we observed in our sample
data), R tells us that we would require 5.81 participants per group. However, we are smarter than R, and
we know that we can only recruit whole participants – we always round up the results for n from a
power analysis to the next whole number – so we should plan for n = 6 for power to equal 0.9.
There are R packages and other software platforms that can handle more complex ANOVA models.
However, if you can’t find one that suits your particular needs, then power analysis can be
accomplished in an ad hoc manner by simulating parent distributions (using commands like rnorm())
that are related to each other in the ways one expects – in terms of things like the distance between the
distributions of levels of between-groups factors and/or correlations between distributions of levels of
repeated-measures factors – and then deriving n based on repeated sampling from those synthetic
datasets.
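Here is a minimal sketch of that ad hoc approach for a one-way between-groups design (the group means, standard deviation, candidate n, and number of replicates are all hypothetical assumptions): simulate the design many times and record how often the F-test rejects.

# Simulation-based power estimate for a one-way between-groups design
set.seed(123)
group.means <- c(0, 1, 1, 2)   # assumed pattern of population means
sigma <- 2                     # assumed within-groups standard deviation
n <- 10                        # candidate number of observations per group

sims <- replicate(2000, {
  DV <- rnorm(n * length(group.means),
              mean = rep(group.means, each = n), sd = sigma)
  grp <- factor(rep(seq_along(group.means), each = n))
  p <- summary(aov(DV ~ grp))[[1]][["Pr(>F)"]][1]
  p < 0.05
})
mean(sims)   # estimated power at this n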
Median Test
The χ² test of statistical independence was covered in depth in the chapter on differences between two
things for k = 2 and 2 × 2 contingency tables. The exact same procedure is followed for k > 2 and for
contingency tables with more than 2 rows and/or columns. The one thing to keep in mind is that the df
for the χ² test is equal to $df_{rows} \times df_{columns}$ (in all of the examples from the differences between two
things page, that product worked out to df = 1).
Everything said above about the χ² test applies to the extension of the median test: it's exactly the same
procedure, just with more than two groups.
We can’t use the sample data for the one-way between-groups ANOVA model to demonstrate these
tests because the sample is too small to ensure that f_e ≥ 5 for each cell. So, here is another group of
sample data:
The median of all of the observed data is 0.63. We can construct a contingency table with Conditions
as the columns and the status of each observed value of the dependent variable relative to the median
(greater than or equal to vs. less than) as the rows, populating the cells of the table with the counts of
values that fall into each cross-tabulated category:
We can then perform a χ² test on the contingency table for the extension of the median test:
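In R, that amounts to passing the contingency table to chisq.test() (a sketch using the cell counts echoed in the output below):
chisq.test(matrix(c(0, 10, 2, 8, 9, 1, 9, 1), ncol=4))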
##
## Pearson's Chi-squared test
##
## data: matrix(c(0, 10, 2, 8, 9, 1, 9, 1), ncol = 4)
## X-squared = 26.4, df = 3, p-value = 7.864e-06
The observed p-value is 7.9 × 10⁻⁶, and so is likely smaller than whatever α-rate we might choose.
Therefore, we can conclude that there is a significant difference between the conditions.
The closest nonparametric analogue to the classical one-way between-groups ANOVA is the Kruskal-
Wallis test. The Kruskal-Wallis test relies wholly on the ranks of the data, determining whether there is
a difference between groups based on the distribution of the ranks in each group.
We can use the example data from the one-way between-groups ANOVA to demonstrate the Kruskal-
Wallis test. The first step is to assign an overall rank (independent of group membership) to each
observed value of the dependent variable from smallest to largest, with any ties receiving the average
of the rank above and below the tie cluster:
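In R, the rank() command assigns ranks this way by default; a quick illustration with made-up values (not the example data):
rank(c(4.2, 5.1, 5.1, 6.3))
## [1] 1.0 2.5 2.5 4.0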
The sum of the ranks for each condition j is called R_j, as noted in the table above. The test statistic H
is given by:

$$H = \frac{\frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j} - 3(N+1)}{1 - \frac{\sum_{i=1}^{T}(t_i^3 - t_i)}{N^3 - N}}$$

where N is the total number of observations, n_j is the number of observations per group (they don’t
have to be equal for the Kruskal-Wallis test), T is the number of tie clusters, and t_i is the number of
values clustered in each tie: if there are no tied ranks, then the denominator is 1.¹¹
Observed values of H for relatively small samples can be compared to ancient-looking tables of
critical values based on the n per group. If – as in this case – the values are not included in the
table, we can take advantage of the fact that H can be modeled by a χ² distribution with df = k − 1.
In our case, that means that the p-value for the observed H = 15.25 is:
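That value is just the upper tail of a χ² distribution with 3 degrees of freedom, which pchisq() can compute directly (a one-line sketch):
pchisq(15.25, df=3, lower.tail=FALSE)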
## [1] 0.001615003
Of course there is, as usual, a better way afforded us through the miracle of modern computing:
## Arrange the data into long format and add Condition labels
kruskal.test(data~Condition, data=one.way.between.example.long)
##
## Kruskal-Wallis rank sum test
##
## data: data by Condition
## Kruskal-Wallis chi-squared = 15.251, df = 3, p-value = 0.001614
12.4 Bayesian One-Way Between-Groups ANOVA
Bayesian ANOVA works via the same principles as the Bayesian t-tests. There is a null model that
posits that there is no variance between groups: in that model, the data from every group are normally
distributed around the same mean, with the group distributions all sitting atop each other. In the
alternative model, there is some non-zero variance produced by the
factor(s) of interest, and the prior probabilities of those variance estimates are distributed by a
multivariate Cauchy distribution (a generalization of the univariate Cauchy distribution that served as
the prior for the Bayesian t-tests).
As with the Bayesian t-tests, the BayesFactor package provides a command structure that is
maximally similar to the commands used for the classical ANOVA. For the sample data, we can
calculate an ANOVA and use the posterior() wrapper to estimate posterior condition means,
standard deviations, and quantile estimates:
library(BayesFactor)
# The Bayes Factor software is very sensitive about variable types, so we have to
# make sure the grouping variable is coded as a factor before running the analysis
one.way.between.groups.example.df$Factor.A<-as.factor(one.way.between.groups.example.df$Factor.A)
anovaBF(DV~Factor.A, data=one.way.between.groups.example.df)
posterior.data<-posterior(anovaBF(DV~Factor.A, data=one.way.between.groups.example.df),
                          iterations=5000)  # iteration count matches the summary output below
summary(posterior.data)
##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 4.5728 0.4674 0.006609 0.006609
## Factor.A-Condition 1 -5.1719 0.8918 0.012612 0.016256
## Factor.A-Condition 2 0.9865 0.7878 0.011141 0.011452
## Factor.A-Condition 3 0.1824 0.7929 0.011213 0.011734
## Factor.A-Condition 4 4.0029 0.8777 0.012413 0.015091
## sig2 4.5923 2.2721 0.032132 0.055177
## g_Factor.A 5.7630 8.7128 0.123218 0.162367
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 3.6411 4.2688 4.5773 4.8821 5.477
## Factor.A-Condition 1 -6.7866 -5.7710 -5.2048 -4.6443 -3.283
## Factor.A-Condition 2 -0.6042 0.4818 0.9911 1.5021 2.514
## Factor.A-Condition 3 -1.3928 -0.3121 0.1922 0.7023 1.741
## Factor.A-Condition 4 2.2114 3.4612 4.0392 4.5868 5.634
## sig2 2.1423 3.1978 4.1298 5.3586 9.919
## g_Factor.A 0.5707 1.8616 3.3765 6.3762 25.685
A useful feature offered by BayesFactor is the option to change the priors for an analysis by including
a one-word option. The default prior width is “medium”; for situations where we may expect a larger
difference between groups, choosing “wide” or “ultrawide” will give greater support to alternative
models – but likely won’t give wildly different Bayes factor results – when the actual effect size is
large. The options make no difference for our sample data, but for purposes of reviewing the syntax of
the prior options:
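The calls would look something like this (a sketch: rscaleFixed is the anovaBF() option that sets the width of the prior on the fixed effect):
anovaBF(DV~Factor.A, data=one.way.between.groups.example.df, rscaleFixed="wide")
anovaBF(DV~Factor.A, data=one.way.between.groups.example.df, rscaleFixed="ultrawide")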
1. There are ways to conduct multiple pairwise tests that manage the overall α-rate when more than
two groups are being studied at a time. One way is to reduce the α-rate for each test in order to
keep the overall α-rate where one wants it (e.g., α = 0.05): this approach is generally known as
applying the Bonferroni correction but should be known as applying the Dunn-Bonferroni
correction to honor Olive Jean Dunn as well as Carlo Bonferroni. The other way is through the
use of post-hoc pairwise tests like Tukey’s Honestly Significant Difference (HSD) test and the
Hayter-Fisher test, which are discussed below.↩
2. This pair of null and alternative hypotheses may appear odd, because usually when the null
hypothesis includes the = sign, the alternative hypothesis features the ≠ sign, and when the
alternative has the > sign, the null has the ≤ sign. Technically, either pair of signs – = / ≠ or
≤ / > – would be correct to use, it’s just that because variances can’t be negative, σ² ≤ 0 is
somewhat less meaningful than σ² = 0, and σ² ≠ 0 is somewhat less meaningful than σ² > 0.↩
3. The F stands for Fisher but it was named for him, not by him, for what it’s worth.↩
4. The choice of the letter α is to indicate that it is the first (α being the Greek counterpart to A) of
possibly multiple factors: when we encounter designs with more than one factor, the next factors
will be called β, γ, etc.↩
5. The SAS software platform doesn’t even have a t-test command: to perform a t-test in SAS, one
uses the ANOVA procedure.↩
6. To my knowledge, there are only three such effect-size statistics. The third, ϵ², is super-obscure:
possibly so obscure as to be for ANOVA-hipsters only. The good news is that estimates of ϵ² tend
to fall between estimates of η² and estimates of ω² for the same data, so it can safely be
ignored.↩
7. Population-level error may seem an odd term because we so frequently talk of error as coming
from sampling error – that is the whole basis for the standard error statistic. However, even on
the population level, there is variance that is not explained by the factor(s) that serve(s) as the
independent variable(s): the naturally-occurring width of the population-level distribution of a
variable.↩
8. Honestly, the effect-size statistics for the example data are absurdly large. There should be
another category above large called stats-class-example large. Or something like that.↩
10. There are more post hoc tests, but the ones listed here (a) all control the α-rate at the overall
specified level and (b) are plenty.↩
11. The 12 is always just 12 and the 3 is always just 3, in case you were wondering.↩