
Advanced Statistics I 2021 Edition

1 Introduction
This book is a compilation of the readings developed for the Fall 2021
Semester offering of Psychology 207: Advanced Statistics.

Please don’t try to sell this book because there are about a million copyright
violations in it.

Best, DB
2 Categorizing and Summarizing Information
2.1 How to Tell a Story

Figure 2.1: A Heartwarming Tale of Extreme Weather Events and Traumatic Brain Injury

When you tell a story, you usually have to leave some things out. If you saw a
movie and a friend asked you to describe the plot, it wouldn’t be terribly
helpful if you described literally everything that happened in the movie – that
would take longer than just watching the movie – you would likely instead
discuss the basic conflict, the characters, the most interesting dialogue, and
(if your friend doesn’t mind spoilers) the film’s resolution. If your partner
asks you about your day, you wouldn’t respond with a minute-by-minute
description of each interaction you had with each person complete with an
estimate of the amplitude of their speaking voice, the ambient temperature
throughout the day, and how much your breakfast cost to the penny; you might
instead give a description of your global mood, a basic rundown of what you
did during the day, and talk about anything that stood out as particularly
funny, frustrating, or otherwise noteworthy. And if you wrote a history of an
entire country, or described the evolution of species, or told the story of the
formation of the Planet Earth, you’d have to leave a lot of details out – if you
kept all the details in then you wouldn’t survive all the way to the end.

We make choices about what to include in our stories – sometimes without even thinking about it (and sometimes we don't remember or don't know all of the details). When we summarize data, we're telling stories about that data and we make choices in doing so.

This page is all about data: how we categorize it and how we summarize it.
We are going to cover some of the different types of data, how we summarize
data, and how what we leave in and what we leave out affect the story we
tell about the data. And, of course, I have made some choices on what to
include and what not to include on the page. There’s only so much we can
cover.

2.2 Types of Data


2.2.1 Statistics and Parameters

In conversational English, pretty much any number that describes a fact can
be called a "statistic." In the field of statistics, however, the term has a
specific connotation: a statistic1 is a number that describes a sample or
samples2. Something like the proportion of 100 people polled at random
who answered "yes" to a survey question is a statistic. The average reaction
time of 30 participants in a psychology study is a statistic. Something like
the number of people who live in Canada? That's not technically a statistic.
Because the people who live in Canada are an entire population3, the
number of them is an example of a parameter4. Thus, statistics describe
samples, and parameters describe populations.

Why is knowing that distinction important? It is not – I repeat, not – so that


you can correct people who use the term statistic in casual conversation
when you know that technically they should say parameter: that’s not a good
way to make friends. It’s important for a couple of reasons:

1. It is specifically important to know whether a number refers to a sample or to a population for use in some statistical procedures.5

2. It is generally important to know that we use statistics to make inferences about parameters.

On that second point: scientists are rarely interested only in the subjects used
in their research. There are exceptions – like case studies or some clinical
trials – but usually scientists want to generalize the findings from a sample to
the population. A cognitive researcher isn’t interested in the memory
performance of a few participants in an experiment so much as what their
performance means for all humans; a social psychologist studying the
behavioral effects of prejudice does not mean to describe the effects of
prejudice for just those who participate in the study but for all of society. In
the vast majority of scientific inquiries, it is unrealistic for researchers to
measure what they are interested in about the entire population – even
researchers who do work with population-level data are working with things
like census data and are thus somewhat limited in the types of questions they
can investigate (that is, they can only work with answers to questions asked
in the census they are working with).

Hypothetically, if you could invite the entire population to participate in a
psychology experiment – if you could bring the nearly 8 billion people on
earth to a lab – then most of the statistical tests to be discussed in this course
would be irrelevant. If, say, the whole population had an average score of x
in condition A and an average score of x + 1 in condition B, then the results
would conclusively show that scores are higher in condition B than in
condition A: no need in that case for fancy stats.
Statistical tests, therefore, are designed for analyzing small samples6 and
for making inferences from those data about larger populations. Because
statistical tests are designed for samples that are much, much smaller in
number than populations, in those cases where researchers do have large
amounts of data, the tests we run will almost always end up being
statistically significant. Results of statistical tests should always be subject
to careful scientific interpretation: perhaps especially when statistical
significance is more a product of a huge sample size than it is a sign of a
meaningful finding.

There is also a notational difference between statistics and parameters.
Symbols for statistics use Latin letters, such as $\bar{x}$ and $s^2$; symbols for
parameters use Greek letters, such as $\mu$ and $\sigma^2$. So, when you see a number
associated with a Latin-letter symbol, that is a measurement of a sample, and
when you see a number associated with a Greek-letter symbol, that is a
measurement of a population.

2.2.2 Scales of Measurement

Data are bits of information7, and information can take on many forms. The
ways we analyze data depend on the types of data that we have. Here’s a
relatively basic example: suppose I asked a class of students their ages, and I
wanted to summarize those data. A reasonable way to do so would be to take
the average of the students’ ages. Now suppose I asked each student what
their favorite movie was. In that case, it wouldn’t make any sense to report
the average favorite movie – it would make more sense to report the most
popular movie, or the most popular category of movie.

Thus, knowing something about the type of data we have helps us choose the
proper tools for working with our data. Here, we will talk about an extensive
– but not exhaustive – set of data types that are encountered in scientific
investigations.

2.2.2.1 S.S. Stevens’s Taxonomy of Data


The psychophysicist S.S. "Smitty" Stevens proposed a taxonomy of
measurement scales in his 1946 paper *On the Theory of Scales of
Measurement* that was so influential that it is often described without citation,
as if the system of organization Stevens devised were as fundamental as a
triangle having three sides, or the fact that 1 + 1 = 2. To be fair, it's a pretty
good system. And, it is so common that it would be weird not to know the
terminology from that 1946 paper. There are some omissions, though, and we
will get to those after we discuss Stevens's data types.

2.2.2.1.1 Discrete Data

Discrete data8 are data regarding categories or ranks. They are discrete in
the sense that there are gaps between possible values: whereas a continuous
measurement like length or weight can take on an infinite number of values
between any two given points (e.g., the distances between 1 and 2 meters
include 1.5 meters, 1.25 meters, 1.125 meters, 1.0625 meters, etc.), a
measurement of category membership can generally only take on as many
values as there are categories, and a measurement of ranks can only take on
as many values as there are things to be ranked.

2.2.2.1.1.1 Nominal (Categorical) Data

Nominal (categorical) data9 are indicators of category membership.
Examples include year in school (e.g., freshman, sophomore, junior, senior,
1st year, etc.) and ice cream flavor (e.g., cookie dough, pistachio, rocky
road, etc.). Nominal or categorical (the terms are 100% interchangeable)
data are typically interpreted in terms of frequency or proportion, for
example: how many 1st-year graduate students are in this course?, or, what
proportion of residents of Medford, Massachusetts prefer vanilla ice
cream?
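Since nominal data are interpreted through frequencies and proportions, here is a minimal sketch in R of that kind of summary, using a made-up vector of ice cream flavors (table() and prop.table() are base R functions):

flavors <- c("vanilla", "vanilla", "pistachio", "rocky road", "vanilla")
table(flavors)              #frequency of each category

## flavors
##  pistachio rocky road    vanilla
##          1          1          3

prop.table(table(flavors))  #the same counts as proportions

## flavors
##  pistachio rocky road    vanilla
##        0.2        0.2        0.6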

2.2.2.1.1.2 Ordinal (Rank) Data

Ordinal (rank) data10 are measurements of the relative rank of
observations. Usually these are numeric ranks (as in a top ten list), but they
also can take on other descriptive terms (as in gold, silver, and bronze medal
winners in athletic competition). Ordinal (or rank – like nominal and
categorical, ordinal and rank are 100% interchangeable) data are
interpreted in terms of relative qualities. For example, consider the sensation
of pain: it's impossible to objectively measure pain because it's an inherently
subjective experience that changes from person to person. Instead, medical
professionals use a pain scale like the one pictured here:

Figure 2.2: Looks like the red face murdered six people in prison, so maybe
that’s more of an emotional pain.

Pain is therefore measured in terms of what is particularly painful (or not


painful) to the medical patient, and their treatment can be determined as a
result of whether pain is mild or severe compared to their baseline
experience, or whether pain is improving or worsening.

2.2.2.1.2 Continuous (Scale) Data

Continuous (scale) data11, in contrast to discrete data, can take an infinite
number of values within any given range. That is, a continuous measure can
take on values anywhere on a number line, including all the tiny in-between
spots. Continuous (or scale – these are also 100% interchangeable terms)
data have values that can be transformed into things like averages
(unlike categorical data) and are meaningful relative to each other regardless
of how many measurements there are (unlike rank data). There are two main
subcategories of continuous data in Stevens's taxonomy, and the difference
between those two categories is in the way you can compare measurements
to each other.

2.2.2.1.2.1 Interval Data


Interval data12 are data with meaningful subtractive differences – but not
meaningful ratios – between measurements. The classic example here has to
do with temperature scales. Imagine a day where the temperature is 1°
Celsius. That's pretty cold! Now imagine that the temperature the next day is
2° Celsius. That's still cold! You likely would not say that the day when it's
2° Celsius is twice as warm as the day before when it was 1° out, because it
wouldn't be! The Celsius scale (like the Fahrenheit scale) is measured in
degrees because it is a measure of temperature relative to an arbitrary 0.
Yes, 0 isn't completely arbitrary because it's the freezing point of water, but
it's also not like 0°C is the bottom point of possible temperatures (0 Kelvin
is, but we'll get to that in a bit). In that sense, the intervals between Celsius
measurements are consistently meaningful: 2°C is 1 degree warmer than
1°C, which is the same difference as between 10°C and 11°C, and the same
difference as between 36°C and 37°C. But ratios between Celsius
measurements are not meaningful – 2°C is not twice as warm as 1°C, 15°C is
not a third as warm as 45°C, and −25°C is certainly not negative one-third as
warm as 75°C.

2.2.2.1.2.2 Ratio Data

Ratio data13, as the name implies, do have meaningful ratios between
values. If we were to use the Kelvin scale instead of the Celsius scale, then
we could say that 2 K (notice there is no degree symbol there because Kelvin
is not relative to any arbitrary value) represents twice as much heat as 1 K,
and that 435 K represents 435 times as much heat as 1 K, because there is a
meaningful 0 value to that scale – it's absolute zero (0 K). Any scale that
stops at 0 will produce ratio data. A person who weighs 200 pounds weighs
twice as much as a person who weighs 100 pounds because the weight scale
starts at 0 pounds; a person who is 2 meters tall is 6/5ths as tall as somebody
who is 1.67 meters tall (note that it doesn't matter that you will never
observe a person who weighs 0 pounds or stands 0 meters tall – it doesn't
matter what the smallest observation is, just where the bottom part of the
scale is). Interval differences are also meaningful for ratio data, so all ratio
data are also interval data, but not all interval data are ratio data (it's a
squares-and-rectangles situation).
Here is an important note about the difference between interval and ratio
data:

[Image: a meme involving "a kid" and "quicksand"]

just replace "a kid" with "in stats class" and "quicksand" with "the
difference between interval and ratio data." The important thing is to know
that both are continuous data and that continuous data are very different
from discrete data.

2.2.2.1.3 Categories that Stevens Left Out

So, our man Smitty has taken some heat for leaving out some categories, and
subsequently people have proposed alternate taxonomies. I’m not going to
pile on poor Smitty – leaving things out is a major theme of this page! – but
keeping in mind what we said at the beginning of this section about the type
of data informing the type of analysis, there are a couple of additional
categories that I would add. Data of these types all have a place in Stevens’s
taxonomy, but have special features that allow and/or require special types of
analyses.

2.2.2.1.3.1 Cardinal (Count) Data

Cardinal (count) data14 are, by definition, ratio data: counts start at zero,
and a count of zero is a meaningful zero. But, counts also have features of
ordinal (rank) data: counts are discrete (like most of the dialogue in the
show, the title Two and a Half Men is an unfunny joke15 based on the
absurdity of continuous count data) and their values imply relativity between
each other. The larger and more varied counts become, the more appropriate
it is to treat them like other ratio data; but data with small counts (the most
famous example of this is the number of 19th-century Prussian soldiers who
died by being kicked by horses or mules – as you may imagine, the counts
were pretty small) are distributed in specific ways and are ideally analyzed
using different tools like Poisson and negative binomial modeling.
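To make the small-count point concrete, here is a minimal simulation sketch in R – rpois() draws from a Poisson distribution, and the rate of 0.61 deaths per corps per year is the figure usually quoted for the horse-kick data (treat that exact rate as an assumption here):

set.seed(1894)                     #arbitrary seed, for reproducibility
kicks <- rpois(200, lambda = 0.61) #200 simulated corps-years of deaths
table(kicks)                       #most counts are 0 or 1; big counts are rare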

2.2.2.1.3.2 Proportions

Proportions16, like counts, can rightly be categorized as ratio data but have
special features of their own. For example, the variance and standard
deviation of a proportion can be calculated both in the traditional manner
(follow the hyperlinks or scroll down to find the traditional equations) and
in its own way.17 Because proportions by definition are limited to the
range between zero and one, they tend to arrange themselves differently than
data that are unbounded (that is, data look different when they are squished).

Proportional data are also similar to count data in that they A. are often
treated like ratio data (which is not incorrect) and B. are often analyzed using
special tools. In general, distributions of proportions take the shape of beta
distributions – we will talk about those later.

2.2.2.1.3.3 Binary Data

Binary (dichotomous) data18 are, as the name implies, data that can take
one of two different values. Binary data can be categorical – as in yes or no
or pass or fail – or numeric – as in having 0 children or more than 0
children – but regardless can be given the values 0 or 1. Binary data have a
limited set of possible distributions – all 0, all 1, and some 0/some 1. We
will discuss several treatments of binary data, including uses of binomial
probability and logistic regression.
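One handy consequence of the 0/1 coding, sketched here with made-up pass/fail data: the mean of a binary variable is the proportion of 1s, which is part of why binary data connect so naturally to binomial probability.

passed <- c(1, 0, 1, 1, 0, 1, 1, 1) #pass = 1, fail = 0
mean(passed)                        #the proportion of passes

## [1] 0.75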

2.2.3 Dependent and Independent Variables, Predictor Variables and Predicted Variables

Another important way to categorize variables is in terms of input and
output. Take for example a super-simple physics experiment: rolling a ball
down a ramp. In that case, the input is whether we push the ball or not (that's
a binary variable), and the output is whether it rolls or not (also binary).
Most of the time, we are going to use the term independent variable19 to
describe the input and dependent variable20 to describe the output, and our
statistical analyses will focus on the extent to which the dependent variable
changes as a function of changes to the independent variable.

The other terminology we will use to describe inputs and outputs is in terms
of predictor variable21 and predicted (outcome) variable. Those terms are
used in the context of correlation and regression (although the terms
independent and dependent are used there as well). The predictor/predicted
terms are similar to the independent/dependent terms in that the latter is
considered to change as some kind of function of changes in the former. They
are also similar in that the former are usually assigned to be the x variable
and the latter are usually assigned to be the y variable.

They are different – and this is super-important – in that changes in an
independent variable are hypothesized to cause changes in the dependent
variable, while the changes in a predictor variable are hypothesized to be
associated with changes in the predicted variable, because correlation does
not imply causation.

2.3 Summary Statistics


2.3.1 A Brief Divergence Regarding Histograms

The histogram22 is both one of the simplest and one of the most effective
forms of visualizing sets of data. Histograms will be covered at length on the
page on data visualization, but a brief introduction here will give us helpful
tools for describing the different ways that we summarize data.

A histogram is a univariate23 visualization. Possible values of the variable
being visualized are represented on the x-axis, and the frequency of
observations of each of those values is represented on the y-axis (this
frequency can be an absolute frequency – the count of observations – or a
relative frequency – the number of observations as a proportion or
percentage of the total number of observations). The possible values
represented on the x-axis can be divided into either each possible value of
the variable or bins of adjacent possible values: for example, a
histogram of people's chronological ages might put values like
1, 2, 3, ..., 119 on the x-axis, or it might use bins like
0–9, 10–19, ..., 110–119. There is no real rule for how to arrange the
values on the x-axis, despite the fact that default values for binwidth and/or
number of bins are built in to statistical software packages that produce
histograms: it is up to the person doing the visualizing to choose the width of
bins that best represents the distribution of values in a data set.
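To illustrate that judgment call, here is a small sketch using base R's hist() on a made-up vector of ages; both calls draw the same data, just with different binwidths:

ages <- c(18, 19, 19, 20, 21, 22, 22, 23, 25, 31, 34, 47) #hypothetical data
hist(ages, breaks = seq(0, 50, by = 1))  #one bin per year: spiky, lots of gaps
hist(ages, breaks = seq(0, 50, by = 10)) #decade-wide bins: smoother, less detail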

Here’s an example of a histogram:


Figure 2.3: Histogram of Points Scored by Players in the 2019-2020 NBA Season

This histogram represents the number of points scored by each player in the
2019-2020 NBA season (data from Basketball Reference). Each bar
represents the number of players who scored the number of points
represented on the x-axis. The points are sorted into bins of 50, so
the first bar represents the number of players who scored 0 – 50 points, the
second bar represents the number of players who scored 51 – 100 points,
etc. All of the bars in a histogram like this one are adjacent to each other,
which is a standard feature of histograms and shows that each bin is
numerically adjacent to the next. That layout implies that the difference between
one bar and the next is a difference in the grouping of the one variable – gaps
between all of the bars (as in bar charts) imply a categorical difference
between observations, which is not the case with histograms. Apparent gaps
in histograms – as we see in Figure 2.3 between 1900 and 1950 and again
between 2000 and 2300 – are really bars with no height. In the case of our
NBA players, nobody scored between 1900 and 1950 points, and nobody
scored between 2000 and 2300 points.24
Again: histograms will be covered in more detail in the page on data
visualization. For now, it suffices to say that histograms are a good way to
see an entire dataset and to pick up on patterns. Thus, we will use a few of
them to help demonstrate what we leave in and what we leave out when we
summarize data.

2.3.2 Central Tendency

Central tendency25 is, broadly speaking, where a distribution of data is
positioned. The central tendency of a dataset is similar to a dot on a
geographic map that indicates a city's position: while the dot indicates a
central point in the city, it doesn't tell you how far out the city is spread in
each direction from that point, nor does it tell you things like where most of
the people in that city live. In that same sense, the central tendency of a
distribution of data gives an idea of the midpoint of the distribution, but
doesn't tell you anything about the spread of a distribution, or the shape of a
distribution, or how concentrated the distribution is in different places.

So, the central tendency of a distribution is basically the middle of a
distribution – but there are several ways to define the middle: each measure
is a different way to tell the story of the central aspect of a distribution.

2.3.2.1 Mean

When we talk about the mean26 in the context of statistics, we are usually
referring to the arithmetic mean of a distribution: the sum of all of the
numbers in a distribution divided by the number of numbers in a distribution.
If $x$ is a variable, $x_i$ represents the $i$th observation of the variable $x$, and
there are $n$ observations, then the arithmetic mean, symbolized by $\bar{x}$, is given
by:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

That equation might be a little more daunting than it needs to be.27

For example, if we have x = {1, 2, 3}, then:

$$\bar{x} = \frac{1 + 2 + 3}{3} = 2.$$

The calculation for a population mean is the same as for a sample mean. In
the equation, we simply exchange $\bar{x}$ for $\mu$ (the Greek letter most similar to
the Latin m) and the lower-case $n$ for a capital $N$ to indicate that we're
talking about all possible observations (that distinction is less important and
less frequently observed than the distinction between Latin letters for
statistics and Greek letters for parameters, but I find it useful):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}.$$
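Both formulas translate directly into R; a quick check with the toy set from the example above:

x <- c(1, 2, 3)
sum(x) / length(x) #the arithmetic mean formula, written out

## [1] 2

mean(x)            #the built-in equivalent

## [1] 2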

2.3.2.1.1 What the Mean Tells Us

1. The mean gives us the expected value28 of a distribution of data. In
probability theory, the expected value is the average event that could
result from a gamble (or anything similar involving probability): for
example, for every 10 flips of a fair coin, you could expect to get
5 heads and 5 tails. The expected value is not necessarily the most
likely value – in one flip of a fair coin, the expected value would be
1/2 heads and 1/2 tails, which is absurd (a coin can't land half-
heads and half-tails) – but it is the value you could expect, on average,
in repeated runs of gambles.

In the context of a variable $x$, the expected value of $x$ – symbolized $E(x)$ –
is the value you would expect, on average, from repeatedly choosing a single
value of $x$ at random. Let's revisit the histogram of point totals for NBA
players in the 2019-2020 season, now adding a line to indicate where the
mean of the distribution lies:
Figure 2.4: Histogram of Points Scored by Players in the 2019-2020 NBA
Season; Dashed Line Indicates Mean Points Scored

On average, NBA players scored 447.61 points in the 2019-2020 season. Of
course, nobody scored exactly 447.61 points – that's not how basketball
works. But, the expected value of an NBA player's scoring in that season
was 447.61 points: if you selected a player at random and looked up the
number of points they scored, occasionally you would draw somebody who
scored more than 1500 points, and occasionally you would draw somebody
who scored fewer than 50 points, but the average of your draws would be
the average of the distribution.

2. For any given set of data $x$, we can take a number $y$ and find the
errors29 between $x_i$ and $y$: $x_i - y$. The mean of $x$ is the number that
minimizes the squared errors $(x_i - y)^2$. For example, imagine you
were asked to guess a number from a set of six numbers. If those
numbers were x = {1, 1, 1, 2, 3, 47} and you guessed "2," then you
would be off by a little bit if one of the first five numbers were drawn,
but you would be off by a lot – 45 – if the sixth number were drawn, and
that error would look even worse if the amount you were off were
squared – 45² = 2,025. Now, you may ask, in what world would such a
scenario even happen? Well, as it turns out, it happens all the time in
statistics: when we describe data, we often have to balance our errors
to make consistent predictions over time, and when our errors can be
positive or negative and exist in a 2-dimensional x, y plane, minimizing
the square of our errors becomes super-important (for more, see the
page on correlation and regression).

3. Related to points (1) and (2), the mean can be considered the balance
point of a dataset: for every number less than the mean, there is a
number (or there are numbers) greater than the mean to balance out the
distance. Mathematically, we can say that:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

For those reasons, the mean is the best measure of central tendency for taking
all values of x into account in summarizing a set of data (the R sketch below
demonstrates these properties). While that is often a positive thing, there are
drawbacks to that quality as well, as we are about to discuss.
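Here is a small R sketch of those three properties, using the toy set from point (2):

x <- c(1, 1, 1, 2, 3, 47)

#(1) the average of many random draws approaches the mean
set.seed(207) #arbitrary seed, for reproducibility
mean(sample(x, size = 10000, replace = TRUE)) #lands close to mean(x), 9.17

#(2) the mean (to the nearest grid point) minimizes squared error
guesses <- seq(0, 50, by = 0.25)
sse <- sapply(guesses, function(y) sum((x - y)^2))
guesses[which.min(sse)]

## [1] 9.25

#(3) deviations from the mean sum to zero (up to floating-point rounding)
sum(x - mean(x))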

2.3.2.2 What the Mean Leaves Out

1. The mean is the most susceptible of the measures of central tendency to
outliers.30 In Figure 2.5, the gross revenues of major movie releases for
the year 1980 are shown in a histogram. In that year, the vast majority of
films earned between $0 and $103 million, with one exception…
Figure 2.5: Luke, I am your outlier!

The exception was Star Wars Episode V: The Empire Strikes Back, which
made $203,359,628 in 1980 (that doesn't count all the money it made in re-
releases), nearly twice as much as the second-highest grossing film (9 to 5,
which is a really good movie but is not part of a larger cinematic universe).
The mean gross of 1980 movies was $24,370,093, but take out The Empire
Strikes Back and the mean was $21,698,607, a difference of about $2.7
million (which is more than what 13% of 1980-released movies made on
their own). The other measures of central tendency don't move nearly as
much: the median changes by $63,098 depending on whether you include
Empire or not, and the mode doesn't change at all.

We’re left with a bit of a paradox: the mean is useful because it can balance
all values in a dataset but can be misleading because the effect of outliers on
it can be outsized relative to other measures of central tendency. So is the
mean’s relationship with extreme values a good thing or a bad thing? The
(probably unsatisfying) answer is: it depends. More precisely, it depends on
the story we want to tell about the data. To illustrate, please review a pair
of histograms. Figure 2.6 is a histogram depicting the daily income in one
month for an imaginary person who works at an imaginary job where they get
paid imaginary money once a month.

Figure 2.6: Daily Expenditures for a Real Month for an Imaginary Person

We can see that for 29 of the 30 days in September, this imaginary person has
negative net expenditures – they spend more money than they earn – and for
one day they have positive net expeditures (the day that they both get paid
and have to pay the rent) – they earn much more than they spend. That day
with positive net expeditures is the day of the month when they get paid.
Payday is a clear outlier – it sits way out from the rest of the distribution of
daily expenditures. But, if we exclude that outlier, the average daily
expenditure for our imaginary person is $-35.19 and if we include the outlier,
the average daily expenditure is $5.99 – the difference between our
imaginary person losing money every month and earning money every
month. Thus, in this case, using the mean with all values of x is a better
representation of the financial experience of our imaginary hero.

Now, let’s look at another histogram, this one with a dataset of 2 people.
Figure 2.7 is a histogram of the distribution of years spent as President of the
United States ofAmerica in the dataset
x = {me, F ranklin Delano Roosevelt}.

Figure 2.7: Terms Spent as US President: Me and Franklin Delano Roosevelt

Here is a case where using the mean is obviously misleading. Yes, it is true
that the average number of years spent as President of the United States
between me and Franklin Delano Roosevelt is six years. I didn’t contribute
anything to that number: I’ve never been president and I don’t really care to
ever be president. So, to say that I am part of a group of people that averages
six years in office is true, but truly useless. Thus, some judgment is required
when choosing to use the mean to summarize data.

2. This is also going to be true of the median, and to a lesser extent the
mode, but using the mean to summarize data leaves out information
about the shape of the distribution beyond the impact of outliers. In
Figure 2.8, we see three distributions of data with the same mean but
very different shapes.
Figure 2.8: Histogram of Three Distributions with the Same Mean

As shown in Figure 2.8, Distribution A has a single peak, Distribution B has
two peaks, and Distribution C has three peaks (the potential meanings of
multiple peaks in distributions are discussed below in the section on the
mode). But, you wouldn't know that if you were just given the means of the
three distributions. We lose that information when we go from a depiction of
the entire distributions (as the histograms do visually) to a depiction of one
aspect of the distributions – in this case, the means of the distributions.
Information loss is a natural consequence of summarization:31 it happens
every time we summarize data. It is up to the responsible scientist to
understand which information is being lost in any kind of summarization
(incidentally, histograms and other forms of data visualization are great ways
to reveal details about distributions of data) and to choose summary statistics
accordingly.

2.3.2.3 Median
The median32 is the value that splits a distribution evenly in two parts. If
there are $n$ numbers in a dataset, and $n$ is odd, then the median is the
$\left(\frac{n+1}{2}\right)$th largest value in the set; if $n$ is even, then the median is the
average of the $\left(\frac{n}{2}\right)$th and the $\left(\frac{n}{2}+1\right)$th largest values. That makes it sound
a lot more complicated than it is – here are two examples to make it easier:

if x = {1, 2, 3, 4, 5}, then median(x) = 3

if x = {1, 2, 3, 4}, then median(x) = (2 + 3)/2 = 2.5

2.3.2.3.1 What the Median Tells us

1. The median tells us more about the typical values of datapoints in a
distribution than does the mean or the mode. For that reason, the median
is famously used in economics to describe the central tendency of
income – income can't be negative (net worth can), so it is bounded by
0, and it has no upper bound and thus is skewed very, very positively.33

The median is used for a lot of skewed distributions in lieu of the mean not
only because it is more resistant to outliers than is the mean, but also
because it minimizes the absolute errors made by predictions. By absolute
errors we mean the absolute value of the errors $|x_i - y|$, where $y$ is the
prediction and $x_i$ is one of the predicted scores. Thus, when we use the
median in the case of income, we are saying that that representation is closer
(positively or negatively) to more of the observed values than any other
number (a quick sketch at the end of this subsection demonstrates this property).

2. The median is the basis of comparison used in two important
nonparametric tests: the Mann-Whitney U test and the Wilcoxon
signed-ranks test. It is used in those tests due to its applicability to both
continuous data and ordinal data.
The median is also the basis of a set of analytic tools known as robust
statistics, a field established to try to limit the influence of outliers and non-
normal distributions. Robust statistics as a field is beyond the scope of this
course, but I encourage you to read more if you are interested.
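As promised above, here is a quick R sketch of the absolute-error property, searching a grid of guesses over a small, skewed, made-up dataset:

x <- c(1, 2, 2, 3, 9, 40, 41) #hypothetical skewed data
guesses <- seq(0, 41, by = 0.5)
sae <- sapply(guesses, function(y) sum(abs(x - y)))
guesses[which.min(sae)] #the guess with the smallest total absolute error...

## [1] 3

median(x)               #...is the median (the mean of these data, 14, does much worse)

## [1] 3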

2.3.2.3.2 What the Median Leaves Out

1. Outliers can be important.

2. Like the mean, the median does not tell us much about the shape of a
distribution.

3. Outside of the field of robust statistics and certain nonparametric tests,
the median, unlike the mean, is not a measure of central tendency used in
classical statistical tests.

2.3.2.4 Mode

The mode34 is the most likely value or values of a distribution to
be observed. A distribution is unimodal if it has one clear peak, as in part a
of Figure 2.9. A distribution is bimodal if it has two clear peaks, as in part b.
Distributions with more than one peak are collectively known as multimodal
distributions.

Figure 2.9: A Unimodal and a Bimodal Distribution


Multimodality itself is of great scientific interest. As we will cover at length
when we discuss frequency distributions, unimodality is a common
assumption regarding patterns of data found in nature. When multiple modes
are encountered, it may be a sign that there are multiple processes going on –
for example, the distribution of gas mileage statistics for a car will have
different peaks for driving in a city (with lots of starts and stops that consume
more gas) and for driving on the highway (which is generally more fuel-
efficient). Multimodality can also suggest that there is actually a mixture of
distributions in a dataset – for example, a dataset of the physical heights of
people might show two peaks that reflect a mixture of people assigned male
at birth and people assigned female at birth, two groups that tend to grow to
different adult heights.

One note on multimodality: multiple peaks don't have to be exactly as high
as each other. Multimodality is more about a peaks-and-valleys pattern than a
competition between peaks.

2.3.2.4.1 What the Mode Tells us

1. In a unimodal frequency distribution, the mode is the maximum
likelihood estimate35 of a distribution. In terms of sampling, it's the
most likely value to draw from a distribution (because there are the
most observations of it). In multimodal distributions, the modes are
related to local maxima of the likelihood function. Don't worry too
much about that for now.

2. The mode minimizes the total number of errors made by a prediction.
A nice example that I am stealing from my statistics professor is that of
a proprietor of a shoe store. If you want to succeed in shoe-selling, you
don't want to stock up on the mean shoe size – that could be a weird
number like 8.632 – nor on the median shoe size – being off by a little
bit doesn't help you a lot here – you want to stock up the most on the
modal shoe size to fit the most customers.

3. Uniquely among the mean, median, and mode, the mode can be used
with all kinds of continuous data and all kinds of discrete data. There is
no way to take the mean or median of categorical data. You can but
probably shouldn’t use the mean of rank data36 (the median is fine to
use with rank data). Because the mode is the most frequent value, it can
be the most frequent continuous value (or range of continuous values,
depending on the precision of the measurement), the most frequent
response on an ordinal scale, or the most frequent observed category.

4. Unlike the mean and the median, the mode can tell us if a distribution
has multiple peaks.

2.3.2.4.2 What the Mode Leaves Out

1. Like the other measures of central tendency, the mode doesn’t tell us
anything about the spread, the shape, or the height on each side of the
peak of a distribution. There are some sources out there – and I know
this because I have never said it in class nor written it down in a text but
have frequently encountered it as an answer to the question what is a
drawback of using the mode? – that say that the mode is for some
reason less useful because “it only takes one value of a distribution into
account.” That’s wrong for one reason – there can be more than one
mode, so it doesn’t necessarily take only one value into account – and
misleading for another: the peak of a distribution depends, in part, on
all the other values being less high than the peak. To me, saying that
the peak of a distribution only considers one value is like saying that
identifying the winner of a footrace only takes one runner into account. I
think the germ of a good idea in that statement is that we don’t know
how high the peak of a distribution is, or what the distribution around it
looks like, but that’s a problem with central tendency in general.

2. Although, related to that last part: the mode doesn’t really account for
extreme values, but neither does the median.

2.3.3 Quantiles

Quantiles37 are values that divide distributions into sections of equal size.
We have already discussed one quantile: the median, which divides a
distribution into two sections of equal size. Other commonly-used quantiles
are:
Quantile Name    Divides the Distribution Into
Quintiles        Fifths
Quartiles        Fourths
Deciles          Tenths
Percentiles      Hundredths

At points where quantiles coincide, their names are often interchanged. For
example, the median is also known as the 50th percentile and vice versa, the
first quartile is often known as the 25th percentile and vice versa, etc.

2.3.3.1 Finding quantiles

Defining the quantile cutpoints of a distribution of data is easy if the number
of values in the distribution is easily divided. For example, if
x = {0, 1, 2, 3, ..., 100}, the median is 50, the deciles are
{10, 20, 30, 40, 50, 60, 70, 80, 90}, the 77th percentile is 77, etc. When n is
not a number that is easily divisible by a lot of other numbers, it gets a bit
more complicated because there are all kinds of tiebreakers and different
algorithms and stuff. The things that are important to know about finding
quantiles are:

1. We can use software to find them (for example, the R code is below),
2. Occasionally, different software packages will disagree with each other
and/or with what you would get by counting out equal proportions of a
distribution by hand, and
3. Any disagreement between methods will be pretty small, probably
inconsequential, and explainable by checking on the algorithm each
method uses.
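Point 2 is easy to see in R itself: quantile() implements nine different algorithms via its type argument (type = 7 is the default; per R's documentation, type = 6 is the algorithm used by SPSS and Minitab), and they occasionally disagree:

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
quantile(x, 0.8, type = 7) #R's default algorithm

## 80%
##   5

quantile(x, 0.8, type = 6) #the SPSS/Minitab-style algorithm

## 80%
## 5.6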

2.3.4 Spread

2.3.4.1 Range

The range38 is expressed either as the minimum value of a variable and the
maximum value of a variable (e.g., x is between a and b), or as the
difference between the highest value in a distribution and the lowest value in
a distribution (e.g., the range of x is b − a). For example:

x = {0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55}

Range(x) = 55 − 0 = 55

The range is highly susceptible to outliers: just one weird min or max value
can risk gross misrepresentation of the dataset. For that reason, researchers
tend to favor our next measure of spread…

2.3.4.2 Interquartile Range

The interquartile range39 is the width of the middle 50% of the data in a
set. To find the interquartile range, we simply subtract the 25th percentile of
a dataset from the 75th percentile of the data.

IQR = 75th percentile − 25th percentile
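In R, that subtraction can be written out with quantile(), or you can use the built-in IQR() shortcut (shown here with the toy vector from the R Commands section below):

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
quantile(x, 0.75) - quantile(x, 0.25)

## 75%
##   2

IQR(x) #the built-in equivalent

## [1] 2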

2.3.4.3 Variance

Variance40 is both a general descriptor of the way things are distributed (we
could, for example, talk about variance in opinion without collecting any
tangible data) and a specific summary statistic that can be used to evaluate
data. The variance is, along with the mean, one of the two key statistics in
making inferences.

There are two equations for variance: one is for a population variance
parameter, and the other is for a sample variance statistic. Both equations
represent the average squared error of a distribution. However, applying the
population formula to a sample would consistently underestimate the
variance of a sample41, and thus an adjustment is made in the denominator.

2.3.4.3.1 Population Variance

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

2.3.4.3.2 Sample Variance

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Aside from the differences in symbols between the population and sample
equations – Greek letters for population parameters are replaced by their
Latin equivalents for the sample statistics – the main difference is in the
denominator. The key reason for the difference is related to the nature of $\mu$
and $\bar{x}$. For any given population, there is only one population mean – that's $\mu$
– but there can be infinite values of $\bar{x}$ (it just depends on which values you
sample). In turn, that means that the error term $x - \bar{x}$ depends on the mean of
the sample (which, again, itself can vary). That mean is going to vary a lot
more if $n$ is small – it's a lot easier to sample three values with a mean
wildly different from the population mean than it is to sample a million
values with a mean wildly different from the population mean – so the bias
that comes from using the population mean equation to calculate sample
variance is bigger for small $n$ and smaller for big $n$.

To correct for that bias, the sample variance equation divides the squared
errors not by the number of observations but by the number of observations
that are free to vary given the sample mean. That sounds like a very weird
concept, but hopefully this example will help:

If the mean of a set of five numbers is 3 and the first four numbers are
{1, 2, 4, 5}, what is the fifth number?

$$3 = \frac{1 + 2 + 4 + 5 + x}{5}$$

$$15 = 12 + x$$

$$x = 3$$

This means that if we know the mean and all of the n values in a dataset but
one, then we can always find out what that one is. In turn, that means that if
you know the mean, then all of the values of a dataset are free to vary
except one: that last one has to be whatever value makes the sum of all the
values equal $n\bar{x}$. In general, the term that describes the number of things
that are free to vary is degrees of freedom42, and when the sample mean is
known, the degrees of freedom – abbreviated df – are equal to n − 1, so that is
the denominator we use for the sample variance.

On a technical note, the default calculation in statistical software is, unless
otherwise specified, the sample variance. If you happen to have
population-level data and want to find the population variance parameter,
just multiply the result you get from the software by (n − 1)/n to change the
denominator.
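A quick sketch of that rescaling in R, using the toy vector from the R Commands section below:

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
n <- length(x)
var(x)               #sample variance (denominator n - 1)

## [1] 2.666667

var(x) * (n - 1) / n #rescaled to the population formula (denominator n)

## [1] 2.5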

2.3.4.4 Standard Deviation

The standard deviation43 is the square root of the variance. It is a measure
of the typical deviation (reminder: deviation = error = residual) from the
mean (we can't really say "average deviation from the mean," because
technically the average deviation from any mean is zero). The standard
deviation has a special mathematical relationship with the normal
distribution, which we will cover in the unit on frequency distributions when
that kind of thing will make more sense.

As the standard deviation is the square root of the variance, there is both a
population parameter for standard deviation – which is the square root of the
population parameter for variance – and a sample statistic for standard
deviation – which is the square root of the sample statistic for variance.

2.3.4.4.1 Population Standard Deviation

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$

2.3.4.4.2 Sample Standard Deviation

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

As with the variance, the default for statistical software is to give the sample
version of the standard deviation, so multiply the result by $\sqrt{n - 1}/\sqrt{n}$ to
get the population parameter.

2.3.5 Skew

Like the variance, the skew44 or skewness is both a descriptor of the shape
of a distribution and a summary statistic that can be used to evaluate the way
a variable is distributed. Unlike the variance, the skewness statistic isn’t
used in many statistical tests, so here we will focus more on skewness as a
shape and less on skewness as a quantity.

The skew of a distribution can be described in one of three ways. A positively
skewed distribution has relatively many small values and relatively few large
values, creating a distribution that appears to point in the positive direction
on the x-axis. Positive skew is often a sign that a variable has a relatively
strong lower bound and a relatively weak upper bound – for example, we
can again think of income, which has an absolute lower bound at 0 and no
real upper bound (at least in a capitalistic system). A negatively skewed
distribution has relatively few small values and relatively many large values,
creating a distribution that appears to point in the negative direction on the x-
axis. Negative skew is a sign that a variable has a relatively strong upper
bound and a relatively weak lower bound – for example, the grades on a
particularly easy test. Finally, a balanced distribution is considered
symmetrical. Symmetry indicates a lack of bounds on the data or, at least,
that the bounds are far enough away from most of the observations to not
make much of a difference – for example, the speeds of cars on highways
tend to be symmetrically distributed: even though there is an obvious lower
bound (0 mph) and an upper bound on how fast commercially-produced cars
can go (based on a combination of physics, cost, and regard for the safety of
self and others), neither has much influence on the vast majority of
observations.
Figure 2.10 shows examples of a positively skewed distribution, a symmetric
distribution, and a negatively skewed distribution.

Figure 2.10: Distributions with Different Skews

When we talked about the mode, we talked about the peak (or peaks) of
distributions. Here we will introduce another physical feature of
distributions: tails45. A tail of a distribution is the longish, flattish part of a
distribution furthest away from the peak. For a positively skewed
distribution, there is a long tail on the positive side of the peak and a short
tail or no tail on the negative side of the peak. For a negatively skewed
distribution, there is a long tail on the negative side of the peak and a short
tail or no tail on the positive side of the peak. A symmetric distribution has
symmetric tails. We'll talk lots more about tails in the section on kurtosis.

The term skewness is really most meaningful when talking about unimodal
distributions – as you can imagine, having multiple peaks would make it
difficult to evaluate the relative size of tails. If, for example, you have one
relatively large peak and one relatively small peak, is the small peak part of
the tail of the large peak? It's best not to get into those kinds of philosophical
arguments when describing distributions: in a multimodal distribution, the
multimodality is likely a more important feature than the skewness.

2.3.5.1 Skewness statistics

As noted above, skewness can be quantified. The skewness statistic is rarely
used, and when it is used, it is usually just to note that negative values indicate
negative skew and positive values indicate positive skew. So, let's dive briefly
into the skewness statistic (and the skewness parameter), and if you should ever
need to make a statement about its value, you will know how it is calculated.

As with the variance and the standard deviation, there is an equation
describing (mostly theoretical) population-level skewness and another
equation describing sample skewness that includes an adjustment for bias in
sub-population-level samples.

2.3.5.1.1 Population Skewness

The skewness of population-level data is given by:

$$\tilde{\mu}_3 = E\left[\left(\frac{x - \mu}{\sigma}\right)^3\right] = \frac{\frac{1}{n} \sum (x - \mu)^3}{\sigma^3}$$

with $\tilde{\mu}_3$ indicating that it is the standardized third moment of the distribution
(for more on what that barely-in-English phrase means, please see the bonus
content below) and $\sigma$ indicating the standard deviation.

2.3.5.1.2 Sample Skewness

The sample skewness formula used by statistical software packages (all of
them, as far as I can tell from my research) is:

$$\text{sample skewness} = \frac{n}{(n - 1)(n - 2)} \cdot \frac{\sum (x - \bar{x})^3}{s^3}$$

where $s$ is the sample standard deviation.
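Writing that formula out by hand in R (on the toy vector from the R Commands section below) is a nice sanity check – for these perfectly symmetric data it comes out to exactly zero, matching what e1071 reports in section 2.4.8 (in general, different packages and options apply slightly different bias adjustments):

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
n <- length(x)
(n / ((n - 1) * (n - 2))) * sum((x - mean(x))^3) / sd(x)^3

## [1] 0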

2.3.6 Kurtosis

Kurtosis46 is a measure of the lightness or heaviness of the tails of a
distribution. The heavier the tails, the more extreme values (relative to the
peak) will be observed. A light-tailed or platykurtic distribution will have
very few extreme values. A medium-tailed or mesokurtic distribution will
produce extreme values at the same rate as the normal distribution (which is
the standard for comparison), and a heavy-tailed or leptokurtic distribution
will produce more extreme values than the normal distribution. An example
of a platykurtic distribution, an example of a mesokurtic distribution, and an
example of a leptokurtic distribution are shown in Figure 2.11.

Figure 2.11: Distributions with Different Kurtosis

2.3.6.1 Kurtosis statistics

The kurtosis statistic is used even less frequently than the skewness statistic
(this section is strictly for the curious reader). For a perfectly mesokurtic
distribution (read: a normal distribution), the kurtosis is 3.47

2.3.6.1.1 Population Kurtosis

The kurtosis of population-level data is given by:

$$\mathrm{Kurtosis}(x) = E\left[\left(\frac{x - \mu}{\sigma}\right)^4\right] = \frac{N \sum_{i=1}^{N} (x - \mu)^4}{\left(\sum_{i=1}^{N} (x - \mu)^2\right)^2}$$

(The excess kurtosis parameter is given by simply subtracting 3 from that value.)
2.3.6.1.2 Sample Kurtosis

The kurtosis of sample data is given by:

$$\text{sample kurtosis} = \frac{n(n + 1)}{(n - 1)(n - 2)(n - 3)} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4}$$

(the sample excess kurtosis is given by subtracting $\frac{3(n - 1)^2}{(n - 2)(n - 3)}$ from that
value), and that's all we'll say about that.

2.4 R Commands
2.4.1 mean

mean()48

example:

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
mean(x)

## [1] 4

2.4.2 median
median(x)

## [1] 4

2.4.3 mode
There is no built-in R function to get the mode (there is a mode() function,
but it means something different). But, we can install the package DescTools
to get the mode we're looking for.

install.packages("DescTools")49

library(DescTools)

Mode(x) #Be sure to capitalize "Mode" so you get the correct function

## [1] 4
## attr(,"freq")
## [1] 4

2.4.4 quantiles
quantile(array, quantiles)

Example 1: 80th percentile of x

quantile(x, 0.8)

## 80%
## 5

Example 2: quartiles of x

quantile(x, c(0.25, 0.5, 0.75))

## 25% 50% 75%
##   3   4   5

2.4.5 range
The range() command returns the endpoints of the range.

example:

range(x)

## [1] 1 7

To get the size of the range, you can use either range(x)[2] - range(x)[1]
or max(x) - min(x):
range(x)[2]-range(x)[1]

## [1] 6

max(x)-min(x)

## [1] 6

2.4.6 variance

var()

example:

var(x)

## [1] 2.666667

2.4.7 standard deviation


sd()

example:

sd(x)

## [1] 1.632993

2.4.8 Skewness and kurtosis

Skewness and kurtosis are not part of the base R functions, so we will just
need to install a package to calculate those (there are a couple of packages
that can do that; I just picked e1071).

install.packages("e1071")50

library(e1071)

skewness(x)

## [1] 0

kurtosis(x)

## [1] -0.9609375

2.5 Bonus Content


2.5.1 The mathematical link between mean, variance, skewness, and kurtosis

The mean, variance, skewness, and kurtosis of a distribution are all ways to
describe different aspects of the distribution. They are also mathematically
related to each other: each is derived using the method of moments, a really
old idea (in terms of the history of statistics, which is a relatively young
branch of math) that isn't actively used much anymore (for reasons we'll
discuss in a bit). The basic idea of the method of moments is borrowed from
physics: for an object spinning around an axis, the zeroth moment is the
object's mass, the first moment is the mass times the center of gravity, and the
second moment is the rotational inertia of the object. When applied to a
distribution of data, the rth moment $m_r$ of a distribution of size $n$ is given by:

$$m_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$$

The mean is defined as the first moment (taken about zero rather than about
the mean – the first moment about the mean is always zero), and the variance
is defined as the second moment.

The skewness is the standardized third moment. The third moment by itself is
in terms of the units of x cubed – if x is in pounds, the third moment is in cubic
pounds; if x is in volts, the third moment is in cubic volts – and that doesn't
make a ton of sense when talking about the shape of a histogram (it would be
weird to say that a distribution is skewed by three cubic pounds to the left).
So, instead, the third moment is standardized by dividing by the cube (to match
the cube in the numerator) of the standard deviation (which is the variance to
the power of 3/2). Similarly, the kurtosis is the standardized fourth moment:
the ratio of the fourth moment to the standard deviation to the fourth power
(which is the square of the variance, i.e., of the second moment).
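The moment formula is one line of R; a sketch with the toy vector from the R Commands section recovers the population-style variance as the second moment and a skewness of zero as the standardized third moment:

x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7)
moment_r <- function(x, r) mean((x - mean(x))^r) #the rth moment about the mean

moment_r(x, 2) #the second moment: the population-formula variance

## [1] 2.5

moment_r(x, 3) / moment_r(x, 2)^(3/2) #standardized third moment (skewness)

## [1] 0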

1. Statistic: a number that summarizes information about a sample.↩

2. Sample: a subset of a population↩

3. Population: the entirety of things (including people, animals, etc.) of interest in a scientific investigation↩

4. Parameter: a number that summarizes information about a population↩

5. Most importantly, whether we use fixed effects or random effects models – this is covered in the page on ANOVA and will be discussed at length in PSY 208.↩

6. There are probably exceptions, but I can't really think of any right now.↩

7. The word data is the plural of the word datum and should take plural
verb forms but often doesn’t. It’s a good habit to say things like “the
data are” and “the data show” rather than things like “the data is” and
“the data shows” in scientific communication. It’s a bad habit to correct
people when they use data as a singular noun: try to let that stuff slide
because we live in a society.↩

8. Discrete: data with a limited set of values in a limited range.↩

9. Nominal or categorical data: Data that refer to category membership or to names.↩

10. Ordinal (rank) data: Data with relative values.↩

11. Data that can take on infinite values in a limited range (i.e. data with
values that are infinitely divisible).↩
12. Interval data: Continuous data with meaningful mathematical
differences between values.↩

13. Ratio data: Continuous data with meaningful differences between values and meaningful relative magnitudes among values.↩

14. Cardinal (count) data: Integers describing the number of occurrences of events↩

15. Trenchant TV criticism↩

16. Proportion: Part of a whole, expressed as a fraction or decimal↩

17. For a proportion p with n observations, the variance is equal to $p(1 - p)$ and the standard deviation is equal to $\sqrt{\frac{p(1 - p)}{n}}$.↩

18. Binary (dichotomous) data: data that can take one of two values; those
values are often assigned the values of either 0 or 1.↩

19. Independent variable: A variable in an experimental or quasi-experimental design that is manipulated.↩

20. Dependent variable: A variable in an experimental or quasi-experimental design that is measured↩

21. Predictor variable: A variable associated with changes in an outcome. Predicted (outcome) variable: A variable associated with changes in a predictor variable↩

22. Histogram: A chart of data indicating the frequency of observations of a variable↩

23. Univariate: having to do with one variable.↩

24. In case you’re interested: the two little bars at 1950 – 2000 points and
at 2300 – 2350 points each represent single players: Damian Lillard
(who scored 1,978 points) and James Harden (who scored 2,335
points). Congratulations, Dame and James!↩
25. Central tendency: a summarization of the overall position of a
distribution of data.↩

26. Mean: generally understood in statistics to be the arithmetic mean of a distribution; the ratio of the sum of the values in a distribution to the number of values in that distribution.↩

27. There are other means than the arithmetic mean, most famously the geometric mean $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$ and the harmonic mean $\frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$. Both of those have important applications, but we won't get to those in this course.↩

28. Expected value: the average outcome of a probabilistic sample space↩

29. Error: the difference between a prediction and an observed value, also
known as residual and deviation↩

30. Outlier: a datum that is substantially different from the data to which it
belongs.↩

31. Information loss is also a major theme of this page – it’s what the
introduction was all about!↩

32. Median: The value for which an equal number of data in a set are less
than the value and greater than the value, also known as the 50th
percentile.↩
33. Let’s put income skew this way: if you put Jeff Bezos in a room with
any 30 people, on average, everybody in that room would be making
billions of dollars a year.↩

34. Mode: The most frequently-occurring value or values in a distribution
of data.↩

35. Maximum Likelihood Estimate: the most probable value of observed
data given a statistical model↩

36. The use of means and similar mathematical transformations including


the variance and standard deviation with interval data is a matter of
some debate. To illustrate, let’s imagine that you are asked to evaluate
the quality of your classes on a 1 – 5 integer scale (you won’t have to
imagine for long – it happens at the end of every semester at this
university). The argument in favor of using the mean is that if one class
has an average rating of 4.2 and another class has an average rating of
3.4, then of course the first class is generally preferred to the second
and the use of means is perfectly legitimate. The argument against using
the mean is that there is no way of knowing if, say, a rating of 2
indicates precisely twice as much quality as a rating of 1 or if the
difference between a 3 and a 4 is the same as the difference between a 4
and a 5, and that the use of means implies continuity that isn’t there. For
the record: I’m in the latter camp, and I would say just use the median or
the mode.↩

37. Quantiles: values that divide a distribution into equal parts↩

38. Range: The difference between the largest and smallest values in a
distribution of data.↩

39. Interquartile Range: The difference between the 75th percentile value
and the 25th percentile value of a distribution of data; the range of the
central 50% of values in a dataset.↩

40. Variance: A measure of the spread of a dataset equal to the average of
the squared deviations from the mean of the dataset.↩
41. A statistic that is consistently wrong in the same direction is known as a
biased estimator. The population variance equation, when applied to a
sample, is a biased estimator of the population variance.↩

42. Degrees of freedom (df): the number of items in a set that are free to
vary.↩

43. Standard deviation: The typical deviation from the mean of the data set,
equal to the square root of the variance.↩

44. Skew (or skewness): The balance of a distribution about its center↩

45. Tail(s): the area or areas of a distribution furthest from the peak.↩

46. Kurtosis: The relative sizes of the tails of a symmetric distribution.↩

47. Some prefer a kurtosis statistic that is equal to zero for a mesokurtic
distribution, and so sometimes you will see an excess kurtosis statistic
that is recentered at 0.↩

48. 1. c() is the combine command, which tells R to combine everything
inside the parentheses into one object (in this case, an array called x).
2. The [1] is just an indicator that this is the first (and in this case, only)
line of the results↩

49. You only need to install a package once to your computer. After that,
every time you start a new R session, you just have to call
library(insert package name) to turn it on. The best analogy I
have encountered to describe the process is from Nathaniel D.
Phillips’s book YaRrr! The Pirate’s Guide to R: a package is like a
lightbulb – install.packages() puts the lightbulb in the socket and
then library() turns it on.↩

3 Visual Displays of Data
3.1 About this Page
All of the original visualizations on this page were made using R. Good
visualization goes far beyond the software used to make it! Good visualization
can be done with a pencil and paper, and it can certainly be done with all
kinds of different packages. However, R happens to be excellent software
for data visualization because of all of the packages that have been developed
to work in R, so all of the packages and code used for the original figures are
visible on this page.

3.1.1 Packages Used to Make The Figures in This Chapter


library(ggplot2)
library(ggthemes)
library(kableExtra)
library(cowplot)
library(knitr)
library(MASS)
library(usmap)
library(socviz)
library(tidyverse)
library(forestplot)
library(see)
library(riverplot)
library(RColorBrewer)

3.1.2 Datasets Created for the Figures in This Chapter


## Figure 1
set.seed(77) #Setting a seed makes every random sample you take reproducible

figure1data<-data.frame(rnorm(10000, 0, 1)) #rnorm = random values from a normal distribution


colnames(figure1data)<-"x"
## Figure 2
Data<-c(3.92, 3.30, 3.92, 3.60, 3.24, 3.22, 3.06, 3.37, 3.47,
City<-c(rep("Boston", 12), rep("Seattle", 12))
Month<-rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), 2)

rainfall<-data.frame(City, Month, Data)

## Figure 3

boxplot1.df<-data.frame(rnorm(1000))
colnames(boxplot1.df)<-"data"

## Figure 4

condition1<-rnorm(1000, 4, 4)
condition2<-rnorm(1000, 8, 6)
values<-c(condition1, condition2)
labels<-c(rep("Condition 1", 1000), rep("Condition 2", 1000))
boxplot.df<-data.frame(labels, values)

## Figure 5

barchart.df<-boxplot.df

## Figure 6

barchart.df$sublabels<-c(rep("A", 500), rep("B", 500), rep("A", 500), rep("B", 500))

## Figure 7

samplehist.df<-data.frame(rnorm(10000))
colnames(samplehist.df)<-"x"

## Figure 8
x1<-rnorm(10000)
x2<-rnorm(10000, 3, 1)

### 8a
comp.hist.df<-data.frame(x1, x2)
### 8b
comp.hist.long<-data.frame(c(rep("Variable 1", 10000), rep("Variable 2", 10000)), c(x1, x2))

## Figures 9, 10, 11 use the data from Figure 5

## Figure 12
N <- 200 ### Number of random samples

### Target parameters for univariate normal distributions


rho <- 0.8
mu1 <- 1; s1 <- 2
mu2 <- 1; s2 <- 2

### Parameters for bivariate normal distribution


mu <- c(mu1,mu2) ### Mean
sigma <- matrix(c(s1^2, s1*s2*rho, s1*s2*rho, s2^2),
2) ### Covariance matrix

scatterplot.df <- data.frame(mvrnorm(N, mu = mu, Sigma = sigma))


colnames(scatterplot.df) <- c("x","y")

## Figure 13 uses data from Figure 2

## Figure 14 uses data from The Office Season 5 Episode 9

## Figure 15
clrs <- fpColors(box="royalblue", line="darkblue", summary="royalblue")
labeltext<-c("Variable", "a", "b", "c", "d", "e")
mean<-c(NA, 0.2, 1.3, 0.4, -2.1, -2.0)
lower<-c(NA, -0.1, 0.7, 0, -2.4, -2.5)
upper<-c(NA, 0.5, 2, 0.8, -1.8, -1.3)

## Figure 16 is a reproduction

## Figure 17

county_full <- left_join(county_map, county_data, by = "id")

## Figure 18
edges = data.frame(N1 = c("A1", "A1", "A1", "B1", "B1", "B1"),
N2 = c("A2", "B2", "C2", "A2", "B2", "C2"),
Value = c(33, 33, 10, 21, 54, 13),
stringsAsFactors = F)

nodes = data.frame(ID = unique(c(edges$N1, edges$N2)), stringsAsFactors = F)


nodes$x = c(1, 1, 2, 2, 2)
nodes$y = c(3, 2, 3, 2, 1)
rownames(nodes) = nodes$ID

## Figure 19 is a reproduction

## Figure 20

Values<-c(rchisq(10000, df=1),
rchisq(10000, df=2),
rchisq(10000, df=3),
rchisq(10000, df=4),
rchisq(10000, df=5),
rchisq(10000, df=6),
rchisq(10000, df=7),
rchisq(10000, df=8),
rchisq(10000, df=9),
rchisq(10000, df=10),
rchisq(10000, df=11),
rchisq(10000, df=12))

df<-rep(1:12, each=10000)

small.multiple.df<-data.frame(df, Values)

3.2 Making the Audience Smarter


Figure 3.1: Edward Tufte (b. 1942)

Statistician, data scientist, sculptor, and painter Edward Tufte (pictured at
right) is responsible for many of the advances in data visualization in the
latter part of the 20th century and early 21st century and coined a lot of the
vocabulary we will use to describe both good and bad elements of data
visualization. One of his guiding principles, paraphrased, is that our goal in
presenting statistical analysis is not to dumb down our content for the
consumer but to make our audience smarter. Data visualization is an
extension of summarization and categorization of data: it’s another tool we
have to tell stories about science. It is our responsibility to use data
visualization to help illustrate our stories. To that end, when we visualize
data, we should try at all times to:

1. include all relevant data,
2. make clear what is important about the data,
3. avoid extra items that may distract a reader, and
4. not present the data in misleading ways.

When we share data, we are teachers of the content of our science, and good
data visualization is one of our most powerful teaching tools. It’s also really
easy to mislead and distract with bad data visualization: the responsibility
lies with us to be effective and honest communicators.
3.3 Essentials of Good Visualization
Modern software is making it increasingly easy to create visualizations of all
kinds of data.1 Regardless of the simplicity or complexity of figures, there are
several principles that apply to all good data visualization.

3.3.1 Maximize the Data-ink Ratio

Data-ink ratio is one of Tufte’s contributions to our data visualization
vocabulary. Data-ink (which, I guess, now should also be called “data-
pixels”) refers to all of the elements in a visualization that describe the data
themselves, including: the bars in histograms and bar charts, the points in a
scatterplot, the lines in a line plot, etc. There are other elements in figures that
aren’t specifically the data but help to put the data in context: including axes,
axis labels, arrows and annotations, and trendlines. The principle of
maximizing the data-ink ratio reminds us to make as much of our figures
about the data as possible.

For example, please see the pair of charts in Figure 3.2. Chart A is based on
the house style of the magazine The Economist. It’s not bad! But, there are a
few unnecessary elements. Please compare Chart A to Chart B: the exact
same data are represented, and you lose nothing by removing the background
color, nor by removing the horizontal gridlines, nor by removing the ticks on
the axes, nor by removing the axis lines. In fact, we gain focus on the data in
Chart B by removing all of the unnecessary elements. If possible, remove all
of the elements that you can while maintaining all of the information necessary
to understand the data, and when in doubt on whether to remove an element,
go ahead and try your figure without it: you may find that it wasn’t as
necessary as you thought.

lines<-ggplot(figure1data, aes(x))+geom_histogram(binwidth=0.1)+
theme_economist()+
ggtitle("A")
nolines<-ggplot(figure1data, aes(x))+geom_histogram(binwidth=0.1)+
theme_tufte(ticks=FALSE, base_size=16, base_family="sans")+
ggtitle("B")
keepitclean<-plot_grid(lines, nolines, ncol=2)
keepitclean
Figure 3.2: Keep it Clean!

So, how can we know precisely what the dimensions of our figures are
without gridlines? How do we know exactly how high a bar is, or where
exactly a point lies, without ticks on the axes for reference? Well, here’s the
thing: you don’t need to know any of that stuff, because:

Figures are for patterns and comparisons.

If the reader needs to know precise values, put them in text and/or a table.
The purpose of figures is not to show, for example, the means of two sets of
data but to help people get an idea of the relative magnitudes of those means.
Data visualizations that are meant to elicit careful examination from the reader
– to ask the reader to stare really closely at a figure to discern tiny differences
and distances from ticks and gridlines – are counterproductive because they
make your story harder to understand instead of easier. Precision is important,
but that’s what text and tables are for.
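To illustrate that division of labor, here is a minimal sketch (with made-up numbers, and using the kable() function from the knitr package loaded earlier) of pairing a figure with a table that carries the exact values:

precise.values<-data.frame(Group = c("Condition 1", "Condition 2"),
                           Mean = c(4.02, 8.11), #hypothetical values for illustration
                           SD = c(3.98, 6.03))

#the figure shows the comparison; the table carries the precise values
kable(precise.values, digits = 2,
      caption = "Exact summary statistics to accompany the figure")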

3.3.2 When Not to Visualize Data


Related to the above point, a single number does not necessitate data
visualization. And yet: that’s exactly what’s happening with an x out of y
figure like this:

Not only are such figures unnecessary, they’re often wrong: except in the rare
case that a proportion is exactly 9/10, there has to be rounding involved. So,
just use a number! If you want, you can make it really big to get people’s
attention, like this:
90%
People are generally pretty good about understanding single values – there’s
no reason to insult their intelligence (and potentially be inaccurate) with
paper-doll-looking figures.

3.3.3 Lines and Angles


As a species, humans are pretty good at understanding the relative length of
lines and pretty bad at discriminating between different angles. For that
reason, a bar chart will always be preferable to a pie chart, a dotplot will
always be preferable to a pie chart, an area plot will always be preferable to
a pie chart, in fact, everything will be preferable to a pie chart – for more,
please see the section on using pie charts below.

A close relative of the pie chart – and one with the same fundamental problem
as the pie chart – is the donut chart. Here’s an example that I got from
datavizcatalogue.com:

And here is its even more insidious cousin, the 3D donut chart (from
amcharts.com):
And here is an absolute monstrosity from slidemembers.com, look upon it and
gaze into the face of pure evil:
As mentioned earlier, angles are hard enough for us to process. We gain
nothing from seeing the side of a donut chart, or seeing it from multiple angles,
or torn apart and reassembled. Don’t do any of that stuff. Which brings us to…

3.3.4 Ducks
Duck is another Tufte term – it refers to any kind of ornamentation on a figure
that has no actual relevance to the data.
Figure 3.3: The Big Duck in Flanders, NY

Tufte got the term “duck” from the building pictured above. All of the ducky
elements of the building are functionally useless: they are just for decoration
(it used to be a place that sold ducks and duck eggs, now it’s a tourist
attraction).

Ducks are found in a lot of popular publications – USA Today is a frequent
offender and the most perfect example of the form is shown below – in the
form of illustrations and other adornments. You can occasionally find them in
scientific writing and reports: a shadow on a figure is a duck. An unnecessary
third dimension and/or perspective on a figure is a duck. Don’t take attention
away from the creative work that led to scientific discovery with creative
work that hides the important information.

3.3.5 Annotations

Not all additions to charts are ducks: some are quite useful. Annotations can
help draw attention to the important parts of a visualization. NBA analyst (and
one-time NBA executive, and geographer by training) Kirk Goldsberry does
really nice work with annotation in his basketball-themed data visualizations,
for example, this chart shows Stephen Curry’s shooting results from the 2012-
2013 NBA season: with annotations on top of the patterns, he provides more
information about the patterns in the data and calls attention to the main points.
The benefit of annotation can be as simple as replacing legends to improve
readability. In Figure 3.4, the chart on the right uses annotations instead of the
legend shown in the chart on the left.

rainfalla<-ggplot(rainfall, aes(x=Month, y=Data, group=City))+


geom_line(aes(color=City))+
geom_point(aes(color=City))+
scale_x_discrete(limits=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))+
theme_tufte(base_size=12, ticks=FALSE, base_family="sans")+
labs(x="Month", y="Average Rainfall in Inches")+
theme(legend.position="bottom")+
ggtitle("A")+
ylim(0, NA)

rainfallb<-ggplot(rainfall, aes(x=Month, y=Data, group=City))+


geom_line(aes(color=City))+
geom_point(aes(color=City))+
scale_x_discrete(limits=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))+
theme_tufte(base_size=12, ticks=FALSE, base_family="sans")+
labs(x="Month", y="Average Rainfall in Inches")+
theme(legend.position="none")+
annotate("text", x=c(9, 11), y=c(3, 5.8), label=c("Boston", "Seattle"))+
ggtitle("B")+
ylim(0, NA)

plot_grid(rainfalla, rainfallb, nrow=1)

Figure 3.4: Making Legends Into Annotations

Replacing the legend with annotations reduces the effort the reader has to
make2: their eyes don’t have to leave the lines of data. That may seem like an
extremely small level of effort to be saving, but here’s an important fact about
scientific writing:

Reading scientific writing is exhausting and every little bit of relief helps.

3.3.6 Lying
The classic example of misleading with data visualization is starting a y-axis
at a value other than zero. For example, the following chart uses a y-axis that
starts at 6,000,000 to exaggerate the differences between the two bars:

I had never thought of the possibility of messing with the x-axis to
misrepresent data, but then, in the Spring of 2020, the State of Georgia
showed us something new:
The Georgia Health Department used a color scheme that makes it hard to see
the dates on the x-axis (obviously not the biggest problem here), but if you
look closely you can see that the dates are not in chronological order, giving
the impression that COVID-19 cases were steadily decreasing over time.

3.3.7 Colors

There is nothing I can write about the use of colors in data visualization that
could possibly improve on this blog post by Lisa Charlotte Rost. Go read it.

3.3.8 Fonts

As noted above, the readers of scientific content are working hard, and any
little bit you can do to give them a break is a good deed done. Sans serif fonts,
generally speaking, are easier to read than serif fonts because they convey the
same amount of information with less ornamentation.3 So, if given the choice,
sans serif fonts are preferable. However, it’s also nicer-looking if the font in
the figures matches the text of a document. I tend to think that mismatched fonts
are jarring to the extent that it outweighs any benefit conferred by sans-serif
fonts, so if you know your text is going to be, say, written in Times New
Roman, I would use that in your figures as well.
3.4 Types of Visualization
Here we’re going to run through some of the more common and useful forms
of data visualization. This is by no means an exhaustive list; in fact, there
really is no exhaustive list of data visualizations because new ones can
always be created. But, using relatively popular forms like the ones below
(when appropriate) has the advantage of leveraging people’s experience with
these forms for understanding your data, which will likely save your reader
some cognitive effort.

3.4.1 Boxplots

The boxplot, also known as the box-and-whisker plot, is an invention of John
Tukey – you may have heard of his post-hoc ANOVA test. The boxplot is a
visual representation of summary statistics of a distribution of data. Because
it’s Tukey’s invention, those summary statistics are usually (and by default in
most statistics software) his preferred descriptors of distributions, which are
based on quartiles.

The lines in a boxplot are labeled in Figure 3.5. The horizontal line across the
box is usually the median and the lower and upper sides of the box are usually
the 25th and 75th percentiles, respectively. The whiskers – those lines that
extend on either side from the center of each box – represent a definition of
the range of values not considered to be outliers. Following Tukey’s
recommendations, the default in R is that the top whisker extends from the
75th percentile up to the largest observed value that does not exceed the
75th percentile plus 1.5 times the interquartile range. In other words, the
length of the line is at most 1.5 times the interquartile range statistic: it can
be a little less based on where the observed data lie. Likewise, the lower
whisker extends from the 25th percentile down to the smallest observed value
that is not less than the 25th percentile minus 1.5 times the interquartile range.
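If that description is hard to follow in prose, here is a minimal sketch of the whisker arithmetic in R, checked against base R’s boxplot.stats() (which uses Tukey’s slightly different “hinge” version of the quartiles, so the numbers can differ by a hair):

set.seed(77)
x<-rnorm(1000)

q1<-quantile(x, 0.25) #25th percentile
q3<-quantile(x, 0.75) #75th percentile
iqr<-q3 - q1          #interquartile range

#whiskers reach to the most extreme observations still inside the fences
upper.whisker<-max(x[x <= q3 + 1.5*iqr])
lower.whisker<-min(x[x >= q1 - 1.5*iqr])

#boxplot.stats() returns lower whisker, lower hinge, median,
#upper hinge, and upper whisker in its $stats element
boxplot.stats(x)$stats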

Any value in a dataset that falls outside of the whiskers is considered an
outlier and is plotted individually.

ggplot(boxplot1.df, aes(y=data))+
geom_boxplot()+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(x="Dataset", y="Values")+
theme(axis.text.x=element_blank(), axis.title.x=element_blank())+
annotate("text", x=1.75, y=0, label="median", size=6)+
annotate("text", x=c(-1.75, -1.75), y=c(-0.652, 0.667), label=c("25th percentile", "75th percentile"), size=6)+
annotate("text", x=c(0.8, 0.8), y=c(-2.508464, 2.560529), label=c("end of whisker", "end of whisker"), size=6)+
annotate("text", x=c(-0.8, -0.8), y=c(2.973052, -3.03701), label=c("outlier", "outlier"), size=6)+
geom_segment(x=1.2, xend=0.5, y=0, yend=0, arrow=arrow())+
geom_segment(x=-1.2, xend=-0.5, y=-0.652, yend=-0.652, arrow=arrow())+
geom_segment(x=-1.2, xend=-0.5, y=0.667, yend=0.6667, arrow=arrow())+
geom_segment(x=0.76, xend=0.05, y=2.560529, yend=2.560529, arrow=arrow())+
geom_segment(x=-0.76, xend=-0.05, y=2.973052, yend=2.973052, arrow=arrow())+
geom_segment(x=-0.76, xend=-0.05, y=-3.03701, yend=-3.03701, arrow=arrow())+
xlim(-2, 2)

Figure 3.5: Anatomy of the Boxplot

It is possible to override the default values of the boxplot structure – you
could, for example, plot the mean instead of the median, or use a different
outlier definition – just remember, if you do so, to include a clearly visible
caption explaining any deviation from the standard definitions to your reader
(actually, you might want to include a caption explaining the values even if
you use the defaults, because they are hardly common knowledge).

Boxplots are useful because they present a simplified view of the shape of a
distribution. For example, if the 25th percentile line is much closer to the
median line than the 75th percentile line is, that’s an indicator that the smaller
values are more bunched together and that the distribution has a positive
skew. Boxplots are even more useful when we compare boxes between
different datasets, as in the example in Figure 3.6. When we have boxes for
multiple groups, we can easily compare the median of one group to another,
the percentiles of one group to another, compare the outliers in each group to
each other, etc.

ggplot(boxplot.df, aes(x=labels, y=values))+


geom_boxplot()+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)+
labs(x="Condition", y="Values")

Figure 3.6: Boxplot of Two Groups of Data


3.4.2 Bar Charts

Like boxplots, bar charts are visualizations of summary statistics. Bar charts
differ from boxplots because they tend to represent fewer summary stats than
do boxplots: typically, they show means and sometimes indicate measures of
variance about the means.

Figure 3.7 is a sample bar chart: each bar represents a mean. These are the
same data used in Figure 3.6, so we can compare the bars to the boxplots
above.

ggplot(barchart.df, aes(x=labels, y=values))+


stat_summary(fun="mean", geom="bar")+
stat_summary(fun.data="mean_se", geom="errorbar", width=0.25)+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)+
labs(x="Group", y="Values", caption="Error Bars Indicate Standard Error of the Mean")

Figure 3.7: A Sample Bar Chart


The bars in Figure 3.7 allow us to compare the means of the two groups.
Overlain on the tops of the bars are error bars, which are used to give some
idea of the uncertainty of the measurement on each bar. In this case, the length
of the part of the error bar above the mean and the length of the part of the
error bar below the mean are each equal to the standard error of the mean
(which will be discussed at length in the page on differences between two
things). Essentially, an error bar is saying the bar is our estimate of the
statistic based on our data, but given repeated measurement the statistic
could likely be anywhere in this range. The best practice for bar charts is
always to include error bars to give context to your data.
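For reference, here is a minimal sketch of what mean_se is computing behind the scenes for the Condition 1 data generated earlier:

#the standard error of the mean: sd divided by the square root of n
se<-sd(condition1)/sqrt(length(condition1))

#the error bar endpoints drawn by stat_summary(fun.data="mean_se")
c(mean(condition1) - se, mean(condition1) + se)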

Figure 3.8 shows a common variation on the simple bar chart: the grouped bar
chart. In this type of chart, we group bars together for easy comparison: for
example, in two experiments with two conditions in each experiment, it makes
sense to use two groups of two bars (with error bars on each bar, naturally).

ggplot(barchart.df, aes(x=labels, y=values, fill=sublabels))+


stat_summary(fun="mean", geom="bar", position="dodge")+
stat_summary(fun.data="mean_se", geom="errorbar", width=0.25, position=position_dodge(0.9))+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)+
labs(x="Group", y="Values", caption="Error Bars Indicate Standard Error of the Mean")+
scale_fill_discrete(name="Subgroup")+
theme(legend.position="bottom")
Figure 3.8: A Sample Grouped Bar Chart

3.4.3 Histograms
Histograms, as described in the page on categorizing and summarizing
information, are simple and effective ways of showing the entire distribution
of a single variable in visual form. The bars of a histogram represent on the y-
axis the frequency or proportion (either is fine but make sure you label which
one you’re using on the y-axis!) of values in ranges – known as bins – defined
on the x-axis. The bars on a histogram are adjacent to each other, indicating
that membership in a bin is based only on the way the bins are defined: values
in one bin are greater than the values in the bin on the left and less than the
values in the bin on the right but are not categorically different (as they are in
bar charts). Apparent gaps in the x-axis indicate that there are no values in the
data that fit that bin (but the bin is still there). An example histogram is shown
in Figure 3.9:

ggplot(samplehist.df, aes(x))+
geom_histogram(binwidth=0.1)+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)+
labs(x="Variable Values", y="Frequency")

Figure 3.9: Sample Histogram

Visualizing the data in Figure 3.9 shows us that the distribution of the variable
is roughly symmetric, with a peak around 0 and tails reaching to −4 and 4.
There are relatively many observations between −1 and 1 and relatively few
less than −2 or greater than 2.

Sometimes, it is helpful to compare histograms (in the same way we compare
boxplots). Figure 3.10 depicts two ways to compare two histograms: in Chart
A, the histograms are stacked (with the same x-axis for proper comparison);
in Chart B, the histograms are overlaid.

set.seed(77)

colnames(comp.hist.long)<-c("Variable", "Values")
hist1<-ggplot(comp.hist.df, aes(x1))+
geom_histogram(binwidth=0.1)+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(x="Values of Variable 1", y="Frequency")+
scale_x_continuous(limits = c(-4, 7))+
ggtitle("A: Stacked")

hist2<-ggplot(comp.hist.df, aes(x2))+
geom_histogram(binwidth=0.1)+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(x="Values of Variable 2", y="Frequency")+
scale_x_continuous(limits = c(-4, 7))+
ggtitle(" ")

stacked_hist<-plot_grid(hist1, hist2, nrow=2)

overlay.hist<-ggplot(comp.hist.long, aes(Values, fill=Variable))+


geom_histogram(alpha=0.5, position="identity")+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(x="Values", y="Frequency")+
theme(legend.position="bottom")+
scale_fill_manual(values=c("dodgerblue3", "gray40"))

plot_grid(stacked_hist, overlay.hist+ggtitle("B: Overlain"), nrow=1)


Figure 3.10: Two Forms of Comparative Histograms

3.4.4 Combining Histogram Elements with Bar Chart Elements
As noted above, bar charts are based on summary statistics, and as discussed
at length in the page on categorizing and summarizing information, the act of
summarizing invariably involves information loss. Taken together, that means
that while bar charts are great for comparing a few aspects of different groups
of data, a lot of details about those groups are lost.

There are several options that combine the benefits conferred by the shapes of
histograms with bar charts. One method is known as a jitterplot, where the
individual data points are represented by literal points. Figure 3.11 is an
example using the data used to make the bar chart in Figure 3.7.

ggplot(barchart.df, aes(x=labels, y=values))+


geom_jitter()+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)
Figure 3.11: An Example Jitterplot

The jitterplot gives a visual representation of the dispersion of the data that
may be more salient than error bars. One potential drawback of a jitterplot is
that the horizontal distribution of the data is artificial – if multiple points in a
dataset have the same value, a jitterplot forces them apart horizontally so that
they don’t occupy the same space and all points are visible. Thus, the width
adds another dimension that might inflate the perceived dispersion of the
data.

Another option is the violin plot. Like a jitterplot, a violin plot uses width to
indicate concentration of data within a set. Figure 3.12 is an example violin
plot, again using the same data as in Figure 3.7 and Figure 3.11:

ggplot(boxplot.df, aes(x=labels, y=values))+


geom_violin()+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)

Figure 3.12: An Example Violin Plot


Violinplots share with jitterplots the problem of overestimating perceived
dispersion via the addition of the dimension of width. For a violinplot, that
issue can be ameliorated by using a half violin plot, as shown in Figure 3.13.
By forcing all of the width to go in one direction, it makes comparing relative
width easier.

ggplot(boxplot.df, aes(x=labels, y=values))+


geom_violinhalf()+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)

Figure 3.13: An Example Half-violin Plot

A final option is to use comparative histograms as in Figure 3.10. That tends
to be my preferred option, but hey, people love bar charts.

3.4.5 Scatterplots
Scatterplots are visualizations of pairs of measurements. Each point in a
scatterplot represents two measurements related to the same element: one
measurement is represented by the x-coordinate and the other is represented
by the y coordinate. Scatterplots are closely affiliated with the statistical
analyses of correlation and regression: a line of best fit (also known as a
least-squares regression line) is often included in scatterplots to highlight the
predominant trend in the data, and that line is determined on the basis of
correlation and regression analysis statistics. Figure 3.14 presents an example
of a scatterplot with a line of best fit based on a linear regression model.

ggplot(scatterplot.df, aes(x=x, y=y))+


geom_point()+
geom_smooth(method="lm", se=FALSE)+
theme_tufte(base_size=16, base_family="sans", ticks=FALSE)

Figure 3.14: A Scatterplot of Randomly Generated Data with r ≈ 0.8 Between x and y

3.4.6 Line Charts


Line charts are most frequently time-series charts: they indicate change in a
variable y across regularly spaced intervals of x (time is a natural candidate
for x). An example line chart is depicted in Figure 3.15.
rainfallb+
scale_x_discrete(limits=c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

Figure 3.15: A Sample Line Chart Indicating Time Series

While line charts are fairly straightforward, I would recommend one bit of
caution: watch out for the post hoc ergo propter hoc4 fallacy: just because
event B comes after event A does not mean that A caused B. While time-
based charts help us put events in chronological context, we must always be
aware that other factors may be at play.

3.4.7 Pie Charts


Figure 3.16: On the Proper Use of Pie Charts

3.4.8 Forestplots
Forestplots are kind of like box-and-whisker plots where the boxes are
simpler and the whiskers get most of the attention. They are visualizations of
interval estimates5. For example, Figure 3.17 shows interval estimates for
five variables (labeled A, B, C, D, and E).

forestplot(labeltext, mean, lower, upper, xlab="Bars indicate interval estimates")


Figure 3.17: Sample Forestplot

3.4.9 Heatmaps
Heatmaps use different hues to indicate patterns of intensity. They are useful
for visualizing data that vary by region – if you were to make a heatmap of the
places in your home where you spend the most time, and you’re like me, you
might have a high-intensity area on your favorite part of the couch and areas of
slightly lower intensity on either side of your favorite part of the couch.

The combination of spatial data and intensity data makes heatmaps well-suited
to visualizing functional magnetic resonance imaging (fMRI) data. In mapping
the flow of oxygenated blood to parts of the brain that are active during tasks
of interest, fMRI data show which regions are most intensely activated
during a task, those which are somewhat less activated, and those that are not
activated at all (relative to baseline activity, that is). For example, Figure 3.18,
taken from an article on brain activity in songbirds, indicates the areas of the
brains of finches that respond to auditory stimuli (the article describes how
you get the bird into the brain scanner, if you’re curious). The white areas
indicate more intense activation, and the red areas indicate less intense
responses.
Figure 3.18: Sample Heatmap from fMRI Data

3.4.10 Choropleth maps


Choropleth maps are visualizations of data with geographic components.
(Choropleth is a word invented in the 20th century, made by combining the
Greek words for, approximately, “place” and “many things” – the “pleth” is
the same root as in the word “plethora.”) One of my all-time favorite
choropleth maps is this one from Maps by Nik that shows all of the areas in
the United States where nobody lives:
Choropleth maps have excellent data-ink ratios: they can show millions of
bits of data in addition to all of the geographic coordinates needed to place
them. They’re also, relative to other data visualizations, pretty to look at.
However, some restraint must be exercised when using maps to visualize
geographic data.
3.4.10.1 Three Cautionary Points About Maps

1. Things Tend to Happen Where The People Are

Here is a map, courtesy of ScrapeHero.com, of the 13,816 McDonald’s
locations in the United States as of September 1, 2020:

Here is another map, this one of the locations of Hospitals in the USA
according to 2017 data from the American Hospital Association:
Those maps look suspiciously similar! Are they building hospitals near
places with lots of McDonald’s locations? Like, is McDonald’s food so
unhealthy that there is higher demand for health services? IS THERE A
CONSPIRACY BETWEEN BIG FAST FOOD AND BIG HOSPITAL?

No. No, there is not. There is a clear mediating variable, and it’s population
density. Figure 3.19 is a map of population density in the United States:
p <- ggplot(data = county_full,
mapping = aes(x = long, y = lat,
fill = pop_dens,
group = group))

p1 <- p + geom_polygon(color = "gray90", size = 0.05) + coord_equal()

p2 <- p1 + scale_fill_viridis_d(labels = c("0-10", "10-50", "50-100", "100-500",
"500-1,000", "1,000-5,000", ">5,000"))+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(fill="Population per Square Mile")+
theme(legend.position="bottom", axis.text=element_blank(), axis.title=element_blank())

p2

Figure 3.19: Population Density in the USA


Any good map involving demographic information has to take population
patterns into account; to not do so is a form of base rate neglect. We as media
consumers have to take population patterns into account when presented with
geographic information, too. For example: be wary of any assertion that Los
Angeles County has the most of anything, because LA County is by far the
most populous county in the USA.

2. Not Everything is a Weather Map

Statistical maps are designed to indicate patterns. It is tempting to infer things
like spread or contagion between regions. In the case of a weather map, this
is a reasonable thing to do. For example, here is a sample weather map for the
United States on September 13, 2020:

From a weather map, we can learn things about local climates and make
inferences about future weather events; for example: the weather in New York
City at any given time is a good predictor of the weather in Boston a few
hours later.

However, there are lots of things that don’t spread like weather. Below is a
map from the US Census indicating the rate of uninsured individuals by state.
This is a good example of how policy differences aren’t necessarily
predictable between neighboring entities. In the map below, we can see that
Arkansas has a low rate of uninsured people relative to neighboring states,
particularly Oklahoma, Texas, and Mississippi – all of which, unlike
Arkansas, have rejected the expansion of Medicaid, the federally-funded
insurance program for low-income and/or disabled individuals. For
somebody who is quite familiar with the locations of US states on a map, a
map like this might serve as an easy-to-read reference, but it is misleading to
think of the uninsured rate as a feature of geography, as things like climate,
latitude, proximity to bodies of water, etc. are far less relevant than local
policy initiatives that alter conditions based on state boundaries.

[Source: U.S. Census Bureau]


3. Maps, themselves, can be misleading.

Maps are a 2D representation of a 3D world, and as such require distortion.
For relatively small areas of the world, this isn’t a huge problem: a city map
isn’t too much affected by the distortion involved in representing a 3D surface
with a 2D one. On the scale of countries and of the world, though, the choice
of distortion used to flatten the Earth can lead to serious misrepresentations.

Famously, the Mercator Projection wildly overestimates the surface area of
the Earth close to the poles relative to the area of the Earth close to the
Equator. As a result, the areas of countries in Europe and North America look
much bigger relative to countries in Asia, Africa, and South America. The
Mercator Projection is useful for navigation because it preserves shapes and
directions – good for 16th century sailors – and promotes a Eurocentric view
of the world – bad for the truth. Please compare below the Mercator
Projection on the left with the Gall-Peters Projection on the right, which
preserves the relative areas of landmasses.

Figure 3.20: The Mercator Projection, left, and the Gall-Peters Equal Area
Projection, right
Geography can also be misleading in visualizing data about people.
Geographic maps can show us where people live, but if we are interested in,
say, frequencies of person-level data per US state, the relative sizes of states
can obscure the relative populations of states.

3.4.11 Alluvial Diagrams (aka Sankey Plots, aka Riverplots, aka Ribbonplots)

Alluvial diagrams, also known as Sankey plots, Riverplots, and/or
Ribbonplots (any of those are fine)6 are visualizations of flow. For example,
Figure 3.21 shows a voting pattern of two fictional elections. In this
hypothetical scenario, there are two candidates who received votes in the first
election and those same two candidates received votes in the second election
along with a third, new candidate. The alluvial diagram shows the pattern of
the people who voted in both elections. Each candidate is represented on the
sides of the diagram: the points representing each candidate are known as
nodes. Some people voted for the same candidate: these are represented by
lines known as edges that connect one candidate in the first election to the
same candidate in the second election. People who changed their vote are
represented by edges that flow from the node of one candidate in the first
election to a different candidate in the second election.

palette = paste0(brewer.pal(7, "Set2"), "80")


styles = lapply(nodes$y, function(n) {
list(col = palette[n+1], lty = 0, textcol = "black")
})
names(styles) = nodes$ID

rp <- makeRiver(nodes, edges, node_labels = c("Candidate A", "Candidate B", "Candidate A", "Candidate B", "Candidate C"))

class(rp) <- c(class(rp), "riverplot")

plot(rp, plot_area = 1, yscale=0.01)

Figure 3.21: A Sample Alluvial Diagram


The main caution I would offer regarding the use of alluvial diagrams is that
they can become very difficult to follow as the numbers of nodes and edges
increase (in terms of the hypothetical example represented in Figure 3.21, we
could add several more candidates to each election and/or add more elections
– the flow can become quite confusing). Thus, try to use them for relatively
simple flow patterns and always label as clearly as possible.

3.4.11.1 Minard’s Map

In 1869, Charles Minard, a French Civil Engineer and pioneer in the field of
data visualization, published a visual summary of the 1812 French invasion of
and retreat from Russia that combined elements of time-series charts, alluvial
diagrams, and choropleth maps. I’ll let Tufte describe it:

…the classic of Charles Minard (1781-1870), the French Engineer,
shows the terrible fate of Napoleon’s Army in Russia. Described by E.J.
Marey as seeming to defy the pen of the historian by its brutal
eloquence,7 this combination of data map and time-series, drawn in
1869, portrays a sequence of devastating losses suffered in Napoleon’s
Russian campaign of 1812. Beginning at left on the Polish-Russian
border near the Niemen River, the thick tan flow-line shows the size of
the Grand Army (422,000) as it invaded Russia in June 1812. The width
of this band indicates the size of the army at each place on the map. In
September, the army reaches Moscow, which was by then sacked and
deserted, with 100,000 men. The path of Napoleon’s retreat from
Moscow is depicted by the darker, lower band, which is linked to a
temperature scale and dates at the bottom of the chart. It was a bitterly
cold winter, and many froze on the march out of Russia. As the graphic
shows, the crossing of the Berezina River was a disaster, and the army
finally struggled back into Poland with only 10,000 men remaining. Also
shown are the movements of auxiliary troops, as they sought to protect
the rear and the flank of the advancing army. Minard’s graphic tells a
rich, coherent story with its multivariate data, far more enlightening than
just a single number bouncing along over time. Six variables are plotted:
the size of the army, its location on a two-dimensional surface, direction
of the army’s movement, and temperature on various dates during the
retreat from Moscow…
It may well be the best statistical graphic ever drawn.

– from The Visual Display of Quantitative Information, Second Edition, p. 40

Figure 3.22: Charles Minard’s Time-Series Alluvial Choropleth Map of the French Invasion of and Retreat From Russia

3.4.12 Small Multiples

The last category of data visualization tools we will cover on this page is
small multiples. Small multiples are arrangements of small versions of the
same visualization that vary by levels of a variable or variables. The idea is
that the reader can view the entire arrangement of figures at once in order to
easily make comparisons. Figure 3.23 is an example of the use of small
multiples: each of the 12 miniature figures is a histogram of random values
drawn from the same class of frequency distribution (namely, the χ²
distribution). The specifications of the random draws differ only by the
single parameter that determines the shape of a given χ² distribution (the
degrees of freedom, abbreviated df). In this case, we can easily follow how
the distribution migrates to the right and becomes decreasingly skewed as the
df increases thanks to the arrangement of the small multiples of histograms.
ggplot(small.multiple.df, aes(Values))+
geom_histogram()+
theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
labs(title=bquote(Random~Samples~From~Various~chi^2~Distributions))+
facet_wrap(~df, labeller = label_bquote(cols=italic(df)==.(df)))

Figure 3.23: An Example of Small Multiples

3.5 Closing Remarks


There are many more ways of visualizing data. The examples discussed here
cover a lot of ground, but this is by no means an exhaustive list. As with all
scientific tools, I encourage you to use methods that best tell the story that you
want to tell about your data rather than to let the tools you are familiar with
tell the story for you.

And now, to end the page in grand fashion, I give you…

3.6 The Worst Data Visualization Ever Made


1. There is a rich history of hand-drawn data visualization. Data science
before computers was not limited by the options available in software
packages but it was much harder to make images that were proportional
to data, thus, there were fewer visualizations but some of them are super-
creative. One example is the famous map made by Charles Minard in
1869 discussed below. I can’t recommend enough the recently-published
W.E.B. Du Bois’s Data Portraits: Visualizing Black America, an
absolutely spectacular and historiographically essential collection of
W.E.B. Du Bois’s sociological analysis of Black Americans at the turn of
the 20th century.↩

2. Removing the legend also gives you more space for the data, which is
usually desirable in and of itself.↩
3. Sans serif fonts are the data-ink minimizers of the typeface world↩

4. post hoc ergo propter hoc, roughly translated, means “after the thing,
therefore because of the thing”↩

5. The bars in a forestplot are a little like error bars, but…well, interval
estimates are a whole thing and we’ll talk about them later↩

6. Alluvial refers to a river delta, which such plots can resemble. Sankey was
an engineer who used plots to visualize the movement of fluids. The
plots also look like rivers or ribbons to people.↩

7. E.J. Marey, La méthode graphique (Paris, 1885), p. 73. For more on
Minard, see Arthur H. Robinson, “The Thematic Maps of Charles Joseph
Minard,” Imago Mundi, 21 (1967), 95–108.↩
4 Probability Theory
4.1 Probability, Statistics, and Scientific Inquiry
Let’s do a thought experiment. Please imagine that I am a charming rapscallion who has drawn international curiosity
for my boastful claims that I can read minds and that you are a skeptical scientist interested in investigating my
abilities or lack thereof. Suppose I offer to prove my telepathy by asking you to draw from a well-shuffled deck of
52 standard playing cards: you are to draw a card, to look at it, and not to show me what it is.

Figure 4.1: your card

In this hypothetical scenario, your card is the five of hearts.

Now suppose I make one of the following claims:

1. You have drawn either a red card or a black card.

2. You have drawn a red card.

3. You have drawn a heart.

4. You have drawn the five of hearts.

All of those claims are equally correct. The five of hearts – which again, is your card – is either red or black, is red,
is a heart, and is the five of hearts. But as we proceed from claim 1 down to claim 4, the statements probably seem
more convincing to you. The reason we may be convinced by, say, claim 4 but not claim 1 is that each claim differs not in
whether it is correct but in how bold it is. The bolder claims offer more details; with
more details, the probability of being correct by guessing goes down. Let’s evaluate the probability that I could
correctly make each claim if I didn’t have the power to read minds:

4.1.0.0.1 You have drawn either a red card or a black card.

All playing cards are either red (hearts or diamonds) or black (clubs or spades). So this claim has to be true. The
probability of getting this right is 100%. This one is dumb. Let’s move on.

4.1.0.0.2 You have drawn a red card.


Half of the playing cards in a deck are red (hearts and diamonds). If I were guessing at random, I would have a 50%
chance of getting this one right. That’s not terribly impressive but it’s slightly better than the first claim.

4.1.0.0.3 You have drawn a heart.

One-fourth of the playing cards in a deck are hearts, so the probability of getting this right by guessing randomly is
25%. That might make you say something like, “hey, good guess,” but it’s probably not enough to make you think that
I have mutant powers. Next!

4.1.0.0.4 You have drawn the five of hearts.

There’s only one five of hearts in the deck. Given that there are 52 cards in a standard deck, the probability of
randomly guessing any individual card and matching the one that was drawn is therefore 1/52: a little less than 2%.
Since there are 51 cards in the deck that are not the five of hearts, the probability that I am wrong is 51/52: a little
more than 98%. In other words, it is 51 times more likely that I would guess incorrectly by naming a single, specific
card than that I would guess correctly. That’s fairly impressive – at this point one might be wondering if it’s a magic
trick or I am otherwise cheating or possibly that I might have an extra sense, but either way one might start ruling out
the possibility that I am guessing purely at random.
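The four guessing probabilities are easy enough to lay out side by side; a quick sketch in R:

#probability of making each claim correctly by pure guessing
claims<-c("red or black" = 52/52,
          "red" = 26/52,
          "a heart" = 13/52,
          "the five of hearts" = 1/52)
round(claims, 4)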

Now let’s say that I have made claim 4 – that you have drawn the five of hearts – and that I am correct. You, still the
scientist in this scenario, skeptical though you may be, decide to share your assessment of my unusual (and, to be
clear, fictional) talent with the world. You write up the results of this card-selecting-and-guessing experiment and
have it published in an esteemed scientific journal. In that article, you say that I have the ability to read minds, or, at
least, the part of minds that store information about recently-drawn playing cards. But, you include the eminently
reasonable caveat that you could be wrong about that; it may be a false alarm. In fact, you know that if I were a
fraud and I were merely guessing, that given the assumptions that the deck was standard and every card was equally
likely to be drawn, the likelihood of me being right was 1/52.

On the other hand, there is no way to be 100% certain that I have the ability to read minds. Maybe you’re not
satisfied with the terms of our little experiment and you think they would lead to false alarms too often for your
liking. So, maybe we do the same thing with two decks of cards. The probability of me naming your card from one
deck and repeating with another deck is much smaller than performing the feat with just one deck (we’ll learn about
this later on, but it’s 1/52 × 1/52 = 1/2,704, or about 0.04%). As we add more decks to the experiment, the probability of
correctly guessing one card from each one approaches zero – but never equals zero.
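To see how quickly that probability shrinks, here is a one-line sketch of the chance of pulling off the trick with 1 through 5 decks:

#probability of guessing one specific card from each of k independent decks
k<-1:5
(1/52)^k #approaches zero as k grows, but never reaches it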

There is one other implication to this thought experiment that I would like to point out. Often it is not practical,
feasible, or possible to replicate our scientific studies. Let’s say we only have one shot at our mind-reading
experiment: we only have one deck of cards, and for some reason, we can only use it once (so, no replacing the card,
reshuffling, and trying again). If we would not consider identifying a single card correctly as sufficient evidence of
telepathy – and, to be sure, a 1/52 chance is by no means out of the realm of possibility – there is a way to
incorporate our skepticism into the evaluation of the probability of true telepathy with the observation of card
identification. We’ll talk about this in the section on Bayesian Inference.

As you (back to real you, no longer the hypothetical you of our fantastical scenario) may have inferred, this thought
experiment is meant to serve as an extended metaphor for probability, statistics, and scientific inquiry. We are nearly
never certain about the results of our studies: there is always some probability involved, be it a probability that our
hypotheses are correct and supported by the data, a probability that our hypotheses are correct but not supported by
the data, a probability that our hypotheses are incorrect and not supported by the data, and/or a probability that our
hypotheses are incorrect but supported by the data anyway.

Statistics are our way to assess those probabilities. Not every investigation is based on knowable and calculable
probabilities like our example of guessing cards – in fact, almost all of them are not. There are many, many statistical
procedures precisely because most scientific studies are not associated with relatively simple probability structures
like there is a one-out-of-fifty-two chance that the response is just a correct guess.
So: first, we are going to learn the rules of probability and we are going to use a lot of examples like decks of cards
and marbles and jars and coin flips (SO MANY COIN FLIPS) to help us. As we proceed on to learning more
statistical procedures, the fundamental goal will be the same: understanding the probability that the thing that we are
observing is real.

4.2 The Three Kolmogorov (1933) Axioms


4.2.1 1. Non-negativity

The probability of an event1 must be a non-negative real number.

p(A) ∈ R; p(A) ≥ 0

In other words, a probability value can’t be negative. I might say that there is a −99% chance that I am going to go to a
voluntary staff meeting at 6 am, but that is just hyperbole and is not physically possible (the probability is actually
0). Imaginary numbers are impossible for probabilities, too, so if at any point in this semester you find yourself
answering a probability problem with p = √−0.87, please check your work.

4.2.2 2. Normalization

The sum of the probabilities of all possible mutually exclusive events in a sample space2 is 1.

So, as long as multiple events can’t happen at the same time, the sum of the probabilities of all of the events is one.

p(Ω) = 1

4.2.3 3. Finite Additivity

The probability that one or the other of two mutually exclusive events3 occurs is equal to the sum of the probabilities of
each event:

if p(A ∩ B) = 0, then p(A ∪ B) = p(A) + p(B)

and since the principle extends to more than two mutually exclusive events, we could also extend the above equation
to p(A ∪ B ∪ C), and p(A ∪ B ∪ C ∪ D), etc.

This axiom allows us to add the probabilities of mutually exclusive events directly: for example, a coin can’t land
heads and tails on the same flip, so the probability of heads or tails is 0.5 + 0.5 = 1.
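All three axioms can be checked numerically for a simple sample space; here is a minimal sketch using a fair six-sided die:

#equal probabilities for the six faces of a fair die
omega<-rep(1/6, 6)

all(omega >= 0)                  #non-negativity
isTRUE(all.equal(sum(omega), 1)) #normalization (all.equal avoids floating-point trouble)
omega[1] + omega[2]              #finite additivity: p(1 or 2) = 1/6 + 1/6 = 1/3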

4.3 Methods of Assigning Probability


There are several methods of assigning a probability value to an event. All methods agree on the probability
assigned to heads vs. tails in a flip of a fair coin, but they may disagree on the probabilities of other events. We can
largely categorize these methods as objective – methods that depend solely on mathematical calculations – or
subjective – methods that take into account what people believe about events.

4.3.1 Objective Methods

4.3.1.1 Equal Assignment


In the equal assignment of probability, each elementary event in a sample space is considered equally probable. This
is the method used for games of chance, including coin flips, dice rolls, and card games. So, when we say that the
probability of heads is 0.5, or the probability of rolling a 7 with a pair of dice is 1/6, or the probability of drawing
an ace from a shuffled deck is 1/13, we are assigning equal probability to each possible elementary outcome.
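The 1/6 figure for rolling a 7, for instance, comes from counting the 36 equally likely ordered outcomes of two dice, which takes two lines to verify in R:

#all 36 equally likely totals of two six-sided dice
totals<-outer(1:6, 1:6, "+")
mean(totals == 7) #6 of the 36 outcomes sum to 7, so this is 1/6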

4.3.1.2 Relative Frequency

Relative frequency theory bases the probability of something happening on the number of times it’s happened in the
past divided by the number of times it could have happened. For example, imagine that you had never seen or heard
of a fair coin before, let alone thought to flip one, and did not know that there were two equally probable outcomes
to a coin flip. Without being able to assign equal probabilities, you could come to understand the probability of
heads and tails by flipping the coin (presumably after being shown how to do so) many times. After enough flips, the
relative frequencies of heads and tails would converge toward equal proportions.
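That convergence is easy to watch in a simulation; here is a minimal sketch in which the “coin” is an rbinom() draw with heads coded as 1:

set.seed(77)
flips<-rbinom(10000, size = 1, prob = 0.5) #1 = heads, 0 = tails

#running proportion of heads after each flip
running.proportion<-cumsum(flips)/seq_along(flips)

#early proportions bounce around; later ones settle near 0.5
running.proportion[c(10, 100, 1000, 10000)]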

4.3.2 Subjective Probability

A subjective probability is a statement of the degree of belief in the probability of events. Any time we put a number
to the chances of an event in our personal lives – the probability that we remembered to lock the door when we left
home, the probability that we will finish a paper by Friday, the probability that we will feel less-than-perfect after a
night of revelry – we are making a statement of subjective probability.

Subjective probability is not based on wild guesses – when we make subjective probability statements about our
own lives they tend to be based on some degree of self-awareness – and in science, subjective probability judgments
tend to incorporate data and some objective measures of probability as well. Meteorologists are a classic example:
a statement like there is a 40% chance of rain is based on a subjective interpretation of weather models that
incorporate terabytes of data on things like climate, atmospheric conditions, prior weather events, etc. A statistician
who uses subjective methods may use their judgment in choosing probabilistic models and starting points, but
ultimately makes their decisions based on objective data.

4.4 Intersections and Unions


The distinction between the conjunctions and and or has important implications in interpersonal communication: if
somebody asks what you want for dessert, whether you reply with I want cake OR pie or I want cake AND pie may have
serious consequences for the amount of food you get. The same is true for determining the probability of multiple
events.

4.4.1 Intersections
In probability theory (as well as in formal logic and other related fields) the concept of co-occurring events, e.g.,
event A and event B happening, is known as the intersection4 of those events and is represented by the symbol ∩
(read “cap”). The probability of event A and event B happening is thus equivalently expressed as the intersection probability

of A and B, as p(A) ∩ p(B), and as p(A ∩ B).

The intersection probability of two events A and B is given by the equation:

p(A ∩ B) = p(A)p(B|A)

In that equation, we have introduced a new symbol: |, meaning given (see sidebar for a collection of symbols for
expressing probability-related concepts)5. The full equation can be expressed as the probability of A and B is
equal to the product of the probability of A and the probability of B given A. In the context of probability theory,
given means given that something else has happened. Generally speaking, a probability that depends on a given
event is a conditional probability6. We will discuss conditional probability at length in the section not-
coincidentally named Conditional Probability.
To illustrate the intersection probability of dependent events: please imagine a jar with two marbles in it – one
blue marble and one orange marble – and to make the imagining easier, here is a picture of one blue marble and one
orange marble both floating mysteriously in a poorly-rendered jar:

Figure 4.2: A Poorly-rendered Jar

Let’s say that we propose some defined experiments7 regarding pulling a marble out of the jar (without peeking into
the jar, of course). The first is what is the probability of drawing a blue marble out of the jar? Since there are two
marbles in the jar, and, assuming that one draws one of them (I suppose one could reach in and miss both marbles
and come out empty-handed, but let's ignore that possibility for the purposes of this exercise), we may intuit that the
probability of drawing one specific marble is 1/2, or 0.5, or 50%.8 More formally, we can define the sample space
as:

Ω = {Blue, Orange}

since the probability of the sample space is 1 (thanks to the axiom of normalization) and the probability of either of
these mutually exclusive events is the sum of the probabilities of each individual event (thanks to the axiom of finite
additivity):

p(Ω) = 1 = p(Blue) + p(Orange)

Assuming equal probability of each event (which is reasonable if we posit that there is no way to distinguish one
marble from the other just by reaching into the jar without looking), then algebra so simple that we will leave out the
steps tells us that:

p(Blue) = p(Orange) = 1/2

Thus, in this defined experiment, the probability of drawing a blue marble out of the jar in a single draw is 1/2.

Let's define another experiment: we reach into the jar twice and pull out a single marble each time. Now, we have a
defined experiment with two trials.9 What is the probability of drawing the blue marble twice? That probability
depends on one important consideration: do we put the marble from the first draw back into the jar before drawing
again? If the answer is yes, then the two trials are the same:
Figure 4.3: Sampling with Replacement

If we put whatever marble we draw on the first trial back into the jar before the second trial, then – in more formal
terms – we sample with replacement10 from the jar. That means that whatever event is observed in the first trial –
whichever marble was chosen, in this example – has no effect on the probabilities of each event in the second trial –
whichever marble is chosen second, in this example. If we put the first marble back in the jar, then the probability of
choosing a blue marble on the second trial is the same as it was on the first trial and it does not matter which marble
was selected in the first trial. The events drawing a marble in Trial 1 and drawing a marble in Trial 2 are therefore
independent events.11 By definition, if events A and B are independent, then the probability of event A happening
given that event B has occurred is exactly the same as the probability of event A happening – since they are
independent, the fact that B may or may not have happened doesn't matter at all for A and vice versa:

A and B are independent if and only if p(A|B) = p(A) and p(B|A) = p(B)

Now, we can modify the equation for determining the probability of A ∩ B for independent events:12

if p(B|A) = p(B), p(A ∩ B) = p(A)p(B)

And back to our marble example: if we are sampling with replacement, then we can call blue on the first draw event
A and blue on the second draw event B, the probability of each is 1/2, and:

p(Blue trial 1 ∩ Blue trial 2) = p(Blue trial 1)p(Blue trial 2) = (1/2)(1/2) = 1/4.

Now let’s consider the other possible sampling method: what if we don’t put the first marble back in the jar? In that
case, we have sampled without replacement13, and the probabilities associated with events in the second trial
depend on what happens in the first trial:
Figure 4.4: Sampling without Replacement

In this case, the probabilities of the events in Trial 2 are different given what happened in Trial 1 – they depend on
what happened on Trial 1 – and thus are considered dependent events.14 In the case of dependent events,
p(A|B) ≠ p(A) and p(B|A) ≠ p(B). So, based on the observed event in Trial 1, there are two possible sample

spaces for Trial 2. If the blue marble is drawn on Trial 1 and not replaced, then on Trial 2:

Ω = {Orange}; p(Blue Trial 2) = 0, p(Orange Trial 2) = 1

and if the orange marble is drawn in Trial 1 and not replaced, then on Trial 2:15

Ω = {Blue}; p(Blue Trial 2) = 1, p(Orange Trial 2) = 0.

So: what is the probability of drawing the blue marble twice without replacement? We can know – and Figure 4.4
can help illustrate – that it is impossible: if we take out the blue marble in Trial 1 and don't put it back in, there is no
way to draw a blue marble in Trial 2. The probability is 0. The math backs this up: the second draw depends on the
result of the first draw, and the probability of drawing a blue marble in Trial 2 given that a blue marble was drawn
in Trial 1 without replacement is 0. Using the formula for and probabilities16:

p(Blue trial 1 ∩ Blue trial 2) = p(Blue trial 1)p(Blue trial 2|Blue trial 1) = (1/2)(0) = 0.

4.4.1.1 Intersections of more than two events

The logic of intersection probability for two events holds for the intersection probabilities for more than two events,
although the equations we use to evaluate them look more complex. The intersection probability of two events is
given by the product of the probability of the first event and the probability of the second event given the first
event. The intersection probability of three events is given by the product of the probability of the first event, the
probability of the second event given the first event, and the probability of the third event given the first event and
the second event:

p(A ∩ B ∩ C) = p(A)p(B|A)p(C|A ∩ B).

The intersection probability of four events is given by the product of the probability of the first event, the
probability of the second event given the first event, the probability of the third event given the first event and the
second event, and the probability of the fourth event given the first event and the second event and the third event:

p(A ∩ B ∩ C ∩ D) = p(A)p(B|A)p(C|A ∩ B)p(D|A ∩ B ∩ C)


and so on. Please note that the names that we give to events – A, B, C , etc. – are arbitrary. When assessing the
probability of multiple events, we can give any letter name to any event and our designations by themselves are not
that important. What is important is understanding which events depend on which. It’s helpful to use letter labels to
write out equations in a general sense, but when doing actual problems, I recommend using more descriptive labels
(as in: full or abbreviated names of the events themselves) so that it’s easier to keep track of the relationships
between the events.

To help us understand the intersection of more than two events, first let’s look at a case where all of the events are
independent of each other. In this example there are four events: A, B, C , and D. Because all of the events are
independent of each other, then by the definition of independent events: p(B|A) = p(B), p(C|A ∩ B) = p(C), and
p(D|A ∩ B ∩ C) = p(D). Thus, the equation for the intersection probability of these events simplifies to:17

p(A ∩ B ∩ C ∩ D) = p(A)p(B)p(C)p(D)

This example is a personal favorite of mine, because I think it links a somewhat complex intersection probability
with a more intuitive understanding of probability as x chances in y. It’s about gambling.18 Specifically, it’s about
one of the lottery games run by the Commonwealth of Massachusetts: The Numbers Game. In that game, there are
four spinning wheels (pictured in Figure 4.5), each with 10 slots representing each digit from 0 – 9 one time, and
there is a ball placed in each wheel. The wheels spin for a period of time, and when they stop spinning, the ball in
each wheel comes to rest on one of the digits.

Figure 4.5: The Massachusetts Lottery Numbers Game, known in Boston-area locales as THE NUMBAH

The result is a four-digit number, and to win the jackpot, one has to pick all four digits in the correct order.19 Each
wheel is equally likely to land on each of the 10 digits (0 – 9), and each wheel spins independently so that the
outcome on each wheel is literally independent of the outcomes on any of the other wheels. Thus, the probability of
picking the correct four digits in order is the intersection probability of picking each digit correctly. In other
words: it’s the probability of picking the first digit correctly and picking the second digit correctly and picking the
third digit correctly and picking the fourth digit correctly:

p(jackpot) = p(1st digit ∩ 2nd digit ∩ 3rd digit ∩ 4th digit).

Since we know that each digit is an independent event, we need not concern ourselves with conditional
probabilities: the probability of picking each digit correctly is 1/10, and is exactly the same regardless of any of the
other digits (symbolically: p(2nd digit) = p(2nd digit|1st digit), p(3rd digit) = p(3rd digit|1st digit ∩ 2nd digit), and
p(4th digit) = p(4th digit|1st digit ∩ 2nd digit ∩ 3rd digit)). So, the probability of the jackpot is:

p(jackpot) = p(1st digit)p(2nd digit)p(3rd digit)p(4th digit)


= (1/10)(1/10)(1/10)(1/10) = 1/10000.

Thus, the probability of picking the winning series of numbers is 1/10000: if you bought a ticket for the numbers
8334, then there's a 1/10000 probability that that set of numbers comes up; if you bought a ticket for the numbers
5782, there's a 1/10000 chance that that number wins as well.20


Which brings us to the reason why I like this example so much: the combinations of digits in the numbers game
perfectly resemble four-digit numbers (although sometimes they have leading zeroes). The possible combinations of
digits in the game go from 0000 to 9999. How many integers exist between 0 and 9,999, 0 included? Exactly as many
as between 1 and 10,000: 10,000. If, instead of asking you to wager on four digits in a particular order, you were
asked to pick a number between 0 and 9,999, what would your probability be of choosing the right one? It would be
1/10000 – the same result as we got above. Which is nice.

Another fun feature of using The Numbers Game as an example is that since each of the wheels is the same, spinning
all of the wheels at the same time gives the same expected outcomes as spinning one of the wheels four times would
– it would just take longer to spin one four times – so effectively it's an example of sampling with replacement.

Now let’s examine how intersections of more than two events work when events are dependent with an example of
sampling without replacement. For this example, we’ll talk about playing cards again. Let’s say that you are dealt
four (and only four) cards from a well-shuffled deck of ordinary playing cards.21 What is the probability that you
are dealt four aces? Please note: since you are being dealt these cards, they are not going back into the deck: this is
sampling without replacement.

There are four aces in a deck of 52 playing cards, so the probability of being dealt an ace on the first draw is 4/52.
If you are dealt an ace on the first draw (if you aren’t dealt an ace on the first draw, the probability of getting four
aces in four cards is zero so that doesn’t matter), then there will be three aces left in a deck of 51 cards, so the
probability of being dealt an ace on the second draw will be 3/51. If you are dealt aces on each of the first two
draws, then there will be two aces left in a deck of 50 cards, so the probability of being dealt a third ace will be
2/50. Finally, if you are lucky enough to be dealt aces on each of the first three draws, then there will be one ace
left in a deck of 49 cards, so the probability of being dealt a fourth ace will be 1/49. We can express that whole last
paragraph in math-symbol terms like:

p(Ace first) = 4/52

p(Ace second|Ace first) = 3/51

p(Ace third|Ace first ∩ Ace second) = 2/50

p(Ace fourth|Ace first ∩ Ace second ∩ Ace third) = 1/49

Being dealt four aces out of four cards is equivalent to saying being dealt an ace on the first draw and being dealt
an ace on the second draw and being dealt an ace on the third draw and being dealt an ace on the fourth draw:
it's the intersection probability of those four related events. Using the equation
p(A ∩ B ∩ C ∩ D) = p(A)p(B|A)p(C|A ∩ B)p(D|A ∩ B ∩ C) and substituting the probabilities of each
Ace-drawing event outlined above, the probability of four consecutive aces from a well-shuffled 52-card deck is:

p(4 Aces) = (4/52)(3/51)(2/50)(1/49) = 24/6497400 ≈ 0.00000369

which is a very small number. Drawing four consecutive aces is not very likely to happen. And yet: it's exactly as
likely as drawing four consecutive jacks or four consecutive 9's, and 24 times more likely than any four specific
cards in a specific order (like, the queen of spades and then the five of hearts and then the 10 of clubs and then the 2
of diamonds).22

I noted above that keeping track of which events may depend on others is important (far more important than which
event you call A and which event you call B). The examples of the lottery game (independent events) and of dealing
aces (dependent events) are relatively simple ones. It can get extremely complicated to keep track of not only what
depends on what but also the ways in which those dependencies change sample spaces – that is: what the probability
of one event is given other events. Probability trees are visualizations of relationships between events that I find
very helpful for both keeping track of things and for calculating complex probabilities, and we will get to those, but
first, we need to talk about unions.
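Before we do, here is a quick numerical check of the two intersection examples above – the independent lottery digits and the dependent aces – sketched in Python with exact fractions (the variable names are mine):

# Intersection probabilities, computed exactly with fractions.
from fractions import Fraction

# Independent events (the Numbers Game): probabilities multiply straight across.
p_jackpot = Fraction(1, 10) ** 4
print(p_jackpot)  # 1/10000

# Dependent events (four aces): each factor is conditional on the prior draws.
p_four_aces = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)
print(p_four_aces, float(p_four_aces))  # 1/270725 (= 24/6497400) ≈ 0.0000037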

4.4.2 Unions

The occurrence of at least one of a set of events, e.g., event A or event B happening, is known as the union23 of
those events. The probability of event A or event B happening is thus equivalently expressed as the union
probability of A and B, as p(A) ∪ p(B), and as p(A ∪ B).

We know already from the finite additivity axiom of probability that the total probability of mutually exclusive
events is the sum of the probabilities of each of the events. Now that we are more familiar with terminology and
symbology, we can write that axiom as:

if p(A ∩ B) = 0, p(A ∪ B) = p(A) + p(B)

where the term p(A ∩ B) = 0 represents the condition of A and B being mutually exclusive (and thus the
probability of A and B both happening being 0).

You may be wondering, "what happens if the events are not mutually exclusive?" and even if you are not
wondering that, I will tell you anyway. Let's say we are interested in the combined probability of two events A and
B that can co-occur – for example, the probability that, in two flips of a fair coin, the coin will turn up heads on at
least one flip. In that example, the coin could turn up heads on the first flip, turn up heads on the second flip, or turn
up heads on both flips. If we incorrectly treated the events Heads flip 1 and Heads flip 2 as mutually exclusive
events, we would incorrectly conclude that the probability of observing heads on either of two flips is
p(Heads flip 1) + p(Heads flip 2) = 0.5 + 0.5 = 1. That conclusion, in addition to being wrong, doesn't make
sense: it obviously is possible to observe 0 heads in two flips of a fair coin. If we want to get even more absurd
about it, we can imagine three flips of a fair coin: surely the probability of heads on at least one of three flips is not
p(Heads flip 1) + p(Heads flip 2) + p(Heads flip 3), because that would be 0.5 + 0.5 + 0.5 = 1.5 and thus violate
the axiom of normalization.

What is causing that madness of seemingly overinflated union probabilities? The union probability of A and B can
be thought of as the probability of A or B, but it can also be considered the total probability of A and B, which
means that it's the probability of:

1. A happening and B not happening (A ∩ ¬B),

2. B happening and A not happening (B ∩ ¬A), and

3. A and B happening (A ∩ B).

When we add the probabilities of events that are not mutually exclusive, we are effectively double-counting the
probability of A ∩ B. Consider the following problem: two people – Alice and Bambadjan (conveniently for the
math notation we have been using, we can call them A and B) – have consistent and different sleep schedules. Alice
goes to sleep every night at 10 pm and wakes up every morning at 6 am. Bambadjan goes to sleep every night at 4
am and wakes up every day at 12 pm. During any given 24-hour period, what is the probability that Alice is asleep
or Bambadjan is asleep?

The probability of the union of two events – whether the events are mutually exclusive or not – is given by
the equation:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

Let's visualize what this equation means by representing the sleep schedules of Alice (A) and Bambadjan (B) as
timelines:
Figure 4.6: The Sleep Schedules of Alice and Bambadjan

Hopefully, the figure makes it easy to see that if we are interested in the times that either Alice or Bambadjan is
asleep, then the time when their sleep overlaps does not matter. More than that: that overlap will mess up our
calculations. If we add up the times that (a) Alice is asleep, (b) Bambadjan is asleep, and (c) both Alice and
Bambadjan are asleep, we get 8 hours + 8 hours + 2 hours = 18 hours.

If you add to that result the 10 hours per day when neither Alice nor Bambadjan is asleep (between 12 pm and 10
pm), you get a 28-hour day, which is not possible on Planet Earth.
Figure 4.7: only 24 hours per day here

So, we have to account for the double-counting of the time when both are asleep – the intersection of event A (Alice
being asleep) and event B (Bambadjan being asleep). Using the above equation for union probabilities for
non-mutually-exclusive events:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B) = 8/24 + 8/24 − 2/24 = 14/24
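Here is the same calculation as a small Python sketch with exact fractions (names are mine, values from the example):

# p(A ∪ B) = p(A) + p(B) - p(A ∩ B), for the sleep-schedule example.
from fractions import Fraction

p_alice = Fraction(8, 24)   # Alice is asleep 8 of every 24 hours
p_bamba = Fraction(8, 24)   # Bambadjan is asleep 8 of every 24 hours
p_both = Fraction(2, 24)    # both are asleep from 4 am to 6 am

print(p_alice + p_bamba - p_both)  # 7/12, i.e., 14/24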

4.4.2.1 Unions of More than Two Events

When two events are mutually exclusive, calculating their union probability is simple: since there is no intersection
probability of the two events, the intersection term of the equation drops out and we are left with
p(A ∪ B) = p(A) + p(B). That simplicity holds when calculating the union probability of more than two mutually
exclusive events:

if p(A ∩ B) = p(A ∩ C) = p(B ∩ C) = p(A ∩ B ∩ C) = 0, then

p(A ∪ B ∪ C) = p(A) + p(B) + p(C).

For example, let’s consider the probability of a person’s birthday falling on one day or another. Assuming that the
probability of having a birthday on any one given day of the week is 1/7,24 what is the probability that a person’s
birthday falls on a Monday or on a Wednesday or on a Friday? These three events (1. birthday on Monday, 2.
birthday on Wednesday, 3. birthday on Friday) are mutually exclusive, so, the calculation for that probability is
straightforward:
p(Monday ∪ Wednesday ∪ Friday) = 1/7 + 1/7 + 1/7 = 3/7

And, of course, the probability that a person’s birthday falls on any day of the week is
1/7 + 1/7 + 1/7 + 1/7 + 1/7 + 1/7 + 1/7 = 1.

As you might imagine, the union probabilities of more than two events when some or all are not mutually exclusive
are a bit more difficult. Here is a diagram representing the union probability of two intersecting events; it is similar
to the one we used to examine the union probability of Alice and Bambadjan being asleep, but a little more generic:

Figure 4.8: Generic Representation of Union Probability of Two Intersecting Events


As we can see in Figure 4.8, when we add up the occurrence of two intersecting events (part A), we double-count
the intersection (part B), so we have to subtract the intersection from the sum to get the union of two events (part C).

In Figure 4.9, we add a third intersecting event C to intersecting events A and B. The ranges of outcomes for the
three events separately are represented in part A, and the four intersections between the events – A ∩ B, A ∩ C,
B ∩ C, and A ∩ B ∩ C – are shown in Figure 4.9, part B.

Figure 4.9: Generic Representation of Union Probability of Three Intersecting Events

In Figure 4.9, part C, we see what happens when we add up the events and subtract the pairwise intersections A ∩ B
, B ∩ C , and A ∩ C : we end up under-counting the three-way intersection A ∩ B ∩ C . That happens because the
sum of A and B and C has two areas of overlap in the A ∩ B ∩ C section (that is, in that area, you only need one of
the three but you have three, so, two are extra) and when we subtract the three pairwise intersections from the sum –
which we have to do to account for the double-counting between pairs of events – we take three away from that
section, leaving it empty. Thus, we have to add the three-way intersection (the blue line in part C) back in to get the
whole union probability shown in part D.

Thus, the general equation for the union probability of three events is:
p(A ∪ B ∪ C) =

p(A) + p(B) + p(C)

−p(A ∩ B) − p(A ∩ C) − p(B ∩ C)

+p(A ∩ B ∩ C)

and25 for any combinations of those events that are mutually exclusive, the intersection terms for those combinations
are 0 and drop out of the equation.
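If it helps to see the three-event version as code, here is a minimal sketch (the function name and the sanity check are mine; it assumes you already know all of the intersection probabilities):

# Inclusion-exclusion for three events: add the singles, subtract the
# pairwise intersections, add the three-way intersection back in.
def union3(pA, pB, pC, pAB, pAC, pBC, pABC):
    return pA + pB + pC - pAB - pAC - pBC + pABC

# Sanity check with three independent fair-coin 'heads' events:
# p(at least one heads in 3 flips) should be 1 - (0.5)**3 = 0.875.
print(union3(0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.125))  # 0.875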

Just like knowing which events depend on which other events and how in calculating intersection probabilities can
be tricky, keeping track of which events intersect with each other and how in calculating union probabilities can also
be tricky, and keeping track of all of those things at the same time can be multiplicatively tricky. In the next section,
we will talk about a visual aid that can help with all of that trickiness.

4.5 Expected Value and Variance


The terms Expected Value and Variance were introduced in the page on categorizing and summarizing information.
Both terms apply to general probability theory as well.

The expected value of a defined experiment is the long-term average outcome of that experiment over multiple
iterations. It is calculated as the mean outcome weighted by the probability of the possible outcomes. If xᵢ is the
value of each event i among N possible outcomes in the sample space, then:

E(x) = ∑ᵢ₌₁ᴺ xᵢ p(xᵢ)

For example, consider a roll of a six-sided die. The sample space for such a roll is Ω = {1, 2, 3, 4, 5, 6}: thus,
x₁ = 1, x₂ = 2, x₃ = 3, x₄ = 4, x₅ = 5, and x₆ = 6. The probability of each event is 1/6, so
p(x₁) = p(x₂) = p(x₃) = p(x₄) = p(x₅) = p(x₆) = 1/6. The expected value of a roll of a six-sided die is
therefore:

E(x) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 21/6 = 3.5.

Note that you can never roll a 3.5 with a six-sided die! But, in the long run, that’s on average what you can expect.

Next, let's consider a roll of two six-sided dice. Each die has a sample space of Ω = {1, 2, 3, 4, 5, 6}, so the sample
space of the two dice combined (showing the sum of the two dice for each combination) is:

           Die 2
Die 1   1   2   3   4   5   6
1       2   3   4   5   6   7
2       3   4   5   6   7   8
3       4   5   6   7   8   9
4       5   6   7   8   9   10
5       6   7   8   9   10  11
6       7   8   9   10  11  12

The expected value of a roll of two six-sided dice is the sum of the values in the table (it's 252) divided by the total
number of possible outcomes (36), or 7.

We could also arrive at the same number by defining the sample space as Ω = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and
noting that there are more ways to arrive at some of those values than others. The number 7, for example, can happen
6 different ways, and so p(7) = 6/36 = 1/6.26 Multiplying each possible sum of the two dice by their respective
probabilities gives us:

E(x) = 2(1/36) + 3(2/36) + ... + 12(1/36) = 252/36 = 7.

The variance of a defined experiment is given by:

V(x) = ∑ᵢ₌₁ᴺ (xᵢ − E(x))² p(xᵢ)

which is a way to describe the spread of possible outcomes in a defined experiment.
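Here is a short sketch applying both formulas to the single six-sided die (exact fractions; variable names mine):

# Expected value and variance of one roll of a fair six-sided die.
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                           # equal probability per face

E = sum(x * p for x in outcomes)             # 7/2, i.e., 3.5
V = sum((x - E) ** 2 * p for x in outcomes)  # 35/12, about 2.92
print(E, V)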

4.5.1 Probability Trees

Probability Trees, also known as Tree Diagrams, are visual representations of the structure of probabilistic events.
There are three elements to a probability tree:27

1. Node: a point on the diagram indicating a trial.

2. Branch: the connection between nodes and events; we indicate on branches the probability of the events for a
given trial.

3. Event (or Outcome): an event (or outcome). Given that we’ve already defined event (or outcome), this
definition is pretty straightforward and a little anticlimactic.

Figure 4.10 is a probability tree for a single flip of a fair coin with labels for the one node, the two branches, and the
two possible outcomes of this trial.

Figure 4.10: Probability Tree Depicting One Flip of a Fair Coin

We can think of the node as a starting point for the probabilistic paths that lead to events, with the probability of
taking each path written right on the branches. In the case of a coin flip, once the coin is flipped (the node), we can
see from the diagram that there is a probability of 0.5 that we will take the path to heads and a probability of 0.5 that
we will take the path that ends in tails. Using a tree diagram to map out the probabilities associated with a single
coin flip is perfectly legitimate, but not terribly necessary – it’s a pretty simple example. Probability trees can be –
and usually are – more complex (but tend to be less complex than algebraic representations of the same problems,
otherwise, that would defeat the point of using a tree to help us understand the math). A node can take on any number
of branches, and each event can, in turn, serve as a node for other branches and events. For example, in Figure 4.11,
we have a probability tree representation of two flips of a fair coin.

Figure 4.11: Probability Tree Depicting Two Flips of a Fair Coin

Connected branches represent intersections. In Figure 4.11, the path that goes from the first node to Heads and then
continues on to Heads again represents the probability p(Heads flip 1 ∩ Heads flip 2). The sums of probabilities of
events are unions. The event of getting exactly one heads and one tails in two flips is represented in two ways in
Figure 4.11: one could get heads first and then tails or get tails first and then heads. We can also see from the
diagram that these events are mutually exclusive (you can't get both heads and tails on a single flip), so, the
probability of one heads and one tails in two flips is the sum of the intersection probabilities
p(Heads flip 1 ∩ Tails flip 2) and p(Tails flip 1 ∩ Heads flip 2):

p(1 Heads, 1 Tails) = p((Heads flip 1 ∩ Tails flip 2) ∪ (Tails flip 1 ∩ Heads flip 2)) = (0.5)(0.5) + (0.5)(0.5) = 0.5.

In general, then, the rule for working with probability trees is multiply across, add down.

The probabilities on the branches of tree diagrams can take on any possible probability value and therefore can be
adjusted to reflect conditional probabilities. For example: let’s say there is a jar with 3 marbles in it: 1 red marble,
1 yellow marble, and 1 green marble. If, without looking into the jar, we took one marble out at a time without
replacement, then the probability of drawing each marble on each draw would be represented by the tree diagram in
Figure 4.12.
Figure 4.12: Probability Tree Depicting Sampling Without Replacement

We can use the diagram in Figure 4.12 to answer any probability questions we have about pulling marbles out of the
jar. What is the probability of pulling out marbles in the order Red, Yellow, Green? We can follow the topmost path
to find out: the probability of drawing Red first is 1/3. Given that we have drawn Red first, the probability of
drawing Yellow next is 1/2. Finally, given that Red and Yellow have been drawn, the probability of drawing Green
on the third draw is 1. Multiplying across the path (as we do), the probability of drawing marbles in the order Red,
Yellow, Green is (1/3)(1/2)(1) = 1/6. What is the probability of drawing Yellow last? That is given by adding
the products of the two paths that end in Yellow: adding down (as we do), the probability is
(1/3)(1/2)(1) + (1/3)(1/2)(1) = 2/6 = 1/3. We can use the tree to examine probabilities before the
terminal events as well. For example: what is the probability of drawing Green on the first or second draw? We can
take the probability of Green on draw 1 – 1/3 – and add the probabilities of drawing Red on the first draw and
Green on the second draw – (1/3)(1/2) = 1/6 – and of drawing Yellow first and Green second – also
(1/3)(1/2) = 1/6 – and we get (1/3) + (1/6) + (1/6) = 4/6 = 2/3.
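Because all six orders of the three marbles are equally likely, the tree's answers can also be brute-force checked by enumeration – here is a sketch (purely illustrative):

# Enumerate every equally likely draw order and count the ones we care about.
from itertools import permutations

orders = list(permutations(["Red", "Yellow", "Green"]))  # 6 equally likely orders

print(sum(o == ("Red", "Yellow", "Green") for o in orders) / len(orders))  # 1/6
print(sum(o[2] == "Yellow" for o in orders) / len(orders))                 # 1/3
print(sum("Green" in o[:2] for o in orders) / len(orders))                 # 2/3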

4.6 Elementary and Compound (or Composite) Events

Figure 4.13: Jean-Baptiste le Rond d’Alembert.

Jean d’Alembert (1717 – 1783) was a French philosophe and mathematician. He was a co-editor (with Denis
Diderot) of the Encyclopedia and contributed to it over a thousand articles, one of which was Heads or Tails, an
essay about games of chance. In that article, he gives a sophisticated and ahead-of-its-time analysis of The
St. Petersburg Paradox, one of the critical probability problems in the field of behavioral economics. However, that
article isn’t known for that at all. It’s known for one of the most famous errors in probabilistic reasoning ever made
by a big-deal mathematician in print.

Here’s why d’Alembert gets d’unked on. Say we’re playing a coin-flipping game (imagine it’s mid-18th-century
France and this is the kind of thing we do for fun) where I flip a coin twice and you win if it comes up heads on
either flip. d’Alembert argued that if you got heads on the first flip then the game would stop there, thus, there were
three possible outcomes of the game: heads on the first flip (and no second flip), tails on the first flip and heads on
the second flip, and tails on the first flip and tails on the second flip. In sample-space terms, d'Alembert was
describing the game as Ω = {H, TH, TT} and – here's the big problem – said that each of those outcomes was
equally probable (p = 1/3 each). We can pretty easily show why that conclusion is wrong (and we can refer to
Figure 4.11 to help): the probability of getting heads on the first flip is 1/2 regardless of what happens after that
flip,28 and the probabilities of tails-then-heads and of tails-then-tails are each 1/4.

I bring d'Alembert's error up for two reasons:

1. I find it comforting to know that a historically great mathematician can make mistakes, too, and I hope you do as
well, and
2. Without meaning to, d'Alembert conflated elementary events29 and composite (or compound) events30.

An elementary event is an event that can only happen in one specific way. If I were to ask what is the probability
that I (Dan, the person writing this right now) will win the gold medal in men's singles figure skating at the next
Winter Olympics?, then I would be asking about the probability of an elementary event: there is only one me, there is
only one such Olympic event, and there is only one next Winter Olympics. Also, that probability would be 0, but that's
beside the point (if it weren't for the axiom of nonnegativity, the probability of me winning a skating competition
would be negative – I can barely stand up on skates). If I were to ask what is the probability that an American wins
the gold medal in men's figure skating at the next Winter Olympics, I would be asking about the probability of a
composite event: there are multiple ways that could happen because there are multiple men in America, each of
whom represents a possible elementary event of winning the gold in that particular sport in that particular Olympics.

Back to our man d’Alembert and the problem of two coin flips: in two flips of a fair coin, we might say that there are
three possible outcomes:

1. Two heads (and zero tails)


2. One heads (and one tails)
3. Zero heads (and two tails)

And we can describe the sample space as:

Ω = {2H, 1H, 0H}

That's all technically true, but misleading, because one of those outcomes is composite while the other two are
elementary. That is: there are two ways to get one heads in two flips: you can get heads then tails or you can get
tails then heads. There is only one way to get two heads – you must get heads then heads again – and there is only
one way to get zero heads – you must get tails then tails again – so there are twice as many ways to get one heads
as to get either two heads or zero heads.

Thus, a better, less misleading way to characterize the outcomes and the sample space of two coin flips is to do so in
terms of the four elementary outcomes:

1. heads on flip one and heads on flip two


2. heads on flip one and tails on flip two
3. tails on flip one and heads on flip two
4. tails on flip one and tails on flip two

Ω = {HH, HT, TH, TT}

The probability of each of those elementary events is equal,31 and by breaking down events into their elementary
components, we avoid making the same mistake that d'Alembert did.
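A brute-force enumeration makes d'Alembert's error plain – here is a sketch that lists the four elementary outcomes and tallies heads (illustrative only):

# The four equally likely elementary outcomes of two flips, and the
# probability of each number of heads: 1/4, 1/2, 1/4 -- not 1/3 each.
from itertools import product

outcomes = list(product("HT", repeat=2))  # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

for k in (2, 1, 0):
    p = sum(o.count("H") == k for o in outcomes) / len(outcomes)
    print(k, "heads:", p)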

4.7 Permutations and Combinations


As we saw in the previous section on elementary and compound events, defining sample spaces is really important
for properly understanding the probabilities involved with a given set of events. In relatively simple defined
experiments like two flips of a fair coin or pulling a green marble out of a jar or drawing a card from a deck, the
sample spaces are straightforward to define. When there are lots more possibilities and contingencies between those
possibilities, defining the sample space can be much trickier. In this section, we are going to talk about two such
situations and the mathematical tools we have to make those tricky tasks a lot simpler.

4.7.1 Permutations

Here’s an example to describe what we’re talking about with complex sample spaces: imagine five people are
standing in a line.

Figure 4.14: Pictured: Five People Standing in Line in a Stock Photo that I did not Care to Pay For.

How many different ways can those five people arrange themselves? Let’s start with the front of the line. There are
five options for who stands in the first position. One of those five goes to the front of the line, and now there are four
people left who could stand in the second position. One of those four goes behind the first person, and then there are
three left for the third position, then there will be two left for the fourth position, and finally there will be only one
person available for the end of the line.

Let's call the people A, B, C, D, and E, which would be rude to do in real life but they're fictional so they don't
have feelings. For each of the five possible people to stand in the front of the line, there are four possibilities for
people to stand behind them:

If A is in the first position, then B ∪ C ∪ D ∪ E can be in the second position


If B is in the first position, then A ∪ C ∪ D ∪ E can be in the second position
If C is in the first position, then A ∪ B ∪ D ∪ E can be in the second position
If D is in the first position, then A ∪ B ∪ C ∪ E can be in the second position
If E is in the first position, then A ∪ B ∪ C ∪ D can be in the second position

That’s 5 × 4 = 20 possibilities of two people in the first two parts of the line. For each of those 20 possibilities,
there are three possible people that could go in the third position, so for the first three positions we have
5 × 4 × 3 = 60 possibilities. The pattern continues: for the first four positions we have 5 × 4 × 3 × 2 = 120

possibilities, and for each of those 120 possibilities there is only one person left to add at the end so we end up with
a total of 5 × 4 × 3 × 2 × 1 = 120 possibilities for the order of five people standing in a line.
In general, the number of possible orders of n things is, as in our example of five people standing in a line,
n × (n − 1) × (n − 2) × ... × 1. That expression is equivalent to the factorial of n, symbolized as n!.32 In our
example, there were n = 5 people standing in line so there were 5! = 120 possible orders. If you had two items to
put on a shelf, there would be 2! = 2 ways to arrange them; if you had six items to put on a shelf, there would be
6! = 720 ways to arrange them.

Now let's say that we had our same five people from above but only three could get in the line. How many ways
could three people selected out of the five stand in order? The math starts out the same: there are five possibilities
for the first position, four possibilities for the second position for each of those five, and three possibilities for the
third position for each of those pairs: 5 × 4 × 3 = 60 possibilities. What happens next? Nothing. There are no more
spots in the line, so whoever is left out doesn't get a spot in the line (which could be sad, but again: fictional people
don't have emotions so don't feel too bad for them).

At this point – and this may be overdue – we can define the term permutation.33 Permutations are arrangements of
things where the order of those things matters. The term includes orderings of all n objects, but also ordered
arrangements of fewer than n of those objects.

Note what happened when we removed two possible positions from the line where five people were to stand. We
took the calculation of the number of possible orders – 5! = 5 × 4 × 3 × 2 × 1 = 120 – and we removed the last
two terms (because we ran out of room in the line). We didn’t subtract them, but rather, we canceled them out. To do
that mathematically, we use division:
60 = (5 × 4 × 3 × 2 × 1)/(2 × 1) = 5 × 4 × 3

which is equivalent to:

60 = 5!/2!

Let’s briefly look at another example: imagine you had 100 items and you had a small shelf with space for only two
of those items; how many different ways could that shelf look? For the first spot on the shelf, you would have 100
options. For the second spot, whatever choice you made for the first spot would leave 99 possibilities for the second
spot. And then you would be out of shelf space, so the total number of options would be 100 × 99 = 9,900. Another
way to think of that is that from the 100! possible orders of all of your items, you canceled out the last 100 − 2 = 98
possibilities:

9,900 = 100 × 99 = 100!/98!

In general, then, we can say that the number of permutations of n things given that they will be put in groups of size r
– or as we officially say it, n things Permuted r at a time – and as we symbolically write it nPr – is:

nPr = n!/(n − r)!

That equation works just as well when r – the number of things being permuted at a time – is the same as n – the
number of things available to permute. In our original example, we had five people to arrange themselves in a line of
five. The number of permutations is given in that case by:

5P5 = 5!/(5 − 5)! = 5!/0! = 120

keeping in mind the fact noted in the sidenote above and in the bonus content below that the factorial of 0 is 1.
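For what it's worth, Python's standard library computes permutation counts directly – here is a quick check of the examples above (math.perm exists in Python 3.8 and later):

# n things permuted r at a time: math.perm(n, r) = n!/(n - r)!
import math

print(math.perm(5, 5))    # 120: five people, five spots in the line
print(math.perm(5, 3))    # 60: five people, only three spots
print(math.perm(100, 2))  # 9900: 100 items, a shelf that fits two
print(math.factorial(5) // math.factorial(5 - 3))  # 60 again, by the formula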

4.7.2 Combinations
Permutations are arrangements of things for which the order matters. When things are arranged in ways where the
order doesn't matter, those arrangements are called combinations.34 Above, we calculated the number of possible
permutations for five people standing in a line: 5P5 = 5!/0! = 120. But what if order doesn't matter – what if it's
just five people, standing not in a line, but just scattered around? How many ways can you have a combination of
five people given that you started with five people? Just one. If their relative positions don't matter, there's only one
way to combine five people in a group of five. The number of combinations is always reduced, relative to the
number of permutations, by a factor of the number of possible orders. As mentioned in the previous section, the
number of orders is the factorial of the number of things in the group: r!. Thus, we multiply the permutation formula
by 1/r!, putting r! in the denominator, to get the combination formula for n things Combined r at a time (also known
as the combinatorial formula):

nCr = n!/((n − r)! r!)

Thus, while there are 5!/0! = 120 possible permutations of five people grouped five at a time, there is just
5!/(0! 5!) = 1 possible combination of five people grouped five at a time.

Here’s another example: imagine that you are going on a trip and you have five books that you are meaning to read
but you can only bring three in your bag. How many combinations of three books can you bring?35 The order of the
books doesn’t matter, so we have five things combined three at a time:

5C3 = 5!/(2! 3!) = 10.
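And, as with permutations, the standard library will do this for us – a quick check of both combination examples:

# n things combined r at a time: math.comb(n, r) = n!/((n - r)! r!)
import math

print(math.comb(5, 5))  # 1: one way to group five people five at a time
print(math.comb(5, 3))  # 10: ways to choose three books from five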

The combination formula is going to be super-important later on when we talk about binomial probability, so you
can go ahead and start getting excited for that.

4.8 Odds
Odds36 are an expression of relative probability. We have three primary ways of expressing odds: odds in favor,
odds against, and odds ratios. The first two – odds in favor and odds against – are conceptually the same but,
annoyingly, are expressed in completely opposite ways.

4.8.1 Odds in Favor/Against

Let's assume that we have two possible events: event A and event B. The odds in favor of event A are expressed as
the numerator of the unsimplified ratio of the probabilities of A and B, followed by a colon, followed by the
denominator of that ratio. That probably sounds more complicated than it is – for example, if A is twice as probable
as B, then the odds in favor of A are 2 : 1. And, odds are almost always expressed in terms of integers, so if A is 3.5
times as probable as B, then the odds in favor of A are 7 : 2.

Another way of thinking of odds in favor is that it is the number of times that A will happen in relation to the number
of times that B will happen. For example, if team A is three times as good as team B, then in the long run team A
would be expected to win three times for every one time that team B wins, and the odds in favor of A are 3 : 1.

Yet another way of thinking about it – and odds are a total gambling thing so this is likely the classic way of thinking
about it – is that the odds are related to the amount of money that each player should bet if they are wagering on two
outcomes. Imagine two gamblers are making a wager on a contest between team A and team B, where team A is
considered to be three times as good as team B. Each gambler will put a sum of money into a pot; if team A wins,
then the gambler who bet on team A to win takes all of the money in the pot, and if team B wins, then the gambler
who bet on team B to win takes all of the money in the pot. It would be unfair for both gamblers to risk the same
amount in order to win the same amount – with team A being three times as good, somebody who bets on team A is
taking on much less risk. Thus, in gambling situations, to bet on the better team costs more for the same reward and
to bet on the worse team costs less to win the same reward. A fairer setup is for the gambler betting on team A to
pay three times as much as the gambler betting on team B.
Figure 4.15: Making things even more annoying is that the phrase the odds being in your favor has nothing to do with
odds in favor of you – it means that the odds are favorable, that is, that you are likely to win.

Now, this is kind of stupid, but odds against are the exact opposite of odds in favor. The odds against are the
relative probability of an event not happening to the probability of the event happening. If the odds in favor of event
A are 3 : 1, then the odds against event A are 1 : 3.

Since we've been talking about gambling, I feel the need to point out here that gambling odds are expressed as odds
against.

4.8.1.1 Odds and Probabilities

If the odds in favor of an event are known, the probability of the event can be recovered. Recall the example of a
contest in which the odds in favor of team A are 3 : 1, and that team A would therefore be expected to win three
contests for every one that they lose. If team A wins three for every one they lose, then they would be expected to
win 3 out of every 4 contests, so the probability of team A winning would be 3/4.

4.8.2 Odds Ratios

Another way of expressing odds is in terms of an odds ratio. For a single event, the odds ratio is the odds in favor
expressed as a fraction:

OR for A vs. ¬A = p(A)/(1 − p(A))

When we have separate sample spaces, the odds ratio is the literal ratio of the odds in favor of each event
(each expressed as a fraction). For example, imagine two groups of 100 students each that are given a statistics
exam. The first group took a statistics course, and 90 out of 100 passed the exam. The second group did not take the
statistics course, and 75 out of 100 did not pass. What is the odds ratio of passing the exam between the two groups?

The odds in favor of passing having taken the course, based on the observed results, are 9 : 1: nine people passed
for every one that failed. The odds in favor of passing having not taken the course are 1 : 3: 1 person passed for
every 3 who failed. The odds ratio is therefore:

OR = (9/1)/(1/3) = 27

Thus, the odds of passing the exam are 27 times greater for people who took the course.
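Here is the same odds-ratio calculation as a sketch with exact fractions (counts from the example above; names mine):

# Odds ratio: the ratio of the two groups' odds in favor of passing.
from fractions import Fraction

odds_with_course = Fraction(90, 10)     # 9:1 -- 90 passed, 10 failed
odds_without_course = Fraction(25, 75)  # 1:3 -- 25 passed, 75 failed

print(odds_with_course / odds_without_course)  # 27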

4.9 Conditional Probability


As noted above, a conditional probability is the probability of an event that depends on other events that change the
sample space. For example, imagine that I have two dice: one is a six-sided die, and the other is a twenty-sided die.

Figure 4.16: A six-sided die (left) and a twenty-sided die (right)


The sample space of possible events depends quite a bit on which die we are rolling: the sample space for the six-
sided die is {1, 2, 3, 4, 5, 6}, and the sample space for the twenty-sided die is
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}. Thus, the probability of rolling a 5 given that the
six-sided die is rolled is much different than the probability of rolling a 5 given that the twenty-sided die is rolled. In
other words, the conditional probability of rolling a 5 with a six-sided die differs from the conditional probability
of rolling a 5 with a twenty-sided die (the two conditional probabilities are 1/6 and 1/20, respectively).

Conditional probabilities are also called likelihoods37. Any time you see the word “likelihood” in a statistical
context, it is referring to a conditional probability.38

Sometimes, we can infer likelihoods based on the structure of a problem, as in the case of the six-sided vs. the
twenty-sided dice. Other times, we can count on (ha, ha) mathematical tools to assist. Two of the most important
tools in probability theory and – more importantly for our purposes – for statistical inference are Binomial
Probability and Bayes’s Theorem.

4.9.1 Binomial Probability

Let's talk some more about flipping coins (I promise this will lead to non-coin applications of an important
principle). The probability of each outcome of a coin flip is a conditional probability: it is conditioned on the
assumption that heads and tails are equally likely. The probability associated with multiple coin flips is also a
conditional probability: it is conditioned on the assumption of equal probability of heads and tails and on the number
of flips. For example: the likelihood of getting one heads in one flip of a fair coin – that is, the probability of heads
given that the probability of heads on each flip is 0.5 and given that there is one flip – is 0.5. Let's write that out
symbolically and introduce the notation π, defined as the probability of an event on any one given trial (and not the
ratio of the circumference of a circle to its diameter), and use N to mean the total number of trials:

p(H|π = 0.5, N = 1) = 0.5

The probability of getting one heads in two trials is the probability of getting heads on the first flip and tails on the
second flip or of getting tails on the first flip and heads on the second flip:

p(1 H |π = 0.5, N = 2) = p(H T ∪ T H ) = (0.5)(0.5) + (0.5)(0.5) = 0.5

The probability of getting one heads in three trials is the probability of getting heads on the first flip and tails on the
second flip and tails on the third flip or of getting tails on the first flip and heads on the second flip and tails on the
third flip or of getting tails on the first flip and tails on the second flip and heads on the third flip:

p(1 H |π = 0.5, N = 3) = p(H T T ∪ T H T ∪ T T H )

= (0.5)(0.5)(0.5) + (0.5)(0.5)(0.5) + (0.5)(0.5)(0.5) = 0.375

We could calculate the probability of any set of outcomes given any number of flips of coins by identifying the
probability of each possible outcome given the number of flips, identifying all of the ways we could get to those
outcomes, and adding all of those probabilities up. But, there is a better way.

A flip of a fair coin is an example of a binomial trial39. A binomial trial is any defined experiment where we are
interested in two outcomes. Other examples of binomial trials include mazes in which an animal could turn left or
right and survey questions for which an individual could respond yes or no. A binomial trial could also involve
more than two possible outcomes that are arranged into two groups: for example, a patient in a clinical setting could
be said to improve, stay the same, or decline; those data could be treated as binomial by arranging them into the
binary categories improve and stay the same or decline. Similarly, an exam grade for a student can take on any value
on a continuous scale from 0 – 100, but the value could be treated as the binary pair pass or fail.

The probability π can be any legitimate probability value: it can range from 0 to 1. Figure 4.17 adapts Figure 4.11 –
which was a probability tree depicting two flips of a fair coin – to generally describe two consecutive binomial
trials. Instead of the probability value of 0.5 that was specific to the probabilities of flipping either heads or tails,
we’ll call the probability of one outcome π, which makes the probability of the other outcome 1 − π. Instead of
heads and tails, we’ll use s and f , which stand for success and failure. Yes, “success” and “failure” sound
judgmental. But, we can define either of a pair of binomial outcomes to be the “success,” leaving the other to be the
“failure” – it’s a totally arbitrary designation and it just depends on which outcome we are interested in.

Figure 4.17: Two Binomial Trials

We can use Figure 4.17 to help us see that:


p(2s) = π²

p(1s) = (π)(1 − π) + (1 − π)(π)

p(0s) = (1 − π)²

In general, the likelihood of a set of outcomes is the probability of getting to each outcome times the number of ways
to get to that outcome. The probability of each path is known as the kernel probability40 and is given by:

kernel probability = πˢ(1 − π)ᶠ

Given the probability of each path, the overall probability is the sum of the paths. In other words, the binomial
likelihood is the product of the kernel probability and the number of possible combinations represented by the kernel
probability. The number of possible combinations is given by the combination formula:

NCs = N!/(s! f!)

Put them together, and we have…

4.9.1.1 The Binomial Likelihood Function

p(s|N = s + f, π) = (N!/(s! f!)) πˢ(1 − π)ᶠ

The binomial likelihood function makes it easy to find the likelihood of a set of binomial outcomes knowing only (a)
the probability of success on any one trial and (b) the number of trials. For example:

1. What is the probability of getting exactly 7 heads in 10 flips of a fair coin?


We'll call heads a success, we know that in the case of a fair coin the probability of heads is 0.5, and there are
N = 10 trials.

p(s = 7|π = 0.5, N = 10) = (10!/(7! 3!))(0.5)⁷(0.5)³ = 0.1171875

2. What is the probability of drawing exactly three blue marbles in five draws with replacement41 from a jar
containing 1 blue marble and 4 orange marbles?

We'll call blue a success, we know the probability of drawing a blue marble from a jar with one blue marble and
four orange marbles is 1/5 = 0.2, and there are N = 5 trials.

p(s = 3|π = 0.2, N = 5) = (5!/(3! 2!))(0.2)³(0.8)² = 0.0512

3. The probability of winning a game of craps at a casino is approximately 49%. If 15 games are played, what is the
probability of winning at least 12?

We'll call winning a success (not much of a stretch there), we are given that π = 0.49, and there are N = 15 trials.
This question specifically asks for the probability of at least 12 successes, which means we are looking for the
probability of winning 12 or 13 or 14 or 15 games. In other words, we have union probabilities of mutually
exclusive events (you can't win both 12 and 13 games out of 15), so we add them.

p(s ≥ 12|π = 0.49, N = 15) = (15!/(12! 3!))(0.49)¹²(0.51)³ + (15!/(13! 2!))(0.49)¹³(0.51)²
+ (15!/(14! 1!))(0.49)¹⁴(0.51)¹ + (15!/(15! 0!))(0.49)¹⁵(0.51)⁰ = 0.01450131
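All three examples can be checked with a few lines of Python – here is a minimal sketch of the binomial likelihood function (the function name is mine):

# Binomial likelihood: p(s | N, pi) = [N!/(s! f!)] * pi**s * (1 - pi)**f
from math import comb

def binomial_likelihood(s, N, pi):
    return comb(N, s) * pi**s * (1 - pi)**(N - s)

print(binomial_likelihood(7, 10, 0.5))   # 0.1171875
print(binomial_likelihood(3, 5, 0.2))    # 0.0512
print(sum(binomial_likelihood(s, 15, 0.49) for s in range(12, 16)))  # ≈ 0.0145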

4.9.1.2 Expected Value and Variance of Binomial Events

A nice property of binomial probability is that the expected value and the variance are especially simple to find. We
can still use the typical equation for the expected value:
E(x) = ∑ᵢ₌₁ᴺ xᵢ p(xᵢ)

but consider that binomial outcomes are binary data and can be assigned values of 1 and 0. For example, let’s
consider 2 flips of a fair coin where heads is considered success and is assigned a value of 1. The sample space for
s is Ω = {0, 1, 2}. As we have noted elsewhere, the probability of 0 heads in 2 flips is 0.25, the probability of 1
heads in 2 flips is 0.5, and the probability of 2 heads in 2 flips is 0.25. Thus, the expected value of s is:

E(x) = (0)(0.25) + (1)(0.5) + (2)(0.25) = 1

Thus, in two flips of a fair coin, we can expect one heads. In general, the expected value of a set of N binomial
trials with a p(s) = π where s = 1 and f = 0 is:

E(x) = N π

The variance is similarly easy. The variance of a set of N binomial trials with a p(s) = π where s = 1 and f = 0

is:

V (x) = N π(1 − π)

To illustrate using the general variance formula for the case of two flips of a fair coin (where we have already
shown that E(x) = 1):

V(x) = (0 − 1)²(0.25) + (1 − 1)²(0.5) + (2 − 1)²(0.25) = 0.25 + 0 + 0.25 = 0.5

which is equal to N π(1 − π) = (2)(0.5)(0.5).
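A two-line check of those shortcut formulas for the two-flip example (values from the text):

# E(x) = N*pi and V(x) = N*pi*(1 - pi) for two flips of a fair coin.
N, pi = 2, 0.5
print(N * pi, N * pi * (1 - pi))  # 1.0 expected heads, variance 0.5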

4.9.2 Bayes’s Theorem

Figure 4.18: The Setup to a Brief Series of Relatively Easy Probability Problems

Please examine the contents of the two poorly-drawn jars in Figure 4.18: Jar 1 is filled exclusively with orange
marbles and Jar 2 is filled exclusively with blue marbles. Then, please consider the following questions regarding
conditional probabilities:

1. What is the probability of drawing an orange marble given that you have Jar 1?

2. What is the probability of drawing a blue marble given that you have Jar 1?

3. What is the probability of drawing an orange marble given that you have Jar 2?

4. What is the probability of drawing a blue marble given that you have Jar 2?

If we are drawing from Jar 1, we can only draw orange marbles, so there is a 100% chance of drawing orange given
that we have Jar 1 and a 0% chance of drawing blue given that we have Jar 1. The reverse is true for Jar 2: there is
a 100% chance of drawing blue and a 0% chance of drawing orange given that we have Jar 2. In this example, the
conditional probability of drawing an orange marble or a blue marble depends entirely on what jar we have. In more
formal terms, the sample space of the possible events is conditional on what jar we have: if we have Jar 1, then the
sample space is {Orange, Orange, Orange, ...}; if we have Jar 2, then the sample space is
{Blue, Blue, Blue, ...}. Yet another fancy way of saying that is to say that the choice of jar reconditions the

sample space.

Now let’s add a fifth question about these jars. Imagine that all of the marbles in Jar 1 are still orange and all of the
marbles in Jar 2 are still blue, but now the jars are opaque and a marble is drawn without looking inside.

5. Given that a blue marble is drawn, what is the probability that we have Jar 2?

Since there are no blue marbles in Jar 1 and there are blue marbles (and nothing but) in Jar 2, we must conclude that
there is a 100% chance that we have the now-much-less-transparent Jar 2.

Things would be a little trickier if there were a mixture of orange and blue marbles in each jar, or if one jar was
somehow more probable to have than the other before drawing, or if there were three, four, or more jars. Fortunately,
we have math to help us. Specifically, there is one equation that helps us calculate conditional probabilities, and it’s
a pretty important one.

Figure 4.19: Hilariously, this may or may not be a picture of the Reverend Thomas Bayes. Even more hilariously,
Thomas Bayes may or may not have been the one who first derived Bayes's Theorem. The Bayes brand is
uncertainty, and it is strong.

The equation that gives the conditional probability of one event given the other is known as Bayes’s Theorem.
Bayes’s theorem follows directly from the definition of conditional probability mentioned abote, that is:

The probability of A and B happening p(A ∩ B) is the product of the probability of A given B and the
probability of B:

p(A ∩ B) = p(A|B)p(B)

Since the designation of the names A and B is arbitrary, we can also rewrite that as:

p(A ∩ B) = p(B|A)p(A)

and since both p(A|B)p(B) and p(B|A)p(A) are equal to p(A ∩ B), it follows that:

p(A|B)p(B) = p(B|A)p(A)

We can divide both sides of that equation by p(B), resulting in:

p(A|B) = p(B|A)p(A) / p(B)

which is Bayes’s Theorem, although it’s more commonly written with the top two terms of that numerator switched:

p(A|B) = p(A)p(B|A) / p(B)

Let’s bring back our pair of marble jars to demonstrate Bayes’s Theorem in action. This time, let’s stipulate that
there are 40 marbles in each jar. Jar 1 contains 10 orange marbles and 30 blue marbles; Jar 2 contains 30 orange
marbles and 10 blue marbles:
Figure 4.20: The Setup to a Brief Series of Conditional Probability Problems

In this set of examples, let’s assume that one of these jars is chosen at random, but which jar is chosen is unknown.
That means that we can reasonably assume that the probability of choosing Jar 1 is the same as the probability of
choosing Jar 2. As above, a marble is drawn – without looking inside of the jar – from one of the jars. Here is the
first of a new set of questions:

1. If a blue marble is drawn, what is the probability that the jar that was chosen is Jar 1?

This problem is ideally set up for using Bayes’s Theorem.42 The (slightly rephrased) question:

What is the probability of Jar 1 given a blue marble?

is equivalent to asking what p(A|B) is, where event A is choosing Jar 1 and event B is drawing a blue marble. To
use Bayes’s Theorem to solve for p(A|B), we are going to need to first find p(A), p(B|A), and p(B).

p(A) :

Event A is Jar 1, so p(A) is the probability of choosing Jar 1. We have stipulated that the probability of choosing Jar 1 is the same as the probability of choosing Jar 2. Because one or the other must be chosen (i.e., the sample space Ω = {Jar 1, Jar 2} and the probability of choosing Jar 1 or Jar 2 is one (i.e., p(Jar 1 ∪ Jar 2) = 1, or equivalently, ∑ Ω = 1)), then:

p(A) = p(Jar 1) / (p(Jar 1) + p(Jar 2)) = p(A) / 2p(A) = 1/2 = 0.5

p(B|A) :

Event B is (drawing a) blue (marble), so p(B|A) is the probability of blue given Jar 1. If Jar 1 is given, then the
sample space for drawing a blue marble is restricted to the probability of drawing a blue marble from Jar 1: Jar 2
and the marbles in it are ignored. Because there are 30 blue marbles and 10 orange marbles in Jar 1 (i.e.,
Ω = {30 Blue, 10 Orange}), the probability of B|A – drawing a blue marble given that the jar is Jar 1 – is:

p(B|A) = 30/40 = 3/4 = 0.75

p(B) :

Again, event B is drawing a blue marble. There is no conditionality to the term p(B): it refers to the overall
probability of drawing a blue marble from either jar. Finding p(B) is going to take a bit of math, and we can use a
probability tree to help us, too.

Figure 4.21: Probability Tree for Choosing Jars and Then Drawing Orange and Blue Marbles From Those Jars

There are two ways to get to a blue marble: (1) drawing Jar 1 with 50% probability and then drawing a blue marble
with 75% probability, and (2) drawing Jar 2 with 50% probability and then drawing a blue marble with 25%
probability. As we do with probability trees, we multiply across to find the intersection probabilities of
Jar 1 ∩ Blue and Jar 2 ∩ Blue⁴³ and add down to find the union probabilities of the two paths that lead to blue marbles to get the overall probability of drawing a blue marble (p(B)):

p(B) = (0.5)(0.75) + (0.5)(0.25) = 0.5

Now that we know p(A), p(B|A), and p(B), we can calculate p(A|B): the probability that Jar 1 was chosen given
that a blue marble was drawn:

p(A|B) = p(A)p(B|A) / p(B) = (0.5)(0.75) / 0.5 = 0.75
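If you would like to check that with software, here is a minimal R sketch of the same calculation (the variable names are mine):

p_A   <- 0.5                            # prior: p(Jar 1)
p_B_A <- 30/40                          # likelihood: p(Blue|Jar 1)
p_B   <- (0.5)*(30/40) + (0.5)*(10/40)  # base rate: p(Blue) across both jars

p_A * p_B_A / p_B                       # posterior: p(Jar 1|Blue)

## [1] 0.75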

Here’s another question:

2. What is the probability that Jar 2 was chosen given that a blue marble was drawn?

We can approach this question in two different ways. First, we can use the same steps that we did to solve the first
question, with Jar 2 now being event A and drawing a blue marble still being event B. Since we know that p(Jar 1) = p(Jar 2), p(A) is still 0.5. p(B|A) is now the probability of drawing a blue marble from Jar 2, so p(B|A) in this case is equal to 10/40 = 1/4 = 0.25. p(B) is still the overall probability of drawing a blue marble, which remains unchanged: p(B) = 0.5. Thus, the probability of Jar 2 is:

p(A|B) = p(A)p(B|A) / p(B) = (0.5)(0.25) / 0.5 = 0.25

The other way we could solve that problem – the much easier way – is to note that if the probability that we have Jar
1 is 0.75, then the probability that we don’t have Jar 1 is:

1 − p(J ar 1|Blue) = 1 − 0.75 = 0.25


I probably could have led with that. Anyway, the probability that we have Jar 1 is 0.75 and the probability that we
have Jar 2 is 0.25, meaning that it is three times more probable that we have Jar 1 than that we have Jar 2.44 That
lines up with the fact that the proportion of blue marbles is three times greater in Jar 1 than it is in Jar 2.

Now let’s try a slightly trickier question:

3. After drawing a blue marble from one of the jars, the marble is replaced in the same jar, the jar is shaken, and a
second marble is drawn. The second marble drawn is also blue. What is the probability that the jar is Jar 1?

To reduce confusion, let's add some subscripts to A and B: we can add 1 to refer to anything that happened on the first draw, and 2 to refer to what's happening on the second draw.

p(A₂):

Event A is again Jar 1, but this time, p(A) is different. Having already drawn a blue marble from that same jar, we must update the probability that we have Jar 1 based on that new information. In the last problem, we found the probability that we have Jar 1 given that a blue marble was drawn. That probability is our new p(A). Thus, our new p(A) – denoted p(A₂)⁴⁵ – is p(A₂|B₁): the updated probability of having Jar 1 in this second draw given that a blue marble was drawn on the first draw:

p(A₂) = p(A₂|B₁) = p(A₁)p(B₁|A₁) / p(B₁) = (0.5)(0.75) / 0.5 = 0.75

p(B₂|A₂):

Event B is still (drawing a) blue (marble), and the relative numbers of marbles in Jar 1 are still the same (because we replaced the marble), so p(B₂|A₂) is still 30/40 = 0.75.

p(B₂):

The denominator of Bayes’s Theorem, like the p(A) term, also needs to be updated because we no longer believe
that having either jar is equally probable. The change happens between the first and second nodes of the probability
tree:

Figure 4.22: Probability Tree for Having a Particular Jar and Then Drawing Orange and Blue Marbles From That
Jar Having Once Drawn a Blue Marble From That Jar
There are still two ways to get to a blue marble, but the probability associated with each path has changed: (1)
drawing Jar 1 with 75% probability and then drawing a blue marble with 75% probability, and (2) drawing Jar 2
with 25% probability and then drawing a blue marble with 25% probability. As before, we multiply across to find
the intersection probabilities of Jar 1 ∩ Blue and Jar 2 ∩ Blue⁴⁶ and add down to find the union probabilities of
the two paths that lead to blue marbles to get the updated overall probability of drawing a blue marble (p(B)):

p(B₂) = (0.75)(0.75) + (0.25)(0.25) = 0.625

With new values for each term of Bayes's Theorem – p(A₂), p(B₂|A₂), and p(B₂) – we can calculate p(A₂|B₂): the probability that Jar 1 was chosen given that a blue marble was drawn on consecutive draws:

p(A₂|B₂) = p(A₂)p(B₂|A₂) / p(B₂) = (0.75)(0.75) / 0.625 = 0.9

Now we are more certain that we have Jar 1 in light of additional evidence: the probability was 0.75 after drawing
one blue marble from the jar, and now that we have drawn two consecutive blue marbles (with replacement) from
that jar, the probability is 90%. In other words, drawing two marbles leads to greater confidence that we have the
blue-heavy jar in our hands. Additionally, we know that if there is a 90% probability that we have Jar 1, then there is a 100% − 90% = 10% chance that we have Jar 2: it is now nine times more likely that we have Jar 1 than Jar 2. Let's
ask one more question – this one a two-parter – before we move on:

4. What is the probability that the jar is Jar 1 if the next draw (after replacing the blue marble) is:

a. a blue marble?
b. an orange marble?

I will leave going through all of the steps again as practice for the interested reader and skip to the Bayes's Theorem results. For part (a), if another blue marble is drawn, then A₃ is having Jar 1 given the two previous draws and B₃ is drawing a third blue marble:

p(A₃|B₃) = p(A₃)p(B₃|A₃) / p(B₃) = (0.9)(0.75) / ((0.9)(0.75) + (0.1)(0.25)) = 0.675 / (0.675 + 0.025) = 0.96

With three consecutive draws of blue marbles, the probability is high – 96%, to be precise – that we have the
majority-blue jar.

Let’s address part (b) of the question: what happens if we draw an orange marble? If we, in fact, were drawing from
Jar 1, drawing an orange marble would be far from impossible – there are 10 orange marbles in Jar 1. But, since
orange marbles are much more likely to come from Jar 2, having Jar 1 would be less certain than if we had drawn a
third consecutive blue marble.

For this part of the question, event B₃ is drawing an orange marble on the third draw, and event A₃ is having Jar 1 given the information we got from the first two draws. Again, we'll leave all of the steps as an exercise for the interested reader and skip to the final Bayes's Theorem equation:

p(A₃|B₃) = p(A₃)p(B₃|A₃) / p(B₃) = (0.9)(0.25) / ((0.9)(0.25) + (0.1)(0.75)) = 0.75

…and so the probability that we have Jar 1 takes a step back.47 Updating is an important feature of conditional
probabilities calculated using Bayes’s Theorem. Drawing marbles with replacement from a jar doesn’t change the
probabilities associated with those marbles: if the jar contains a lot more orange marbles than blue marbles and we
pick three blue marbles in a row, we’d just say “huh, that’s weird” and move on with our lives – low-probability
events are not impossible events. But, if we know the relative numbers of marbles in several jars but which jar we have is unknown, then the chances of having each of those jars can be continually updated as we obtain more information.
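The whole sequence of updates is compact enough to express in a few lines of R. Here is a minimal sketch (the function and variable names are mine) in which each draw's posterior becomes the prior for the next draw:

# p(Jar 1), updated after one draw; the other jar's prior is 1 - prior
update_jar1 <- function(prior, p_draw_jar1, p_draw_jar2) {
  base_rate <- prior * p_draw_jar1 + (1 - prior) * p_draw_jar2
  prior * p_draw_jar1 / base_rate
}

p <- 0.5                             # equal priors before any draws
p <- update_jar1(p, 0.75, 0.25); p   # after the first blue draw: 0.75
p <- update_jar1(p, 0.75, 0.25); p   # after the second blue draw: 0.9
p <- update_jar1(p, 0.25, 0.75); p   # after an orange draw: back to 0.75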
If you are still reading this, you may have noticed that I’ve belabored the discussion of jars and marbles to an extent
that might suggest that I’m talking about jars and marbles but at the same time I’m not really talking about jars and
marbles.

I’m talking about science.

4.9.2.1 Bayes’s Theorem and Statistical Inference

In the labored analogy above, the jars represent hypotheses about the world and the marbles represent data that can
support or contradict those hypotheses. A hypothesis can represent a model of how a scientific process works and/or
parameters that cannot be directly measured. In the Bayesian framework of evaluating scientific hypotheses, testing
different hypotheses is like choosing an opaque, unlabeled jar: there are probabilities associated with each
hypothesis being correct (like the probabilities of choosing different jars), we have some idea of the probability of
what the data would look like given each hypothesis (like the probability of drawing a kind of marble given the jar),
and an idea of the overall probability of the data (like the overall probability of drawing a kind of marble).

When we’re using Bayes’s Theorem to refer to scientific inference, we change the notation in two subtle but
important ways: instead of A we use H to mean hypothesis and instead of B we use D to mean data. The resulting
equation is:

p(H|D) = p(H)p(D|H) / p(D)

Each of the terms of Bayes’s Theorem has a special name that reflects its role in the inferential process.

4.9.2.1.1 Terms of Bayes’s Theorem

4.9.2.1.1.1 Prior Probability

p(H) is the probability of the hypothesis; more specifically, it is the prior probability.⁴⁸ The prior probability
(often referred to simply as the prior) is the probability of a given hypothesis before data are collected in a given
experiment or investigation. We may initially have no reason to believe that one hypothesis is any more likely than
the others under consideration: in that case, prior probabilities can be based on equal assignment.49 On the other
hand, we may have reason to consider some hypotheses as much more likely than others based on our scientific
beliefs and/or existing evidence. And, just as we updated p(A) for jars based on marbles, we can use the results of
one Bayesian investigation to update p(H ) for subsequent investigations – more on that below.

4.9.2.1.1.2 Likelihood

p(D|H) is the probability of the data given the hypothesis, and is known as the likelihood. Technically, any conditional probability is a likelihood – the left side of our Bayes's Theorem equation (p(H|D)) is also a likelihood – but because p(H|D) has its own name (the posterior), we almost never refer to it as such. The likelihood informs us how likely or
unlikely the observed data are if a hypothesis is true. The likelihood of the data is typically determined by a
likelihood function: if the observed data are close to what the likelihood function associated with a hypothesis
would predict, then the likelihood of the data given the hypothesis will be relatively high; if the observed data are
way off from the predictions that the likelihood function makes, the likelihood of the data given the hypothesis will
be relatively low. How do we know what the likelihood function for a hypothesis should be? Well, sometimes it’s
obvious, and sometimes it’s tricky, and sometimes there isn’t one so we use computer simulations to get the
likelihood by brute force. In any event, it’s usually not as simple as the marble-and-jar analogy might make it seem,
but we will cover that at length in later content.

4.9.2.1.1.3 Base Rate

p(D) is the overall probability of the data and is known as the base rate (the unconditioned probability of observed events). In theoretical terms, the base rate – like the probability of drawing a certain type of marble from all jars – is the combined probability of observing the data under all possible hypotheses. In practical terms, the base rate is the value that makes the results of Bayes's Theorem (as discussed just below, the left-hand side of Bayes's Theorem is known as the posterior probability) for all of the hypotheses sum to one. That is, if we have n hypotheses, and we have a prior and a likelihood for each hypothesis i, then the base rate is:

p(D) = ∑ᵢ₌₁ⁿ p(Hᵢ)p(D|Hᵢ)

Please note that this is precisely the calculation we used for the Bayes's Theorem denominator in the marbles-in-jars problems, just using different terms – with n = 2 hypotheses (the two jars),

p(Jar 1)p(Blue Marble|Jar 1) + p(Jar 2)p(Blue Marble|Jar 2)

is an instance of

∑ᵢ₌₁ⁿ p(Hᵢ)p(D|Hᵢ)

4.9.2.1.1.4 Posterior Probability

The final term to discuss is the left-hand side of Bayes's Theorem: the probability of the hypothesis given the data, p(H|D), which is known as the posterior probability.⁵⁰ The posterior probability (often just called the posterior)
is the updated probability of a hypothesis after taking into account the prior, the likelihood, and the base rate. In the
Bayesian framework of scientific inference, it is the answer we have after conducting an investigation. However, it
is not necessarily our final answer: as illustrated in our marbles-in-jars problems, we can update probabilities by
obtaining more evidence. In the scientific context, this means that we can use our posterior for one study as the prior
for the next study.
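Putting the four terms together: here is a minimal R sketch (the function name is mine) of a full Bayesian update over a set of hypotheses, with the base rate computed as the sum of prior × likelihood across hypotheses:

posterior <- function(prior, likelihood) {
  prior * likelihood / sum(prior * likelihood)  # the base rate normalizes
}

# The two-jar problem as hypotheses H1 and H2, with data "blue marble":
posterior(prior = c(0.5, 0.5), likelihood = c(0.75, 0.25))

## [1] 0.75 0.25

The two posteriors sum to one, which is exactly the job the base rate does in the denominator.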

4.10 Monte Carlo Methods


I don’t know if you noticed, but the citation date on Kolmogorov’s axioms is 1933. That is, on the scale of
mathematical discoveries, incredibly recent! The concept of randomness is not one that humans naturally understand
very well (ahem), and it wasn’t until the 20th century that we learned to use randomness to solve problems rather
than just to be a problem. Luckily (ha, ha), it was also in the 20th century that we developed the computing power to
use repeated random sampling to solve problems that are intractable with analytic methods like calculus.

Monte Carlo methods are named after the Casino de Monte-Carlo (pictured above). One of the developers of the
technique had an uncle with a gambling problem who habitually frequented the Monte Carlo Casino.

The practice of using repeated random sampling to solve problems in statistics (and other fields, including physics
and biology) is known as using Monte Carlo methods or, equivalently, as using Monte Carlo simulations or Monte
Carlo sampling. Here, we will introduce Monte Carlo methods with some relatively straightforward examples.
4.10.1 Random Walk Models

Sticking with the casino theme: let's imagine a gambler playing craps for $10 per game. On any given game, she has a 49% chance of winning (in which case her total money will increase by $10) and a 51% chance of losing (she will lose $10).⁵¹ How can we project how her money will be affected by playing the game?

We know that, over the long run, she will win 49% of her games, and we can calculate that the expected value of
each gamble is ($10)(0.49) + (−$10)(0.51) = −$0.20, meaning that she can expect to lose 20 cents per game. But,
gambling doesn’t really work that way: if the gambling experience consisted of periodically handing 20 cents to the
casino, nobody would gamble (I think). Our gambler is going to win some games, and she is going to lose some
games. It’s possible that she will, over a given stretch, win more games than she loses and will leave the craps table
when she hits a certain amount of winnings. It’s also possible that she will lose more games than she wins and will
leave the craps table when she is out of money. Or, she might not win as much as she wants nor lose all her money,
but get tired after a certain number of games.

We can model our hypothetical gambler’s monetary journey using a random walk model52. A random walk model
(sometimes more colorfully called the drunkard’s walk) is a (usually visual) representation of a stochastic53
process, and has been applied in psychology to the process of arriving at a decision, in physics to the motion of
particles, in finance to the movement of markets, and in other fields.

In the case of our gambler, let’s say she has $100 with her. She will stop playing if:

1. she wins a total of $100 (and finishes with $200)


2. she loses the $100 (and finishes with $0)
3. she plays 200 games and gets tired (honestly this one is just so the graphs fit on the page)

We can see how one simulation based on Monte Carlo sampling plays out in Figure 4.23.

Figure 4.23: A Random Walk

On this specific walk, our gambler starts strong, winning her first four games and nine of her first 11 to go up $70 to a peak of $170. The rest of her session doesn't go quite so well, but she is still not broke after 200 games: she walks away with $70, presumably to be spent on overpriced adult beverages by the pool.
But, it is important to note that this is just one possible set of circumstances. To get a better idea of what she could
expect, we would want to simulate the experience lots of times. The expected value here (still −$0.20) is less
interesting a prediction than how many times she leaves having hit her goal earnings, how many times she loses all
her money, and how many times she ends up somewhere in between. A small set of multiple random walk simulations is visualized in Figure 4.24.

Figure 4.24: 9 Random Walks

In these 9 simulations, our gambler leaves the game up $100 three times, loses her whole bankroll four times, and
leaves after 200 games twice.

Typically, simulations are run in greater numbers – just as one run can appear to be lucky or unlucky, a handful of
simulations is rarely compelling. Thus, let’s try a greater number of random walks. In a run of 1,000 simulations
(don’t worry, we won’t make a figure for this one), in 424 of them (42.4%), our gambler left with $200, in 449
(44.9%) of them, she left with $0, and in 127 (12.7%), she left the table after 200 games having neither hit her goal
nor having lost all of her money. Please note how different this is from what would be predicted using the expected value of each gamble: if she lost $0.20 every time she played (which, of course, is impossible, because she can only win $10 or lose $10), then she would be guaranteed to leave with $60. The simulation results are more meaningful because they describe likely outcomes based on realistic behavior.
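A simulation like that takes only a few lines of R. Here is a minimal sketch (the function name and structure are mine; the exact proportions will vary from run to run because the draws are random):

simulate_session <- function(bankroll = 100, goal = 200, bet = 10,
                             p_win = 0.49, max_games = 200) {
  for (game in 1:max_games) {
    bankroll <- bankroll + ifelse(runif(1) < p_win, bet, -bet)
    if (bankroll <= 0 || bankroll >= goal) break  # broke, or goal reached
  }
  bankroll
}

results <- replicate(1000, simulate_session())
mean(results >= 200)               # proportion of sessions hitting the goal
mean(results <= 0)                 # proportion of sessions going broke
mean(results > 0 & results < 200)  # proportion leaving tired after 200 games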

4.10.2 Markov Chain Monte Carlo (MCMC)

Figure 4.25: Andrey Markov


A Markov Chain, like the random walk model (the random walk model is considered a special case of a Markov chain), is a representation of a stochastic process that moves in steps. In a Markov chain, the steps are known as transitions between different states, and each transition occurs with some probability. Markov chains have specific applications in psychology and group dynamics but, more importantly for us, they underlie the stochastic search methods that we will use in statistics.

One psychological application of Markov Chain models is in learning and memory. For example, in William Estes’s
(1953) 3-state learning model54 there are three possible states for an item to be learned: the item could be in the
unlearned state, in a short-term state (where it may be temporarily memorable but not for long), or in a long-term
state (that is relatively permanent but from which things can be forgotten).

According to the Estes model (see Figure 4.26), during the course of learning, an item starts out in an unlearned state – which we will call state 1 – and stays in state 1 with probability a. Since the item stays in state 1 with probability a, it follows that it leaves that state with probability 1 − a. If the item leaves the unlearned state, it has two places to go: the short-term state – state 2 – and the long-term state – state 3. We can designate the probability of going to state 2 as b and the probability of going to state 3 as 1 − b: therefore, the probability of an item leaving state 1 and going to state 2 is (1 − a)(b) and the probability of an item leaving state 1 and going to state 3 is (1 − a)(1 − b). We can then continue, designating the probability of an item in state 2 staying there as c and of an item leaving state 2 as 1 − c, etc., etc., as shown in Figure 4.26.

Figure 4.26: A Three-State Markov Chain Model

Just as we designated the probabilities of winning and losing in the random walk, we can designate the probabilities
a, b, c, d, e, and f and use those probabilities to repeatedly simulate the stochastic motion of items between memory

states using the Markov Chain model.
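Here is a minimal R sketch of such a simulation. The transition matrix is illustrative only – the text above does not fix values for a, b, c, and the rest, so these numbers are my assumptions; the only requirement is that each row sums to one:

P <- rbind(c(0.6, 0.3, 0.1),   # from state 1 (unlearned): stay, to 2, to 3
           c(0.2, 0.5, 0.3),   # from state 2 (short-term)
           c(0.0, 0.1, 0.9))   # from state 3 (long-term; mostly stays)

simulate_chain <- function(P, start = 1, steps = 20) {
  states <- numeric(steps)
  states[1] <- start
  for (t in 2:steps) {
    # the next state depends only on the current state's row of P
    states[t] <- sample(1:3, size = 1, prob = P[states[t - 1], ])
  }
  states
}

simulate_chain(P)  # one stochastic path of an item between memory states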

Far more important for us, though, is that we can use Markov Chain Monte Carlo methods to generate probability
distributions about scientific hypotheses by moving numeric estimates from beginning states to increasingly more
probable states. We’ll talk about that later.

4.11 Summary Information


4.11.1 Glossary

Event (or Outcome – we will use them interchangeably): A thing that happens. Roll two dice and they come up 7 – that's an event (or, an outcome). Flip a coin and it comes up heads – that's an outcome (or, an event).

p(A) : the probability of event A. For example, if event A is the probability of a coin landing heads, we can write:

p(A) = 0.5
or we can spell out heads instead of using A:

p(heads) = 0.5

¬ A: a symbol for not A, as in “event A does not happen.” If event A is a coin landing heads, we can say

p(¬ A) = 0.5 = p(tails)

Trial: A single occurrence where an event can happen. “One flip of a coin” is an example of a trial.

Defined experiment: An occurrence or set of occurrences where events happen with some probability. “One flip of a coin” is a defined experiment. “Three rolls of a single six-sided die” is a defined experiment. “On Tuesday” is a defined experiment if you're asking, “what's the probability that it will rain on Tuesday?” Please note that a defined experiment could be one trial or it could comprise multiple trials.

Sample space (symbolized by Ω or S): All possible outcomes of a defined experiment. For example, if our defined experiment is “one flip of a coin,” then the sample space is one heads and one tails, which we would symbolize as: Ω = {H, T}

Elementary event: An event that can only happen one way. If we ask, “what is the probability that a coin lands heads twice in two flips?”, there's only one way that can happen: the first flip must land heads and the second flip must also land heads.

Composite (or compound) event: An event that can happen multiple ways. If we ask, “what is the probability that a coin lands heads twice in three flips?”, there are three ways that can happen: the flips go heads, heads, tails; the flips go heads, tails, heads; or the flips go tails, heads, heads.

Mutual exclusivity: A condition by which two (or more) events cannot both (or all) occur. In one flip of a coin,
heads and tails are mutually exclusive events.

Collective exhaustivity: A condition by which at least one of a set of events must occur in a defined experiment. For
one flip of a coin, heads and tails are said to be collectively exhaustive events because one of them must happen.

Independent events: Events where the probability of one event does not affect the probability of another. For example, the probability that your birthday is July 20 is totally unaffected by whether the birthday of a randomly chosen person who is unrelated to you is also July 20.

Dependent events: Events where the probability of one event does affect the probability of another. The probability
of a twin having a birthday on July 20 is very closely related to the probability of her twin sister having a birthday
on July 20.

Union (symbolized by ∪): or. As in, “what is the probability that it will rain on Tuesday or Friday of next week?”
We may rephrase that question as “what is the union probability of rain on Tuesday or Friday?”

Intersection (symbolized by ∩): and. As in, “what is the probability that it will rain on Tuesday and Friday of next
week?” or “what is the intersection probability of rain on Tuesday and Friday?” Intersection probability is also
referred to as conjoint or conjunction probability.

4.11.2 Formulas

4.11.2.1 Intersections and Unions

Probability of A and B:

p(A ∩ B) = p(A)p(B|A)

If A and B are independent, then p(B|A) = p(B), so p(A ∩ B) = p(A)p(B).

Probability of A and B and C:

p(A ∩ B ∩ C) = p(A)p(B|A)p(C|A ∩ B)

Probability of A or B:

p(A ∪ B) = p(A) + p(B) − p(A ∩ B)

Probability of A or B or C (if A, B, and C are mutually exclusive):

p(A ∪ B ∪ C) = p(A) + p(B) + p(C)

Note: if any pair of events is mutually exclusive, then the probability of any intersection involving them is 0. For example, given events A, B, and C, if A and B are mutually exclusive, then p(A ∩ B) = 0 and p(A ∩ B ∩ C) = 0.

4.11.2.2 Permutations and Combinations

n things permuted r at a time:

nPr = n! / (n − r)!

n things combined r at a time:

nCr = n! / (r!(n − r)!)

nCr can also be written:

s+fCs = (s + f)! / (s! f!)

which will be helpful in the context of the Binomial Likelihood Function.
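R has built-in functions for these formulas – factorial() and, for combinations, choose() – which we can use to check the algebra (a quick sketch; the values are arbitrary):

factorial(5) / (factorial(2) * factorial(3))  # 5 things combined 2 at a time

## [1] 10

choose(5, 2)                                  # the same combination, built-in

## [1] 10

factorial(5) / factorial(3)                   # 5 things permuted 2 at a time

## [1] 20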

4.11.2.3 Binomial Probability

The Binomial Likelihood Function (the probability of s successes and f failures in N = s + f trials):

p(s|N = s + f, π) = [N! / (s! f!)] πˢ(1 − π)ᶠ

4.11.2.4 Conditional Probability

Bayes’s Theorem:
also known as

p(H )p(D|H )
p(H |D) =
p(D)

where H is a scientific hypothesis and D is the observed data

p(A)p(B|A)
p(A|B) =
p(B)

4.12 Bonus Content


4.12.1 Why is the factorial of zero equal to one?

1. Algebraic answer:

The factorial of any number is that number multiplied by the factorial of that number minus one:

4! = 4 × 3 × 2 × 1 = 4 × 3!
3! = 3 × 2 × 1 = 3 × 2!

2! = 2 × 1 = 2 × 1!

n! = n(n − 1)!

Thus:

1! = 1 × 0!

throw in some simple division:

1! / 1 = (1 × 0!) / 1

and thus:

1 = 0!

2. Calculus answer

The factorial function n! is generalized to the gamma function Γ (n + 1). R software has both a built-in factorial
function and a built-in gamma function: you can use both to see the relationship, for example:

gamma(6)

## [1] 120

factorial(5)

## [1] 120

gamma(1)

## [1] 1

factorial(0)

## [1] 1

The gamma function of n + 1 is defined as:

Γ(n + 1) = ∫₀^∞ tⁿ e⁻ᵗ dt

Plugging in n = 0, t⁰ = 1, and thus:

Γ(1) = ∫₀^∞ e⁻ᵗ dt = [−e⁻ᵗ]₀^∞ = −e^(−∞) + e⁰ = 0 + 1 = 1

4.12.2 Excerpts from Statistics for Everybody by D. Barch, reprinted with permission from
the author

4.12.2.1 The Law of Large Numbers and the Gambler’s Fallacy

The tendency to believe that event A is less likely to occur after a long run of event A occurring is known as the
Gambler’s Fallacy. That fallacy (a fallacy is an illogical line of reasoning) takes its name from the tendency of
gamblers to think that, after a long string of losses, they are due for a win. Well, they are due for wins – if they
continue to play the game forever (and have the means to do so) – but not on any one given trial. So, after ten
consecutive heads, would you put a lot of money on tails? You should not change your betting behavior in any way (if
anything, shouldn’t you suspect that the coin is messed up and bet on heads?).

4.12.2.2 The Linda Problem and the Conjunction Fallacy

In the early 1980s, cognitive psychologist Daniel Kahneman, working with his longtime collaborator Amos Tversky, handed a flyer to 88 undergraduates at the University of British Columbia (UBC). The flyer contained the following description of a fictional woman named Linda:

“Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was very
concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.”
(Tversky & Kahneman, 1983)

Below that passage was a list of 10 statements about Linda, including “Linda is a teacher in an elementary school” and “Linda works in a bookstore and takes Yoga classes,” and the students were asked to rank-order the probability of each of the statements. Tversky and Kahneman were really only interested in responses to three of the 10 statements:

Figure 4.27: A Linda. Not necessarily THE Linda, though.

1. Linda is a bank teller


2. Linda is active in the feminist movement
3. Linda is a bank teller and is active in the feminist movement.

As Tversky and Kahneman had hypothesized, the majority (85%) of students ranked “active in the feminist
movement” above “bank teller and is active in the feminist movement” and, most importantly, ranked both above
“bank teller.” The researchers pointed out that such a ranking was impossible according to the rules of intersection
probability. Let’s let “active in the feminist movement” be event A and “bank teller” be event B. The intersection (or
conjunction, as noted above in the definition for intersection probability) probability of events A and B is:

p(A ∩ B) = p(A)p(B|A)

Since probabilities are values between 0 and 1, there are no two probabilities that you can multiply together to get a
number greater than any one of those probabilities. Thus, it is always the case that:

p(A ∩ B) ≤ p(B), where p(B) ≤ p(A)

The only way that p(A ∩ B) can be equal to either p(A) or p(B) is if the probability of one or both of the events is equal to 1. This is known as the conjunction rule.

Tversky and Kahneman pointed out that by ranking the probability that Linda was a feminist bank teller higher than
the probability that Linda was simply a bank teller, the UBC undergraduates were violating the conjunction rule and
committing the conjunction fallacy: assessing the probability of the conjunction as greater than the probability of one
of the events. This, they argued, was irrational based on the rules of probability theory, and showed that people were
making judgments based on heuristics – quick rules of thumb that we may rely on in place of analytic reasoning.
Linda, Tversky and Kahneman argued, was so representative of a person that would be active in the feminist
movement that participants in the experiment thought, “well, she doesn’t seem like the bank teller type, but if she is,
surely she is active in the feminist movement in her spare time and is not just a bank teller.”

Since the seminal paper from Tversky and Kahneman, other research has been conducted to explain why people
commit the conjunction fallacy. The principal antagonist to the Tversky and Kahneman viewpoint has been Gerd
Gigerenzer, who has argued (Gigerenzer, 1996; Hertwig & Gigerenzer) that those same heuristics help us make quick
and relatively smart decisions about the world around us. Still, for our purposes, let’s keep in mind that, statistically
speaking, the conjunction rule should not be violated.

1. Event (or Outcome - we will use them interchangeably): A thing that happens. Roll two dice and they come up
7 –- that’s an event (or, an outcome). Flip a coin and it comes up heads – that’s an outcome (or, an event)↩

2. Sample space (symbolized by Ω or S): All possible outcomes of a defined experiment. For example, if our defined experiment is “one flip of a coin,” then the sample space is one heads and one tails, which we would symbolize as: Ω = {H, T}↩

3. Mutual Exclusivity: A condition by which two (or more) events cannot both (or all) occur. In one flip of a coin,
heads and tails are mutually exclusive events.↩

4. Intersection (∩): the co-occurence of events; the intersection probability of event A and event B is the
probability of A and B occurring.↩

5. Probability Symbols: p(x): the probability of x. ∩ (cap): and; intersection. ∪ (cup): or; union. ¬: not. |: given↩

6. Conditional Probability: a probability that depends on an event, a set of events, or any other restriction of the
sample space↩

7. Defined experiment: An occurrence or set of occurrences where events happen with some probability.↩

8. Please note that there are several ways to express probabilities – as fractions, as proportions, or as percentages
– and all are equally fine.↩

9. Trial: A single occurrence where an event can happen.↩


10. Sampling with replacement: The observation of events that do not change the sample space↩

11. Independent Events: Events where the probability of one event does not affect the probability of another.↩

12. The concepts of sampling with replacement and independent events are closely related: we can define
sampling with replacement as sampling in such a way that each sample is independent of the last.↩

13. Sampling without Replacement: The observation of events that change the sample space↩

14. Dependent Events: Events where the probability of one event affects the probability of another.↩

15. Sampling without replacement and dependent events are related in the same way that sampling with
replacement and independent events are: sampling without replacement implies dependence between events.↩

16. note that we cannot use the simplification p(B|A) = p(B) because these are not independent events↩

17. In general, the intersection probability of independent events is the product of their individual probabilities:

f or independent A, B, C :

p(A ∩ B ∩ C...) = p(A)p(B)p(C)...

18. There are going to be several gambling-related examples in our discussion of probability theory, and I really
don’t mean to encourage gambling. But much of our understanding of probability theory comes from people
trying to figure out how games of chance work (that and insurance). When we consider how much of the rest of
the field of statistics was derived to justify and promote bigotry, though, promoting gambling isn’t bad by
comparison. ↩

19. One can win smaller amounts of money for picking all four digits in any order, or for picking the first three
digits in the correct order, or for picking the last three digits in the correct order…it gets complicated. I know
all this because my maternal grandparents loved The Numbers Game (almost as much as they hated each other)
and their strategies were a frequent topic of conversation whenever I was at their house. Anyway, for simplicity's sake, let's just focus on the jackpot.↩

20. We know from studying probability theory that outcomes that seem more random are not more random at all:
“3984” and “7777” are precisely as likely as each other to happen. So why do numbers like “7777” or “1234”
stand out? It’s the cognitive illusion caused by salience: there are just a lot more number combinations that
don’t look cool than number combinations that do, and the cool ones stick out while all the uncool ones don’t
get remembered.↩

21. “A well-shuffled deck of ordinary playing cards” is a term that teachers are legally required (I think – don’t
google that) to use when teaching probability theory – it just means that each card in the deck has an equal (
1/52) chance of being dealt.↩

22. The value of card combinations in games like poker mostly track with the probabilities of those combinations –
for example, a pair of aces is more likely and also less valuable than three aces – and sometimes are based on
arbitrary rules to improve game play – for example, a pair of aces is more valuable but just as probable to get
as a pair of kings, but having relative ranks of cards limits ties and improves gameplay.↩

23. Union (∪): the total sum of the occurrence of events; the union probability of event A and event B is the
probability of A or B occurring.↩

24. The days and times that people are born are not really random but let’s temporarily assume they are for the sake
of this example.↩
25. The union probability of three mutually exclusive events p(A) + p(B) + p(C) uses the exact same equation,
it’s just that all of the intersection terms are 0.↩

26. As the most likely outcome, 7, in addition to being the expected value, is also the mode. And, since there are
precisely as many outcomes less than 7 as greater than 7, it’s also the median.↩

27. Only one of the three elements is named after an actual tree part, a fact that I never thought about before now but
it kind of bums me out.↩

28. for more on similar cognitive illusions regarding probability, see the bonus content below↩

29. Elementary Events: events that can only happen in a single way.↩

30. Composite (or Compound) Events: events that can be broken down into multiple outcomes; collections of
elementary events (note: in this context composite and compound are identical and will be used
interchangeably).↩

31. That is not to say that the probabilities of all elementary events in a sample space are always equal to each
other – for example, the probability of the elementary events me winning gold in men’s figure skating at the
next Winter Olympics and an actually good skater winning gold in men’s figure skating at the next Winter
Olympics are most certainly not the same – but in this case, they are.↩

32. This is a good time to point out that 0! = 1. You can live a long, healthy, and scientifically fruitful life just
taking my word for it on that, but for two (hopefully compelling) explanations please see the bonus content
down below.↩

33. Permutation: a combination with a specified order↩

34. Combinations: arrangements of items in no specified order↩

35. This was, verbatim, a question that I got right for my team at pub trivia mainly because I had just taught a lecture
on probability theory and the combination formula was fresh in my head. Also, unless a trip is going to be like
years long, anything more than one book is too ambitious for me↩

36. Odds: Relative probability ↩

37. Likelihood: a conditional probability↩

38. In common conversation, the terms probability, likelihood, and odds are used interchangeably and without too
much confusion. In statistics, the distinctions between the terms are meaningful: probability can refer to any
probability, while likelihood refers specifically to a conditional probability and odds refers specifically to a
relative probability.↩

39. Binomial Trial: a defined experiment with two possible outcomes; also known as a Bernoulli trial↩

40. Kernel Probability: The probability of an elementary outcome↩

41. Binomial trials are always based on the assumption of sampling with replacement.↩

42. The term inverse probability has been used – mostly in the past and originally derisively – for the kinds of
broader probability problems to which this problem is an analogy. If what is the probability that marble x is
drawn from jar y is a probability problem, then what is the probability that the jar is jar y given that marble
x was drawn from it is the inverse of that. But, today, such problems are more appropriately referred to as
Bayesian probability problems.↩

43. We could also find the intersection probabilities of Jar 1 ∩ Orange and Jar 2 ∩ Orange, but those aren't really that important for this problem right now.↩
44. Another way of saying the probability is three times greater is the odds in favor of having Jar 1 are 3:1, and
another way of saying that is the odds ratio for having Jar 1 is 3. But, we’ll get to that in the section on odds.↩

45. We could also go without the subscripts and just write

p(A) = 0.75,

keeping in mind the p(A) has been updated. Mathematical notation is supposed to clarify problems rather than
to make them more difficult, so whichever approach is more understandable for you is best.↩

46. We still don’t really care about p(Orange).↩

47. For this particular example, that step goes back precisely to where the probability was after the first draw of a
single blue marble, but that’s not necessarily the case: the probability will go down, but not necessarily by the
same amount that it went up before – that just happened because of the symmetry of 0.75 and 0.25 around 0.5.↩

48. Prior Probability: The probability of a hypothesis before being conditioned on a specified set of evidence.↩

49. Equal-assignment-based prior probabilities are fairly common and are known as flat priors↩

50. Posterior Probability: The probability of a hypothesis conditioned on a set of evidence.↩

51. Dear gambling aficionados: for the purposes of this example, we'll ignore all the side bets and all that.↩

52. Random walk model: a representation of a series of probabilistic shifts between states.↩

53. Stochastic: Random or describing a process that involves randomness↩

54. this is not an endorsement of this model on my part↩


5 Probability Distributions
5.1 Weirdly-shaped Jars and the Marbles Inside
In the page on probability theory, there is much discussion of the probability of drawing various marbles from various jars and a vague promise that learning about phenomena like drawing various marbles from various jars would be made broadly relevant to learning the statistical analyses that support scientific research.¹

In one such example, the question of the respective probabilities that a drawn
blue marble came from one of two jars (see Figure 1 below) was posed.

Figure 5.1: Figure 1: A Callback to Some Conditional Probability Problems

The probability that a blue marble was drawn from Jar 1 is 0.75 and the
probability that it was drawn from Jar 2 is 0.25: it is three times more probable
that the marble came from Jar 1 than from Jar 2.

Now, let’s say we have a jar with a more unusual shape, perhaps something like
this…

Figure 5.2: Figure 2. An Unusually-shaped Jar

…and the marbles we were interested in weren’t randomly mixed in the jar but
happened to sit in the upper corner of the jar in the space shaded in blue…
Figure 5.3: Figure 3. Areas of Residence for Interesting Marbles in An
Unusually-shaped Jar

…and somehow it were equally easy to reach every point in this jar (the analogy
is beginning to weaken). If you were to draw an orange marble from this strange
jar, you might think nothing of it. If you were to draw a blue marble, you might
think the idea that it came from this jar rather odd. You might even reach a
conclusion like the following:

The probability that the blue marble came from that mostly-orange jar is so
small that – while there is a small chance that I am wrong about this – I
reject the hypothesis that this blue marble came from this jar in favor of the
hypothesis that the blue marble came from some other jar.

If we turn that jar upside-down, and we put it on a coordinate plane, then we will
have something like this:
Figure 5.4: Figure 4. Metamorphosis of an Unusually-shaped Jar

Swap out the concept of drawing a marble for observing a sample mean, and
this is now a diagram of a one-tailed t test.

The above thought exercise was meant to be the statistical version of this:

This page provides some connection between the marbles and jars (and the coin
flips and the dice rolls and the draws of playing cards from well-shuffled decks)
and statistical analyses. Frequency distributions are tools that will help us
understand the events and sample spaces needed to make probabilistic inferences
for statistical analysis.

5.2 The Binomial Distribution


The binomial likelihood formula:

p(s|N = s + f, π) = [N! / (s! f!)] πˢ(1 − π)ᶠ

was derived in the page on Probability Theory. The binomial likelihood gives the
probability of s events of interest (which we have called successes) in N trials
where the probability of success on each trial is π and f represents the number of
trials where success does not occur (i.e., f = N − s).

For example, imagine that the manager of a supermarket wants to know what the
probability is that shoppers will choose an item placed in the middle of a display
(box B in Figure 5.5 below) of three items rather than the items on either side (A
or C ):

Figure 5.5: Item A, Item B, and Item C

If each item were equally likely to be chosen, then the probability of a person
choosing B is p(B) = 1/3, and the probability of a person choosing not B is
p(¬B) = p(A ∪ C) = 1/3 + 1/3 = 2/3. Suppose, then, that 10 people walk

by the display. What is the probability that 0 people choose item B? That is, what
is the probability that B occurs 0 times in 10 trials? According to the binomial
likelihood function:
p(0|π = 1/3, N = 10) = [10! / (0! 10!)] (1/3)⁰ (2/3)¹⁰ = 0.017

Were we so inclined – and we are – we could construct a table for each possible
number of s given π = 1/3 and N = 10:

s p(s|π, N )

0 0.01734
1 0.08671
2 0.19509
3 0.26012
4 0.22761
5 0.13656
6 0.05690
7 0.01626
8 0.00305
9 0.00034
10 0.00002
∑ 0 → 10 1

These eleven values represent a distribution of values, namely, a binomial


distribution. We can represent this distribution as a bar chart with each possible
value of s on the x-axis and the probability of each s on the y-axis:
Figure 5.6: Binomial Distribution for N = 10 and π = 1/3

5.2.1 Discrete Probability

The distribution of binomial probabilities listed in the table above has eleven distinct


values in it, and the visual representation of it in Figure 5.6 comprises eleven
distinct bars. The binomial distribution is an example of a discrete probability
distribution: each value is an expression of the probability of a discrete event.
There are no probability values listed for, say, 2.3 successes or 4.734 successes,
because those aren’t feasible events.

5.2.2 Features of the Binomial Distribution

Binomial distributions are positively skewed, like the one in Figure 5.6 above,
whenever π is between 0 and 0.5, negatively skewed whenever π is between 0.5 and 1, and symmetrical whenever π is exactly 0.5.² The effect of changing π is shown below in Figure 5.7. For each distribution represented in Figure 5.7, N = 20, and the only thing that changes is π.
Figure 5.7: The Binomial Distribution for N = 20 and Various Values of π

A coin flip is represented by a binomial distribution with π = 0.5 and N = 1. If


we define s as the coin landing heads, then there are two possible values of s – 0
and 1 – and each of those values has a probability of 0.5. If that distribution is
represented with an elegant figure such as Figure 5.8, it looks like two large and
indistinct blocks of probability. As N grows larger, the blocks become more
distinct as more events are possible and have an increasingly large variety of
likelihoods.
Figure 5.8: Binomial Distribution for a Coin Flip

Figure 5.9 shows a series of binomial distributions each with π = 0.5 and with
N ranging between 1 and 20.
Figure 5.9: Binomial Distributions for π = 0.5 and Various Values of N

As N gets bigger, the binomial distribution looks decreasingly like the blocky
figure representing a coin flip and increasingly like a curve: specifically like a
normal curve. That is not a coincidence. In fact, there are meaningful connections between the binomial distribution and the normal distribution, as discussed below.

5.2.3 Sufficient Statistics for the Binomial


In Figure 5.7 and Figure 5.9, the binomial changes as a function of π and of N ,
respectively. Those are the only two numbers that can affect the binomial
distribution and, conversely, they are the only two numbers needed to define a
binomial distribution. Thus, they are known as the sufficient statistics3 for the
binomial distribution. We can think of the sufficient statistics for a distribution as
the numbers we need to know (and the only numbers we need to know) to draw
an accurate figure of a distribution. For the binomial distribution, if we know N ,
then we know the range of the x-axis of the figure: it will go from s = 0 to
s = N . If we know π as well, then for each value of s on the x-axis, we can

calculate the height of each bar on the y-axis.

5.2.3.1 Expected Value of the Binomial Distribution

In probability theory, we noted that the expected value of a discrete probability


distribution is:
E(x) = ∑ᵢ₌₁ᴺ xᵢ p(xᵢ)

and the variance is:


V(x) = ∑ᵢ₌₁ᴺ (xᵢ − x̄)² p(xᵢ)

For the case of the binomial distribution, we showed that:


E(x) = ∑ᵢ₌₁ᴺ xᵢ p(xᵢ) = Nπ

It’s a little more complex to derive, but in the end, the variance of a binomial
distribution has a similarly simple form:
V(x) = ∑ᵢ₌₁ᴺ (xᵢ − x̄)² p(xᵢ) = Nπ(1 − π)

Thus, the mean of the binomial distribution is N π, and the variance is


N π(1 − π). Please note that those summary statistic formulae apply only for

binomial data – it’s just something special about the binomial. Please also note
that these equations will come in handy later.
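These formulas are easy to verify numerically in R with dbinom() (introduced below); a minimal sketch for N = 10 and π = 1/3 (the variable names are mine; pi_ avoids clobbering R's built-in constant pi):

N <- 10; pi_ <- 1/3
s <- 0:N
p <- dbinom(s, N, pi_)    # the eleven binomial probabilities

sum(s * p)                # E(x), computed directly from the definition

## [1] 3.333333

N * pi_                   # matches N * pi

## [1] 3.333333

sum((s - N * pi_)^2 * p)  # V(x), computed directly from the definition

## [1] 2.222222

N * pi_ * (1 - pi_)       # matches N * pi * (1 - pi)

## [1] 2.222222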

5.2.4 Cumulative Binomial Probability

A cumulative probability is a union probability for a set of events. For example,


the probability of rolling a 6 with a six-sided die is 1/6; the cumulative
probability of rolling 4, 5, or 6 is 3/6. Unless otherwise specified, the range of
the cumulative probability of s is the union probability of s or any outcome with
a value less than s. In the case of a discrete probability distribution such as the
binomial:

Cumulative p(s) = p(s) + p(s − 1) + p(s − 2) + ... + p(0)

That is, for a discrete probability distribution, we get the cumulative probability at event s by adding up the probabilities of s and of each possible smaller s. We
can replace each value in the earlier table – which showed the discrete probabilities for each possible value of s for a binomial distribution with N = 10 and π = 1/3 – with the cumulative probability for each value of s. In the table below, the cumulative probability for s = 0 equals the discrete probability for s = 0 because there are no smaller possible values of s than 0. The cumulative probability for s = 1 is equal to p(s = 1) + p(s = 0), the cumulative probability for s = 2 is equal to p(s = 2) + p(s = 1) + p(s = 0), etc., until the largest possible value of s – s = 10, where the cumulative probability is 1.⁴

s p(s ≤ s|π, N)

0 0.01734
1 0.10405
2 0.29914
3 0.55926
4 0.78687
5 0.92343
6 0.98033
7 0.99659
8 0.99964
9 0.99998
10 1.00000

Figure 5.10 is a chart of the cumulative binomial distribution for N = 10, π = 1/3:
Figure 5.10: Cumulative Binomial Distribution for N = 10 and π = 1/3

5.2.5 Finding Binomial Probabilities with R Commands


The binomial likelihood function is relatively easy to type into the R console or to include in R scripts, notebooks, and markdowns. We can translate the right-hand side of the binomial equation:

p(s|N, π) = [N! / (s! f!)] πˢ(1 − π)ᶠ

as:

(factorial(N)/(factorial(s)*factorial(f)))*(pi^s)*(1-pi)^f

replacing N, s, f, and pi with the values we need.

But R also has a built-in set of commands to work with several important
probability distributions, the binomial included. This set of commands is based
around a set of prefixes – indicating the feature of the distribution to be evaluated
– a set of roots – indicating the distribution in question – and parameters inside
of parentheses. There are a number of helpful guides to the built-in distribution
functions in R, and we will have plenty of practice with them.

To find a single binomial probability, we use the dbinom() function. d is the


prefix for probability density: we will describe the term “probability density” in
more detail in the section on the normal distribution, but for discrete probability
distributions like the binomial, the “density” is equal to the probability of any
given event s. binom is the function root meaning binomial (they’re usually pretty
well-named), and in the parentheses we put in s, N , and π.

For example, if we want to know p(s = 3|N = 10, π = 1/3), we can enter:
dbinom(3, 10, (1/3)) which will return 0.2601229.

To find a cumulative probability, we can either add several dbinom() commands


or simply use the pbinom() command, where the prefix p indicates a cumulative
probability. If we want to know p(s ≤ 3|N = 10, π = 1/3), we can enter:
pbinom(3, 10, (1/3)) which will return 0.5592643. By default, pbinom()
returns the probability of s or smaller. For the probability of events greater than
s, we enter the parameter lower.tail=FALSE to our pbinom() parentheses

(lower.tail=TRUE is the default, so if we leave it out then lower tail is


assumed). Thus, pbinom(3, 10, (1/3)) + pbinom(3, 10, (1/3),
lower.tail=FALSE) will return 1. If we want to know the probability of values
greater than or equal to 3 using pbinom() and lower.tail=FALSE, then we need to use s − 1,⁵ as in: p(s ≥ 3|N = 10, π = 1/3) = pbinom(2, 10, (1/3), lower.tail=FALSE).
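To consolidate those relationships, a few lines at the R console (a quick sketch; any of these can be rearranged the same way):

sum(dbinom(0:3, 10, (1/3)))  # cumulative probability as a sum of densities

## [1] 0.5592643

pbinom(3, 10, (1/3))         # the same cumulative probability, built-in

## [1] 0.5592643

pbinom(2, 10, (1/3), lower.tail=FALSE)  # p(s >= 3), via s - 1

## [1] 0.7008586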

5.3 The Normal Distribution


We can think of the normal distribution as what happens when the average event
is most likely to happen and the likelihoods of other events smoothly decrease as
they get farther from that average event. That is: the average is the peak
likelihood, and every event to the left or right is just a little bit less likely than the
one before.
As a rule, we’re not going to go very deep into Calculus in this course, but if you
are interested in the derivation of the functional form of the normal distribution,
this article is the easiest-to-follow description that I have encountered.

Here is the form of the function that describes the normal distribution:
f(x|μ, σ) = (1 / (σ√(2π))) e^(−(1/2)((x − μ)/σ)²)

As equations go, it’s kind of a lot. Please feel free not to memorize it: you won’t
need to produce it in this course, and when and if you need to use it in the future,
it comes up after a pretty quick internet search. But, while you’re here, I would
like to point out a couple of features of it:

1. The left side of the equation – f (x|μ, σ) – indicates that the normal
distribution is a function of x given μ and σ.

2. x, μ and σ are the only variables in the equation (π is the constant 3.1415…
and e is the constant 2.7183…).

Points (1) and (2) together mean that μ and σ (or σ² – it doesn't really matter because if you know one, then you know the other) are the sufficient statistics for
the normal distribution. All you need to know about the shape of a normal
distribution are those two statistics. You might ask: what about the skewness and
the kurtosis? It’s a good question and the answer is quasitautological6: if the
distribution has any skew and/or any excess kurtosis, then it’s not a normal
distribution by definition.

3. The equation describes the probability density at each point on the x-axis,
not the probability of any point on the x-axis.

The probability density is a measure of the relative likelihood of different values:


it shows, for example, the peak of the distribution at the mean, and the relatively
low likelihoods of the tails. However, at no point on the curve does f (x) (or, y)
represent a probability in the sense of the probability of an event.

Probability density is a little like physical density7. The density of an object is


the mass per unit of volume of that object: it’s an important characteristic for certain
applications (like, say, building a boat), but if you asked somebody how heavy an
object is and they replied with how dense it is, that wouldn’t help you very much.
Probability density is the probability per unit of x: it’s helpful for a couple of
applications but it doesn’t really help us know what the probability of x is. The
probability p(x) at any one given point is not equal to f (x) because of the next
point.
4. For any continuous variable x, the probability of any single value of x is 0.

This is one of the more counterintuitive facts about probability theory – I think
it’s counterintuitive because we seldom think about numeric values as area-less
points in a continuous range. Here’s a hopefully-helpful point (ha, ha) to
illustrate: please imagine that you are interested in finding the probability that a
person selected at random is 30 years old. Based on the proposition described in
the previous sentence, we wouldn’t naturally think that the probability is
extremely small or even zero: there are lots of people who are 30 years old.
Now imagine that you were interested in finding the probability that a person selected at random is 37 years, 239 days, 4 hours, 12 minutes, and 55.43218 seconds old: we would naturally think that that probability would be infinitesimally small, and as the age of interest got even more precise (i.e., as more and more digits got added to the right of the decimal on the number of seconds), the probability would shrink toward zero.

What’s the difference between those two questions? When we’re not in stats
class, the phrase 30 years old implies between 30 years and 30 years and
364.9999 days old: the implication is of a range. A range of values can have a
meaningful probability. A range of values has an area under the curve, which
corresponds to meaningful probability, sort of like how the physical density of an
object can be translated into a meaningful mass (or weight) when we know how
large the object is. If that’s not clear yet (or even if it is), an illustration of the
field of plane geometry might be helpful. In plane geometry, a point has no area.
A line has no area, either. But a shape does have an area:

Figure 5.11: A Point, a Line, and a Shape


Having established that there is no probability for a single value of a continuous
variable, let’s clarify one potentially-lingering question: why is it that, in the case of the binomial distribution, there are probabilities associated with single values? The answer is: the binomial distribution is a discrete probability
distribution, not a continuous probability distribution. When we find the
binomial likelihood of say, s = 4, we mean s = 4, not s = 4.0000000 or
s = 4.0000001. In terms of area, the bars in the visual representation of the

binomial distribution have width as well as height: the width of each bar
associated with a value of s is 1; so the area is equal to the height of each bar
times 1, which simplifies to the height of each bar. We will revisit that fact when
we discuss the connections between the normal and the binomial below.

Given that meaningful probabilities for continuous variables are associated with
areas, it is reasonable to infer that we should talk about how to find areas under
the normal curve, and therefore how to find probabilities. That is true, but there
is just one problem:

5. There is no equation for the area under the normal curve.

Typically, if we want to find the area under a curve, we would use integral calculus to come up with a formula. But the formula for a normal curve –

f(x|μ, σ) = (1/(σ√(2π))) e^(−(1/2)((x − μ)/σ)²)

– is one of those formulas that simply does not have a closed-form integral solution.

But that won’t stop us! Despite not being able to use a formula for the precise integral, we can approximate the area under a curve to a high level of precision. There are two tools available to us for finding areas: using software and calculating areas based on standardized scores.

For example, imagine a standardized test where the average score is 1000 and
the standard deviation of the scores is 100: the distribution of scores for this
hypothetical test is visualized in Figure 5.12. What is the probability that somebody who took the test, selected at random, scored between 1000 and 1100?
Figure 5.12: Distribution of Scores for a Hypothetical Standardized Test With
μ = 1000 and σ = 100

First, let’s do this with the benefit of modern technology. The absolute easiest
way to answer the question is to use an online calculator like this one, where you
can enter the mean and standard deviation of the distribution and the value(s) of
interest. Slightly more difficult – but I don’t think by much – is to use built-in R functions.

Recall from the discussion of using R’s built-in distribution functions for the
binomial that the cumulative probability for a distribution is given by the family
of commands with the prefix p. The root for the normal distribution is norm, so
we can use the pnorm() function to find probabilities of ranges of x values. The
pnorm() function takes the parameter values (x, mean, sd), where x is the
value of interest and mean and sd are the mean and standard deviation of the
distribution. As with the pbinom() command, the default value for pnorm() is lower.tail=TRUE, so without specifying lower.tail, pnorm() will return the cumulative probability of all values up to x.8 The mean of this distribution (as given in the description of the problem above) is 1000 and the standard deviation is 100.
Thus, to find the probability of finding an individual with a score between 1,000
and 1,100 where the scores are normally distributed with a mean of 1,000 and a
standard deviation of 100, we can use pnorm(1100, 1000, 100) to find the
cumulative probability of a score of 1,100 and pnorm(1000, 1000, 100) to
find the cumulative probability of a score of 1,000, and subtract the latter from
the former:

pnorm(1100, 1000, 100) - pnorm(1000, 1000, 100)

[1] 0.3413447

Figure 5.13 is a visual representation of the math:

Figure 5.13: Visual Representation of Finding the Area Between 1,000 and 1,100 Under a Normal Curve with μ = 1000 and σ = 100

The second way to handle this problem – the old-school way – is to solve it using what we know about the area under one specific normal curve: the standard normal distribution, which merits its own section header:

5.3.1 The Standard Normal Distribution


The standard normal distribution is the normal distribution with a mean of 0 and a standard deviation of 1 (the variance is also 1 because 1² = 1). Here it is right here:

Figure 5.14: The Standard Normal Distribution

The mean and the standard deviation of the standard normal distribution mean
that the x-axis for the standard normal represents precisely the number of
standard deviations from the mean of any value of x: x = 1 is 1 standard
deviation above the mean, x = −2.5 is 2.5 standard deviations below the
mean, etc. For a long time, the standard normal distribution was the only normal distribution for which areas under the curve were easily available. There are infinite possible normal distributions: one for each possible combination of infinite possible means and infinite possible standard deviations (all greater than 0, but still infinite), and where there is no formula to determine the area under a normal curve, it makes little sense to take the time to approximate the cumulative probabilities associated with any set of those infinite possibilities.
Instead, the cumulative probabilities associated with the standard normal
distribution were calculated9: if one wanted to know the area under the curve
between two points in a non-standard normal distribution, one needed only to
know what those two points represented in terms of distance from the mean of
that distribution in units of the standard deviation of that distribution. Then,
one could (and still can) consult the areas under the standard normal curve to
know the area between the x values corresponding to the numbers of standard
deviations from the mean. To use our example of standardized test scores –
where the mean of the scores is 1,000 and the standard deviation is 100 – a score
of 1,100 is one standard deviation greater than the mean, and a score of 1,000 is
0 standard deviations from the mean (it’s exactly the mean). For the standard normal distribution, the area under the curve for x < 1 – corresponding to a value 1 standard deviation greater than the mean – is 0.84, and the area under the curve for x < 0 – corresponding to a value at the mean – is 0.5. Thus, since the area under the standard normal curve for x = 0 → x = 1 is 0.84 − 0.5 = 0.34, the area under the normal curve with μ = 1000 and σ = 100 between the mean and a value 1 standard deviation above the mean is also 0.34.

5.3.1.1 z-scores

The value of knowing how many standard deviations from a mean a given value
lies in a normal distribution is so important that the distance between any x and
the mean in terms of standard deviations is given its own name: the z-score. A z-
score is precisely the number of standard deviations a value is from the mean of a
distribution. This is reflected in the z-score equation:
z = (x − μ)/σ

which is also written:

z = (x − x̄)/s

when we have the mean and standard deviation of a sample and not the entire
population. In other words, we take the difference between any value x of
interest and the mean of the distribution and divide that difference by the standard
deviation to know how many standard deviations away from the mean that x is.

The z-score is important for finding areas under the normal curve using tables
like this one, but by “important,” I mean “sort of important but much less
important in light of modern technology.” A key note on tables like the one
linked: to save space and improve readability, such tables usually don’t include
all z-scores and full cumulative probabilities, but rather z-scores greater than 0
and the cumulative probability between z and the mean and sometimes the
cumulative probability between z and the tail (which is to say, to infinity (but not
beyond)). To read those tables, one must keep two things in mind: (1) the normal
distribution is symmetric and (2) the cumulative probability of either half –below
the mean and above the mean – of the standard normal is 0.5 (and thus the total
area under the curve is 1).

For example – sticking yet again with the standardized test example – the z-score
for a test score of 1,100 is:

z = (x − μ)/σ = (1,100 − 1,000)/100 = 1

In the linked table, we can look up 1.00 in Column A. Column B has the Area to
mean for that z-score: it’s 0.3413. That means that – for this distribution –
34.13% of the distribution is between the mean and a z-score of 1, or, between
1,000 and 1,100. To find the cumulative probability of all values less than or
equal to 1,100, we have to add the area that lives under the curve on the other
side of the mean, which is 0.5. Thus, the cumulative probability of a score of
1,100 is 0.3413 + 0.5 = 0.8413. The cumulative probability of scores greater
than 1,100 is given in Column C: in this case, it’s 0.1587. Please note that for
each row in the table, the sum of the values in Column B and in Column C is 0.5
because the area under the curve on either side of the mean is 0.5. For that
reason, some tables don’t include both columns, because if you know one value,
you know the other.
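
We can check those table values against pnorm(), which defaults to the standard normal when we leave out the mean and standard deviation:

pnorm(1) - 0.5               # Column B: area from the mean to z = 1

[1] 0.3413447

pnorm(1, lower.tail=FALSE)   # Column C: area from z = 1 out to the tail

[1] 0.1586553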

Now, what if we were concerned with a score of 900 in a distribution with a


mean of 1,000 and a standard deviation of 100? The z-score would be:
z = (x − μ)/σ = (900 − 1,000)/100 = −1

The negative sign on the z-score indicates that the score is less than the mean.
The linked table is one of those that doesn’t include negative z-scores (as the
maker of the table: my deepest apologies). But that’s not a problem because the
normal distribution is symmetric: we check Column A for positive 1, and again
the area to the mean is 0.3413 and the area to the tail is 0.1587. In this case, the
area to the mean is the area under the curve from −1 up to the mean, and the area
in the tail is the cumulative probability between −1 down to negative infinity.
In addition to finding areas under curves – which as noted earlier isn’t quite as
important as it used to be given that we can use software to find those areas more
easily – the z-score is an important tool for making comparisons. Because the z-score measures how many standard deviations a value lies from the mean of its own distribution (in units of that distribution’s standard deviation), we can use z-scores to show how typical or unusual a value is relative to its peer values.

For example, imagine two Girl Scouts – Kimberly from Troop 37 and Susanna
from Troop 105 – who are selling cookies to raise funds to facilitate scouting-
type activities. Kimberly sells $200 worth of cookies and Susanna sells $175
worth of cookies. We might conclude that Kimberly was the better-performing
cookie-seller. However, perhaps Kimberly and Susanna are selling in different
markets – maybe Kimberly’s Troop 37 is in a much less-densely populated area
and she has fewer potential buyers. In that case, it would make sense to look at
the distribution of sales for each troop:

Troop Mean Sales SD of Sales


37 $250 $25
105 $150 $10

The z-score for Kimberly is ($200 − $250)/$25 = −2 and the z-score for Susanna is ($175 − $150)/$10 = 2.5. Thus, compared to their peers, Susanna outperformed the average by 2.5 standard deviations and Kimberly underperformed the average by 2 standard deviations.
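
In R, this is just arithmetic:

(200 - 250)/25   # Kimberly's z-score

[1] -2

(175 - 150)/10   # Susanna's z-score

[1] 2.5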

You may have noticed that while the sales numbers for the above examples – the
respective sales for Kimberly and Susanna, the means, and the standard
deviations – had dollar signs indicating the units of sales, the z-scores had no
units. In the z-score formula, any units that are used cancel out in the numerator
and in the denominator. Thus, we can make comparisons between values from
distributions with different units, for example: height and weight, manufacturing
output and currency value, heart rate and cholesterol level, etc., which really
comes in handy when analyzing data with correlation and regression.

5.3.2 Features of the Normal Distribution

As noted above, the normal distribution is always symmetric:


Figure 5.15: Symmetry of the Normal Distribution

The tails of the normal curve come really, really close to touching the x-axis but
they never touch it. The statistical implication of this is that although the
probability of observing values many standard deviations away from the mean
can be extremely small, it is never impossible: the normal distribution excludes
no values.
Figure 5.16: The Tails Don’t Touch the Axis

Figure 5.17: Pictured: Kevin Garnett, moments after the conclusion of the 2008
NBA Finals, describing the range of x values in a normal distribution.

The standard deviation – as alluded to in Categorizing and Summarizing


Information – is especially important to the normal distribution. It is (along with
the mean) one of the two variables in the equation used to draw the curve. It is
also the location of the inflection points of the curve: on either side of the mean, the x value that indicates one standard deviation away from the mean is the point where the curve goes from concave down to concave up. The mathematical aspect of that fact isn’t going to play too much of a role in this course, but it is good to know visually where 1SD and −1SD live on the curve.
Figure 5.18: Inflection Points of the Normal Curve

The 68-95-99.7 Rule is an old rule-of-thumb that tells us the approximate area
under the normal curve within one standard deviation of the mean, within two
standard deviations of the mean, and within three standard deviations of the
mean, respectively. With all we know about calculating areas under the curves,
the 68-95-99.7 rule is less important (we can always look those values up or
calculate them with software), but still helpful to know offhand.
Figure 5.19: The 68-95-99.7 Rule
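
A quick check of the rule with pnorm() (again defaulting to the standard normal):

pnorm(1) - pnorm(-1)

[1] 0.6826895

pnorm(2) - pnorm(-2)

[1] 0.9544997

pnorm(3) - pnorm(-3)

[1] 0.9973002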

5.3.3 The Cumulative Normal Distribution


As with the cumulative binomial distribution, the cumulative normal distribution
gives at each point the probability of the range of all values up to that point. Also
like the cumulative binomial distribution, the cumulative normal distribution
starts small and grows to 1 (thanks to the normalization axiom). Unlike the cumulative binomial, the cumulative normal is smooth and continuous. For example, Figure 5.20 is a visual representation of the cumulative standard normal
distribution. For the case of the cumulative standard normal, please note that the
values on the y axis correspond to the areas under the curve for values of z.

Figure 5.20: The Cumulative Standard Normal Distribution

5.3.4 Percentiles with the Normal Distribution


Percentiles are equal-area divisions of distributions where each division is
equal to 1/100 of the total distribution (for more, see Categorizing and
Summarizing Information). For normally-distributed data, the percentile
associated with a given value of x is equal to the cumulative probability of x,
multiplied by 100 to put the number in percent form.

For example, the average height of adult US residents assigned male at birth is 70
inches, with a standard deviation of 3 inches, and the distribution of heights is
normal. If we wanted to know the percentile of height among adult US residents
assigned male at birth for somebody who is 6’2 – or 74 inches – tall, we would
take the cumulative probability of 74 in a normal distribution with a mean of 70
and a standard deviation of 3 either by using software:

pnorm(74, 70, 3)

[1] 0.9087888

Or by finding the z-score and using a table:

z₇₄ = (74 − 70)/3 = 1.33

Area to mean(1.33) = 0.4082

Total area = Area to mean + Area below the mean = 0.4082 + 0.5 = 0.9082

Then, we multiply the area by 100 and round to the nearest whole number (these
things are typically rounded) to get 91: an adult US resident assigned male at
birth who is 6’2" is in the 91st percentile for height. They are taller than 91% of
people in their demographic category.
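
The q-prefixed commands (for quantile – we will see them again in the section on interval estimates below) invert this process: given a cumulative probability, they return the corresponding value of x. A quick sketch:

qnorm(0.9087888, 70, 3)   # essentially 74: inverting the pnorm() result above
qnorm(0.91, 70, 3)        # approximately 74.02, because 91% is a rounded value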

5.3.5 The Connections Between the Normal and the Binomial

As noted earlier, the normal distribution and the binomial distribution are closely
related. In fact, the limit of the binomial distribution as N goes to infinity is the
normal distribution. But, before we get to infinity, the normal distribution
approximates the binomial distribution. The rule-of-thumb is that the normal
approximation to the binomial can be used when Nπ > 5 and N(1 − π) > 5: that is, the binomial distribution looks enough like the normal given some combination of large N and π in the middle of the range (small N makes the binomial blocky; π near 0 or 1 makes the binomial more skewed).
Figure 5.21: Illustration of the Normal Approximation to the Binomial

We know that the mean of a binomial distribution is given by N π, that the


variance is given by N π(1 − π), and that the standard deviation is given by
√ N π(1 − π). When we use the normal approximation to the binomial, we treat

the binomial distribution as a normal distribution with those values for the mean
and standard deviation.

For example, suppose we wanted to know the probability of flipping 50 heads in


100 flips of a fair coin (as represented in Figure 5.21 above). We could simply use software to compute the binomial probability10, or, we could treat s = 50 as a value under the curve of a normal distribution with a mean of Nπ = 50 and a standard deviation of √(Nπ(1 − π)) = 5. Except: there is no probability for a
single value in a continuous distribution like the normal as there is for a discrete
distribution like the binomial. Instead, we must apply a range to the value of
interest: we will give the range a width of 1 to represent that we are looking for
the probability of 1 discrete value and we will use s – in this case s = 50 – as
the midpoint of that range. Thus, to find the approximate binomial probability for
s = 50|N = 100, π = 0.5, we will find the area under the normal curve

between x = 49.5 and x = 50.5:


p(s = 50|N = 100, π = 0.5) ≈ Area(z_50.5) − Area(z_49.5)

= Area((50.5 − 50)/5) − Area((49.5 − 50)/5) = 0.0797.

The exact probability for 50 heads in 100 flips is dbinom(50, 100, 0.5) = 0.0796, so this was a pretty good approximation.
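
Here is the same comparison done with software: the first command is the normal approximation (with the 0.5-wide range on either side of 50), and the second is the exact binomial probability:

pnorm(50.5, 50, 5) - pnorm(49.5, 50, 5)   # normal approximation

[1] 0.07965567

dbinom(50, 100, 0.5)                      # exact binomial probability

[1] 0.07958924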

5.4 The t Distribution


The t distribution describes fewer variables than the normal or the binomial, but what it does describe is essential to statistics in the behavioral sciences.
The t distribution is a model for the distribution of sample means in terms of
standard errors of the mean.

A sample mean is, as noted in Categorizing and Summarizing Information, the


mean of a sample of scores. Any time a psychologist runs an experiment and
takes a mean value of the dependent variable for a group of participants, that is a
sample mean. Any time an ecologist takes soil samples and measures the average
concentration of a mineral in the samples, that is a sample mean.

Sample means have a remarkable property: if you take enough sample means,
then the distribution of those sample means will have a mean equal to the mean of
the population that the data were sampled from and a variance equal to the
variance of the original population divided by the size of the samples. That
property is codified in the central limit theorem.

5.4.1 The Central Limit Theorem

The central limit theorem is symbolically represented by the equation:

x̄ ∼ N(μ, σ²/n)

x̄ represents sample means. The tilde (∼) in this case stands for the phrase is distributed as. The N means a normal distribution with its mean and variance indicated in the parentheses to follow. μ is the mean of the population the samples came from, σ² is the variance of the population the samples came from, and n is the number of observations per sample. Put together, the central limit theorem says:

As the size of the samples becomes large, sample means are distributed as a normal with the mean equal to the population mean and the variance equal to the population variance divided by the number of observations in each sample.

And here’s a really wild thing about the central limit theorem:

Regardless of the shape of the parent distribution, the sample means will arrange themselves symmetrically. For positively-skewed distributions, there will be lots of samples from the heavy part of the curve but the values of those samples will be small; there will be few samples from the light part of the curve but the values of those samples will be large. It all evens out eventually.
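
Here is a minimal simulation sketch of the theorem in R. The exponential distribution (population mean 1, population variance 1) stands in as an arbitrary positively-skewed parent; the seed is arbitrary:

set.seed(37)
sample.means <- replicate(10000, mean(rexp(30, rate=1)))

mean(sample.means)   # should be close to the population mean, 1
var(sample.means)    # should be close to sigma^2/n = 1/30 = 0.0333
hist(sample.means)   # roughly symmetric, despite the skewed parent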

In a scientific investigation that collects samples, it’s important to know which


samples are ordinary and which are extreme (this will be covered in greater
detail in Differences Between Two Things). Just as there are infinite possible
normal distributions, there are infinite possible distributions of sample means.
Likewise, as we got the Standard Normal Distribution by dividing differences of
values from the mean by the standard deviation, we get the t distribution by
dividing the difference between sample means and the mean of the parent
distribution by the standard error (the square root of the sample mean variance σ²/n):

t = (x̄ − μ)/√(σ²/n) = (x̄ − μ)/(σ/√n)

The shape of the t distribution – the model for the distribution of the t statistic –
is given by the following formula:
f(t) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] (1 + t²/ν)^(−(ν + 1)/2)
which, super-yikes, but it does show us that there is only one variable other than t in the equation: ν, also known as degrees of freedom (df). That means that the df is the only sufficient statistic for the t distribution.11 The mean of the t distribution is 0, and the variance of the t is undefined for df ≤ 1, infinite for 1 < df ≤ 2, and df/(df − 2) for df > 2.

The df for a sample mean with sample size n is n − 1. An important caveat (which, admittedly, I glossed over in the text above) is that the distribution of sample means approaches a normal distribution as the size of the samples n gets large. “Large” is a relative term: there is no magical point12, but when df gets to around
100, the t distribution is almost exactly the same as a normal distribution. For
smaller samples, the distribution of t is much more kurtotic, reflecting the fact
that it is more likely to get extremely small or large sample means for samples
comprising fewer observations. Figure 5.22 shows how the t distribution changes
as a function of df .
Figure 5.22: The t Distribution for Selected Values of df
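
One way to see the heavier tails numerically is to compare upper-tail areas beyond 1.96 – the value that cuts off 2.5% in the normal – using pt(), the t counterpart to pnorm():

pt(1.96, 5, lower.tail=FALSE)     # about 0.054 with df = 5
pt(1.96, 30, lower.tail=FALSE)    # about 0.030 with df = 30
pt(1.96, 100, lower.tail=FALSE)   # about 0.026 with df = 100
pnorm(1.96, lower.tail=FALSE)     # 0.025 for the normal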

5.5 The χ² Distribution

The χ² distribution is used to model squared differences from means. Figure 5.23
provides an illustration: the distribution of deviations from the mean of a


standard normal distribution – which is just a standard normal distribution
because the mean of the standard normal is 0 – is shown in the upper-left plot.
The upper-right plot is the distribution of a set of the deviations from the mean of
a standard normal squared (that is, taking each number in a standard normal,
subtracting the mean, and squaring the result). The bottom-left plot is the
distribution of the sum of two squared deviations from a normal (taking two
numbers from a standard normal, subtracting the mean from each, squaring the
result for each, and adding them together), and the lower-right plot is the
distribution of the sum of three squared deviations from the mean.
Figure 5.23: A Standard Normal Distribution and its Squared Deviations

The three plots based on squared deviations can be modeled by χ² distributions with 1, 2, and 3 df, respectively. The degrees of freedom we cite when talking about shapes of the χ² distribution are based on the same idea as the degrees of
freedom we cite when talking about shapes of the t distribution: both are
representations of things that are free to vary. When we are talking about sample means, the things that are free to vary are the observed values in a sample: if we know the sample mean and we know all values of the sample but one, then we know the last value; thus n − 1 values in the sample are free to vary but the last one is fixed. When we are talking about the χ² distribution, we are talking about degrees of freedom with regard to group membership: if, for example, we have two groups and people have freely sorted themselves into one group, then we know what group the remainder of the people must be in (the other one), so the degrees of freedom for groups in that example is 1. Much more on that concept to come in Differences Between Two Things.

The equation for the probability density of the χ² distribution is:

f(x; ν) = x^(ν/2 − 1) e^(−x/2) / (2^(ν/2) Γ(ν/2)) if x > 0, and f(x; ν) = 0 if x ≤ 0

Figure 5.24: χ² Distributions for Various df

Like the t distribution, the only sufficient statistic for the χ² distribution is ν = df. The mean of the χ² distribution is equal to its df, and the variance of the χ² distribution is 2df.

Because the χ² distribution is itself based on squared deviations, it is the ideal distribution for statistical analyses of squared deviation values. Two of the most important statistics that the χ² distribution models are:

1. Variances. Recall from Categorizing and Summarizing Information that the equation for a population variance is ∑(x − μ)²/N and the equation for a sample variance is ∑(x − x̄)²/(n − 1). In both equations, the numerator is a sum of squared differences from a mean value. Thus, the variance is a statistic that is modeled by a χ² distribution.

2. The χ² statistic. Seems kind of obvious that the χ² statistic is modeled by the χ² distribution, doesn’t it? In the χ² tests – the χ² test of statistical independence and the χ² goodness-of-fit test, both of which we will be discussing at length later – the test statistic is:

χ²_obs = ∑(f_o − f_e)²/f_e

where f_o is an observed frequency and f_e is an expected frequency. Again, we’ll go over the details of those tests later, but look at that numerator! It’s a squared difference from an expected value, and that is the χ²’s time to shine, baby!
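
If you’d like to see the connection for yourself, here is a minimal simulation sketch (the seed is arbitrary): sum squared deviations drawn from a standard normal and compare the result to the χ² model with 2 df:

set.seed(105)
x <- replicate(10000, sum(rnorm(2)^2))   # sums of 2 squared standard-normal deviations

mean(x)        # should be close to df = 2
var(x)         # should be close to 2*df = 4
mean(x <= 3)   # empirical cumulative probability at 3
pchisq(3, 2)   # theoretical value: 0.7768698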

5.6 Other Probability Distributions


5.6.1 Uniform distribution

A uniform distribution is any distribution where the probability of all things is


equal. A uniform distribution can be discrete, as in the distribution of outcomes
of a roll of a six-sided die:
Figure 5.25: A Discrete Uniform Distribution

or it can be continuous, as in the distribution of times in an hour:


Figure 5.26: A Continuous Uniform Distribution

5.6.2 The β Distribution


The beta distribution describes variables in the [0, 1] range. It is particularly well-suited for modeling proportions. In terms of the binomial distribution that we have been using, the mean value for a proportion – symbolized π̂ – is equal to s/N. We can also assess the distribution of a proportion based on observed data13 in terms of the β distribution.

For proportions, the β distribution is directly related to the observed successes (s) and failures (f):

mean(β) = (s + 1)/(s + f + 2)

variance(β) = π̂(1 − π̂)/(s + f + 3) = (s + 1)(f + 1)/((s + f + 2)²(s + f + 3))

Figure 5.27 is a visualization of a β distribution where s = 4 and f = 5. The β


can be skewed positively (when s < f ), can be skewed negatively (when s > f
), or be symmetric (when s ≈ f ).
Figure 5.27: A β Distribution with s = 4, f = 5
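
In R, the root for the β distribution is beta, and (per the shape parameters above) we pass s + 1 and f + 1. A quick sketch for the distribution in Figure 5.27:

s <- 4
f <- 5
(s + 1)/(s + f + 2)        # mean of the beta

[1] 0.4545455

pbeta(0.5, s + 1, f + 1)   # p(proportion < 0.5) under this distribution

[1] 0.6230469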

5.6.2.1 The Dirichlet Distribution

The Dirichlet Distribution is to multinomial data what the β distribution is to


binomial data: in terms of visualization, we can think of it as a “3D β.” It may or
may not come up again but it is a good excuse to include this cool plot:
Figure 5.28: Dirichlet Distribution with Shape Parameters 3, 5, 12

5.6.3 The Logistic Distribution


The Logistic Distribution is a model used for probabilities of events. For
example, Figure 5.29 represents the number of points scored in NBA games on
the x axis and the proportion of times that the team scoring that number of points
won the game (the data are restricted only to home teams: otherwise pairs in the
data would be closely related to each other):
Figure 5.29: Relative Probabilities of Binary Events, Modelable by a Logistic
Distribution.

The relative frequency of wins is low for teams that score fewer than 75 points,
rises for teams that score between 75 and 125 points, and at 150 points, victory
is almost certain. This type of curve is well-modeled by the logistic distribution:
Figure 5.30: The Logistic Distribution

The logistic distribution closely resembles the cumulative normal distribution.


Both distributions can be used to model the probability of binary outcomes (like
winning or losing a basketball game) based on predictors (like number of points
scored) and both are used for that purpose. However, the logistic distribution has
some attractive features when used in regression modeling that lead many –
myself included – to prefer it.
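
For a rough feel in R: plogis() is the logistic cumulative distribution function. The location (100) and scale (12) below are made-up values for illustration only, not parameters fitted to the NBA data in Figure 5.29:

plogis(c(75, 100, 125, 150), location=100, scale=12)
# roughly 0.11, 0.50, 0.89, and 0.98: low, toss-up, likely, near-certain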

5.6.4 The Poisson Distribution


The Poisson Distribution models events with predominantly small counts. The
classic example of the Poisson Distribution being used to model events is
Ladislaus Bortkiewicz’s 1898 book The Law of Small Numbers, in which he
described the (small) numbers of members of the Prussian Army who had been
killed as a result of horse and/or mule kicks.

It may resemble the Binomial Distribution, and that is no coincidence: the Poisson is a limiting case of the class of distributions known as the Negative Binomial, which is the distribution of the probability of a number of successes (s) occurring before a specified number of failures (f) occur.
Figure 5.31: The Poisson Distribution

5.7 Interval Estimates


Interval estimates are important tools in both frequentist statistics (as in
confidence intervals) and Bayesian statistics (as in credible intervals or highest
density intervals). For continuous distributions, interval estimates are areas
under the curve.

Figure 5.32 is a visualization of interval estimates for the central 95% under a standard normal distribution, a beta distribution with s = 4 and f = 1, a χ² distribution with df = 3, and a t distribution with df = 10.


Figure 5.32: Visual Representations of 95% Intervals for Selected Probability
Distributions

To find the central 95% interval under each curve, we can use the distribution
commands in R with the prefix q for quantile. In each q command, we enter the cumulative probability of interest plus the sufficient statistics for each distribution (we can leave out the mean and sd for the normal command because the default is the standard normal).

Distribution       Lower Limit Command   Lower Limit Value   Upper Limit Command   Upper Limit Value
Normal             qnorm(0.025)          -1.9600             qnorm(0.975)          1.9600
β(s = 4, f = 1)    qbeta(0.025, 5, 2)    0.3588              qbeta(0.975, 5, 2)    0.9567
χ²(df = 3)         qchisq(0.025, 3)      0.2158              qchisq(0.975, 3)      9.3484
t(df = 10)         qt(0.025, 10)         -2.2281             qt(0.975, 10)         2.2281

1. Unless your research happens to be about marbles, the jars in which they
reside, and the effect of removing them from their habitat, in which case:
you’ve learned all you need to know, and you’re welcome.↩
2. If π is exactly 0 or exactly 1, there really is no distribution: you would
either have no s or all s.↩

3. Sufficient Statistics: the complete set of statistical values needed to define


a probability distribution↩

4. The maximum cumulative probability for any distribution is 1.↩

5. The fact that you have to use s − 1 for cumulative probabilities of all values
greater than s is super-confusing and takes a lot of getting used to.↩

6. Strong chance I just made up the word “quasitautological” but let’s go with
it.↩

7. Between this and the method of moments, it should be apparent that early statisticians were super into physics and physics-based analogies.↩

8. Unlike the pbinom() command, we don’t have to change x when we go


from lower.tail=TRUE to lower.tail=FALSE because p(x) = 0 for the
continuous distributions and so the distinction between > x and ≥ x is
meaningless.↩

9. As with any mention of R.A. Fisher, it feels like a good idea to remind ourselves
that Fisher was a total dick.↩

10. Well, now we can compute it. The normal approximation was a bigger deal
when computers weren’t as good and using formulas with numbers like 100!
was impossible to do with calculators.↩

11. There are variations on the t that have other parameters. Let’s not worry
about them for now.↩

12. Fisher, who again, was a total dick, proposed that samples of at least 30
could be considered to have their means distributed normally, and statistics
texts since then have reported 30 as the magic number. It’s just an
approximate rule-of-thumb designed for a world without calculators or
computers, and should be taken as a guideline but not a rule.↩

13. Well, we can if we are running Bayesian analyses. Frequentist analyses


don’t do that kind of thing – we’ll get into it later.↩
6 Classical and Bayesian Inference

Figure 6.1: Frequentists vs. Bayesians

Let’s carefully examine a fake experiment:

A scientist is interested in the rate at which radioactive ooze turns ordinary


turtles into mutant ninja turtles (the age of the turtles is irrelevant but if
you must know they are between 13 and 19 years old).

20 turtles are exposed to radioactive ooze.

16 of them show signs of mutation (the fun kind not the sad kind).
Figure 6.2: Experimental Results (n = 4)

So, we have a proportion with s = 16 (let’s call mutation a “success”) and


N = 20. Let’s call that proportion π̂:1

π̂ = 16/20 = 0.8

If we were to mutate 16 turtles into pizza-loving ninjas, that would be


scientifically noteworthy in and of itself. But, let’s suppose it is not
necessarily the mutation itself that we are interested in but the rate of
mutations. As noted in Categorizing and Summarizing Information, scientific
inquiry is in most cases about generalizing the results from a sample to a
population. In that light, what we would want to infer from these
experimental results is the population-level rate of turtle mutation.

And what we can say about the population-level rate is at the heart of the fight
between proponents of relative-frequency-based statistical inference and
proponents of subjective-probability-based statistics – or, more simply,
between Classical Statisticians and Bayesian Statisticians.
Figure 6.3: Artist’s Rendering

6.0.1 Different Approaches to Analyzing the Same Data


If the results of an experiment suggest that, for the particular participants and under the specific conditions of the experiment, the rate is 80% (as in the example of the turtles), there are a few ways to interpret that:

1. The true rate is exactly 80%: no further analysis needed.2

This is both one of the simplest and one of the least believable inferences we
could make. It is simple because it technically does precisely what we are
trying to do with scientific experimentation – to study a sample and generalize
to the population – in the most straightforward possible way. However, this
approach would fail to withstand basic scientific scrutiny because (among
other reasons, but mainly) the sample of participants is highly unlikely to be
representative of the entire population. And, even if the sample were
somehow perfectly representative, it would also fail from the perspective of
probability theory. Consider, yet again, a coin flip.3 The probability of
flipping a perfectly fair coin 20 times and getting other than exactly 10 heads
is:

p(s ≠ 10|N = 20, π = 0.5) = 1 − (20!/(10! 10!))(0.5)^10 (0.5)^10 = 0.8238029

so there is an approximately 82.4% chance that in a study of 20 flips of a fair


coin, the result by itself will imply that the probability of heads is not 50%.
We know that wouldn’t be true: if we got 8 heads in 20 flips of a coin, we
would be wrong to think there is something unusual about the coin. Likewise,
the results of one experiment with N = 20 trials should not convince us that
the true rate is exactly 80%. Both of these problems – a lack of
representativeness and relatively high statistical uncertainty – can be
somewhat ameliorated by increasing the sample size, but neither can be
completely eliminated short of testing the entire population at all times in
history. Thus, approach #1 is out.
2. Based on our experiment, the true rate is probably somewhere around
80%, give or take some percentage based on the level of certainty
afforded by the size of our sample.

The second approach takes the fact that a rate 80% was observed in one
sample of size N = 20 and extrapolates that information based on what we
know about probability theory to make statements about the true rate. In this
specific case, we could note that 80% is a proportion, and that proportions
can be modeled using the β distribution. We can use the observed s and f –
16 and 4, respectively – as the sufficient statistics, also known in the case of
the β (and other distributions) as shape parameters, for a beta distribution.
The shape parameters for a β distribution are s + 1 and f + 1, also known as
α and β where α = s + 1 and β = f + 1.4 Based on our experiment, we would expect this to be the probability density of proportions:

Figure 6.4: A β distribution with s + 1 = 17 and f + 1 = 5

Based on the β distribution depicted in Figure 6.4, we can determine that the
probability that the true rate is exactly 80% is 0 and this approach is useless.
No! That’s just a funny funny joke about the probability of any one specific
value in a continuous probability distribution. What fun!

Seriously, folks, based on the β distribution depicted in Figure 6.4, we can


determine the following probabilities of the true rate π:

p(0.79 < π < 0.81) = pbeta(0.81, 17, 5) - pbeta(0.79, 17, 5)


= 0.0914531

p(0.70 < π < 0.9) = pbeta(0.9, 17, 5) - pbeta(0.7, 17, 5)


= 0.7494661

p(0.65 < π < 0.95) = pbeta(0.95, 17, 5) - pbeta(0.65, 17, 5)


= 0.9044005

p(π > 0.5) = pbeta(0.5, 17, 5, lower.tail=FALSE) = 0.9964013

Thus, we would be about 90% sure that the true value of π is within 0.15 of
0.8, and over 99.5% sure that the true value of π is greater than 0.5.

We could also use Bayes’s Theorem to determine various values of p(H|D) – in this case, p(π|Data) – and would get the same results as using the β.
Approach #2 is the Bayesian Approach.

3. The true rate is a single value that cannot be known without measuring
the entire population: there is a chance that our results behaved
according to the true rate and a chance that our results were weird.
Therefore, the results of our experiment cannot prove or disprove
anything about the true rate. We can rule out – with some uncertainty –
some possibilities for the true rate if our results are unlikely to have been
observed given those possible true rates.

In this approach, the true rate is unknowable based on the results of a single
experiment, so we can’t make any direct statements about it. Instead, the
scientific question focuses on the probability of the data rather than the
probability of the hypothesis. This approach starts by establishing what the
expectations of the results would be in a scenario where nothing special at
all is happening. If the experiment results in a situation where anything at least
If the experiment produces observations that – together with anything even less likely – are extremely unlikely to have happened given our hypothetical nothing-special-at-all assumption, then we have evidence to question the nothing special assumption,5 and support for the idea that

Going back to our example: we might reasonably assert a state of affairs in which the radioactive ooze is just as likely to cause a mutation as not to cause a mutation. In that case, p(mutate) = p(¬mutate) = 0.5:
essentially a coin flip. As above, we collect the data and find that 16 out of
the 20 turtles show signs of mutation. Given our assumption that the rate is
0.5, then the probability of observing 16 successes in 20 trials is:

p(s = 16|N = 20, π = 0.5) = (20!/(16! 4!))(0.5)^16 (0.5)^4 = 0.004620552.

That’s pretty unlikely! But, it can be misleading: for one, the probability of any specific s is pretty small – and vanishingly small for increasing N – and it’s not as if observing 17, 18, 19, or 20 mutations instead would change our scientific theories about the experiment. So, we are instead going
to be interested in the probability of the number of observed s or more given
our assumption that π = 0.5:

p(s ≥ 16|N = 20, π = 0.5) = 0.005909

which is still super-unlikely. It is unlikely enough that we would reject the


notion that the rate of turtle-mutation following exposure to the ooze is 50%
(or, by extension, less than 50%) – the results are too unlikely to support that
idea – in favor of a notion that the true rate is somewhere greater than 0.5.
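
In R (note the s − 1 paired with lower.tail=FALSE, as discussed in the section on the binomial distribution):

pbinom(15, 20, 0.5, lower.tail=FALSE)

[1] 0.005908966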

How unlikely is unlikely enough to reject the notion of nothing-special-at-all


going on? That depends on how often we are willing to be wrong. If we’re ok
with being wrong on 5% of our claims, then any result with a cumulative
probability of at least the observation happening less than or equal to 5% is
unlikely enough. If we’re ok with being wrong on 1% of our claims, then any
result with a cumulative probability less than or equal to 1% is unlikely
enough. Basically, if we say “this didn’t happen randomly” every time an
event has a cumulative probability less than x%, then it will have happened
randomly x% of the time and we will have been wrong in our claim of
nonrandomness.
This third approach is the most popular in applied statistics. It is the Classical
Null Hypothesis Testing Approach.

4. We cannot make statements about the true rate, but based on the observed
results and what we know about the distribution of the data (in this case,
the binomial distribution), we can estimate an interval for which a given
proportion of theoretical future samples will produce similar interval
estimates that contain the true rate (whatever it may be).

As in approach #3, we assume that the true rate is a fixed value that cannot
vary and cannot be known without measuring the entire population at all
points in time. But, given that we observed 16 successes in 20 trials, we can
make some statements about what future replications of the experiments will
show. For example, we know that the true rate is not 0 or 1: otherwise, all of
our trials would have resulted in failure (if π = 0) or in success (if π = 1).

Let’s say, for example, that we want to produce an interval such that 95% of
future repetitions of our experiment will produce 95% intervals that include
the true rate (whatever it is). The lowest possible value of π in that interval is
going to be the value that leads to the cumulative likelihood of the observed s
or greater being 2.5%, and the highest possible value of π is going to be the
value that leads to the cumulative likelihood of the observed s or less being
2.5%. Thus, the cumulative probability of the data in the middle is 1 − 2.5% − 2.5% = 95%.

For N > 20, there are algebraic methods for estimating the width of this interval,6 but if software is available, the best method regardless of the size
of N is known as the Clopper-Pearson Exact Method. Without software, the
Exact Method is an interminable ordeal of trial-and-error to get the lower and
upper values of π, but it’s easy enough with software:

library(binom)
binom.confint(16, 20, 0.95, methods="exact")

## method x n mean lower upper


## 1 exact 16 20 0.8 0.563386 0.942666

The lower limit of the interval is 0.563386, and the cumulative probability of
s ≥ 16 is:
pbinom(15, 20, 0.563386, lower.tail=FALSE) = 0.025

The upper limit of the interval is 0.942666, and the cumulative probability of
s ≤ 16 is:

pbinom(16, 20, 0.942666, lower.tail=TRUE) = 0.025

Thus, the math checks out.

The interval estimate created by this method is known as the Confidence


Interval. It’s similar to intervals estimated using the Bayesian approach, except that the interpretation is different due to the philosophical objection (which it shares with the Classical Null Hypothesis Testing approach) against making probabilistic statements about the true rate. The confidence interval is not a range in which one is x% confident that the true rate lies: that would imply a degree of belief about the value of the true rate. Instead, the confidence interval7 is a statement about the ranges produced by future estimates based on repeated use of the same scientific algorithm. And if that sounds overly complicated and/or semantically twisted, that’s because it kind of is.

The confidence interval approach is, like Null Hypothesis Testing, a Classical
approach.

6.0.2 The Essential Difference


The key difference between Bayesian Statistics and Classical Statistics (both
Null Hypothesis Testing and Confidence Intervals) that informs all of the
analyses used by both camps and the math involved in both lies in the
treatment of what was referred to as the true rate in the specific example
above and is more generally known as the population parameter. Population
parameters can be descriptions of rates, but also can be things like differences
between means, or variances, or regression coefficients.

The classical (or relative frequentist, or simply frequentist) theorists of the


late 19th and early 20th centuries – who almost to a man held and promoted
horrible beliefs about humanity8 – collectively objected to the idea that a
parameter could be described in any probabilistic way. And that is a fair
point! A parameter is a fixed value that could only be known by measuring the
entire population at all points in time.

Bayesians believe that population parameters can be described using


probability distributions: it’s not so much that they think that population
parameters vary back and forth along curves as that it is reasonable to express
a degree of belief about a parameter or a range of parameter values, and that
belief can be represented by a distribution. The paradox of describing a fixed
state with a probability distribution was explained well after the
establishment of classical statistics by Bruno de Finetti (the fascist from the
sidenote). Please imagine that somebody asked you about some minor thing in
your life that you either did or did not do, for example: whether you
remembered to lock your front door last Tuesday, and they asked you to put a
probability value on whether you did it or not. Even though it is essentially a
yes or no question, we tend to be ok with assigning a probability value to it
based on things like what we know about our tendencies and what the context
was on a given day. That is a case of making a probabilistic statement about a
factual thing, which is how Bayesian statisticians approach placing
parameters in the context of probability distributions.

6.0.3 Consequences of the Difference

6.0.3.1 p(H|D) vs. p(D|H)

Bayesian statisticians get their name because Bayes’s Theorem is fundamental


to the way that they treat scientific inquiry: they are interested in the
probability of various hypotheses (H) given the data (D): p(H|D). As
covered in the section on Bayes’s Theorem in Probability Theory, there are
four parts to the theorem:

1. p(H ): the Prior Probability

The prior probability is the most subjective element of Bayesian analysis.


Bayesian analyses often employ priors that represent scenarios where each
hypothesis starts out as equally likely: these include flat or noninformative
priors. Such analyses do, however, include the option to add information from
scientific theory or prior studies. For example: a Bayesian analyst might
choose a prior that weighs the probabilities against an extraordinarily unlikely
hypothesis (like the existence of extrasensory perception or time travel), or
(more commonly) that includes probabilities obtained in earlier studies.

2. p(D|H ): the Likelihood

The likelihood is the conditional probability of the observed data given the
hypothesis. Typically, the likelihood is not a single value but a likelihood
function, as Bayesian analyses are often based on evaluating a range of
possibilities that need to be fed into a function. In an example like the
mutating-turtle experiment described above where the hypothesis is about a
rate parameter, the likelihood function would be the binomial likelihood
function. In other studies, the likelihood function might be based on other
probability density functions like that of the normal distribution. Like the
prior, there is some expertise required to choose the proper likelihood
function (in my experience, this is the part that Bayesians argue the most about
with other Bayesians).

3. p(D): the Base Rate

The base rate is the overall probability of the data under the conditions of all
hypotheses under consideration. It’s actually the least controversial term of
Bayes’s Theorem: once the prior and the likelihood are established, the base
rate is just the normalizing factor that makes the sum of all the posterior
probabilities of all the hypotheses under consideration equal to 1.

4. p(H |D): the Posterior Probability

The posterior probability is the result of the analysis. It is the probability or


probability distribution that allows us to evaluate different hypotheses.

In the Bayesian approach, p(H|D) implies that there is probability associated with the hypothesis but the data are facts. Classical Statisticians, famously, reject the notion that there can be a probability of a hypothesis. Thus, for Classicists, the probability in statistics is associated with the data and not the hypothesis. That means that the focus of Classical Statistics is on the likelihood function p(D|H) alone. That is, Classical techniques posit a
hypothesis (actually, a null hypothesis) and assess the probability of the data
based on that hypothesis.

6.0.3.2 Mathematical Implications

Classical approaches focus on the likelihood of the data and typically use likelihood functions (that give rise to probability distributions) that describe the distribution of the data given certain assumptions. The most common such assumption is that data are sampled from normal distributions. For example, we may assume that the data are sampled from a standard normal distribution. If an experiment results in an observed value with a z-score of 5, then the likelihood of sampling that value from the standard normal is pnorm(5, lower.tail=FALSE) = 2.8665157 × 10⁻⁷: not impossible, but seriously unlikely. In fact, that is so unlikely that we would reject the assumption that this value came from the standard normal in favor of the idea that the value came from some other normal distribution.

Classical approaches have been around so long and have become so widely
used that the procedures have been made computationally simple relative to
Bayesian approaches.9 Part of that relative simplicity is due to the
assumptions made about the probability distributions of the data. Another part
of the relative simplicity is that most of the most complicated work – for
example, all of the calculus that goes into estimating areas under the standard
normal curve – has been done for us. If you ever happen to read some of the
seminal papers in classical statistics (1/10 I can’t recommend it), you may
note the stunning mathematical complexity of the historical work on the
concepts. Classical statistical methods have been made easier (again,
relatively speaking) to meet the demands of the overwhelming growth of
applications of the field over the past century or so. Bayesian methods might
get there eventually.

Bayesian methods tend to be much more computationally demanding than their


Classical counterparts. There are two main reasons that Bayesian approaches
are more complex: 1. there are more moving parts to calculating p(H|D) than calculating p(D|H), and 2. they rely less on assumptions about probability
distributions.10 In terms of marbles-in-jars problems using Bayes’s Theorem,
there are lots of possible jars, and lots of different relative proportions of
different marbles in those jars, so calculating posterior probabilities involves
lots of calculus and/or computer simulations. A substantial amount of the
analyses that we can do today using modern computing and Monte Carlo
sampling were difficult-to-impossible decades ago (to be fair, the same can
be said for some classical analyses as well). The good news is that with
modern software, the calculations are much easier to do.

6.0.3.3 Comparability of Results

The decades of internecine fighting among statisticians belies the fact that the
results usually aren’t wildly different between Bayesian and Classical
analyses of the same data. Bayesian analyses tend to produce more precise
estimates, either because they don’t have to contend with the sample size
required to reject a null hypothesis or because they provide more concise
interval estimates.

Bayesian approaches are more flexible in that one can construct their own
evaluation of posterior probabilities to fit their study design needs. Classical
approaches are in a sense easier in that regard because their popularity has
led to the development of tools for almost any experimental design that a
behavioral scientist would use.

Classical approaches offer the authority of statistical significance, a concept


that Bayesian approaches do not have. The upside there for Bayesian
approaches is that without a null-hypothesis testing framework, they don’t
carry concerns about type-I and type-II errors.

6.1 Examples of Classical and Bayesian Analyses


In this section, we will elaborate on the example above of an experiment with
s = 16 successes in N = 20 trials, analyzing the data using both Classical

and Bayesian methods.

6.1.1 Classical Null Hypothesis Testing


In Classical Hypothesis Testing, a hypothesis is a statement about a
population tested with information (statistics) from a sample.11 In the
Classical paradigm, two mutually exclusive hypotheses are posited: a null
hypothesis (abbreviated H₀) and an alternative hypothesis (abbreviated H₁).

6.1.1.1 Null and Alternative Hypotheses

The null hypothesis is a description of a state where nothing is happening. In


a study design that investigates the correlation between variables, nothing is
happening means that there is no relationship between the variables, and the
null hypothesis statement describes the mathematical state of affairs that
would result from variables having no relationship: i.e., that the statistical
correlation between the variables is equal to 0. In a study design that
investigates differences between variables, nothing is happening means that
there is no difference between the variables, and the null hypothesis is a
mathematical statement that reflects that: i.e., that the statistics describing the
variables are equal to each other. The alternative hypothesis is the opposite of
the null hypothesis: it describes a state where something is happening: i.e.,
that the correlation between variables is not equal to 0, or that the statistics
describing variables are different from each other.

The null and alternative hypotheses are both statements about populations. They are, more specifically than described above, statements about whether nothing (in the case of the null) or something (in the case of the alternative) is going on in the population. In the case of an experiment designed to see if there are differences between the mean results under two conditions, for example, the null hypothesis might be that there is no difference between the population means for each condition, stated as μ_condition 1 − μ_condition 2 = 0 (note the use of Greek letters to indicate population means). That means that if the entire world population were subjected to each of these conditions and we were to measure the population-level means μ_condition 1 and μ_condition 2, the difference between the two would be precisely 0. However, even if the null hypothesis μ_condition 1 − μ_condition 2 = 0 were true, it would be extraordinarily unlikely that the difference between the sample means would be exactly 0, that is, that x̄_condition 1 − x̄_condition 2 = 0: if we take two groups of, say, 30 random samples from the same distribution, it's almost impossible that the means of those two groups will be exactly equal to each other.
Thus, it is pretty much a given that two samples will never be completely uncorrelated with each other, nor will two samples ever be completely identical to each other. But, based on a combination of how correlated to or how different from each other the samples are, the size of the samples, and the variation in the measurement of the samples, we can generalize our sample results to determine whether there are correlations or differences on the population level.

6.1.1.1.1 Types of Null and Alternative Hypotheses

The null, and by extension, the alternative hypotheses we choose can indicate
either a directional hypothesis or a point hypothesis. A directional
hypothesis is one in which we believe that the population parameter will be
either greater than the value indicated in the null hypothesis or less than the
value indicated in the null hypothesis but not both. For example, consider a
drug study where the baseline rate of improvement for the condition to be
treated by the drug in the population is 33%. In that case, 33% would be a
reasonable value for the rate in the null hypothesis: if the results from the
sample indicate that the success rate of the drug in the population would be
33%, that would indicate that the drug does nothing. If, in a large study with
lots of people, the observed success rate were something like 1%, there might
be a very small likelihood of observing that rate or smaller rates, but the
scientists developing that drug certainly would not want to publish those
results. Instead, that would be a case suited for a directional hypothesis that
would lead to rejection of the null hypothesis if the likelihood of the observed
successes or more was less than would be expected with a rate of 33%. The
null and alternative hypotheses for that example might be:
H0: π ≤ 0.33

H1: π > 0.33

Directional hypotheses are also known as one-tailed hypotheses because only observations on one side of the probability distribution suggested by the null hypothesis can potentially lead to rejecting the null.

Point hypotheses indicate that the population parameter is in any way different – greater than or less than – from the value stated in the null hypothesis.
For example: a marketing firm might be interested in whether a new ad
campaign changes people’s minds about a product. They may have a null
hypothesis that indicates a 50/50 chance of either improving opinions or not
improving opinions. In that case, evidence from sample data that indicates
highly improved opinions or highly declined opinions would indicate
substantial change and may – based on the cumulative likelihood of high
numbers of positive opinions or higher or on the cumulative likelihood of low
numbers of positive opinions or lower – lead to rejection of the null
hypothesis. A set of null and alternative hypotheses for that scenario may look
like this:
H0: π = 0.5

H1: π ≠ 0.5

A point hypothesis is also known as a two-tailed hypothesis because values in the upper range or values in the lower range of the probability distribution suggested by the null hypothesis can lead to rejection of the null.

The choice of whether to use a point hypothesis or a directional hypothesis should depend on the nature of the scientific inquiry (although there are some that will prescribe always using one or the other). Personally, I think that cases where scientists are genuinely ambivalent as to whether a result will be either small or large – which would suggest using a point hypothesis – are legitimate but less frequent than cases where scientists have a good idea of which direction the data should point, which would produce more directional hypotheses than point hypotheses. But, that is a matter of experimental design more than a statistical concern.

Classical Null Hypothesis testing follows an algorithm known as the six-step hypothesis testing procedure.

Figure 6.5: There are few things in this world I love as much as a clever setup to a dumb punchline.

This algorithm should be considered a basic guideline for the hypothesis testing procedure: the six steps are not etched in stone12, the wording of them varies from source to source, and sometimes the number of steps is cited as 5 instead of 6, but the basic procedure is always the same. The next section outlines the procedure, using the mutating-turtles example to help illustrate.
6.1.2 The Six (sometimes Five) Step Procedure

The basic logic of the six-step procedure is this: we start by defining the null
and alternative hypotheses and laying out all of the tests we are going to do
and standards by which we are going to make decisions.

We then proceed to evaluate the cumulative probability of at least the observed data (more on that in a bit) given the null hypothesis. This cumulative probability is known as the p-value: if the p-value is less than a predetermined threshold known as the α-rate, then we reject the null hypothesis in favor of the alternative hypothesis.

Rejecting the null hypothesis means that we have assumed a world where
there is no effect of whatever we are investigating and have analyzed the
cumulative likelihood of the observed data in that context and come to the
conclusion that it’s unlikely – based on the observed data – that there really is
no effect of whatever we are investigating. Given that we have rejected the
null, then the alternative is all that is left. In cases where we reject the null
hypothesis, we may also say that there is a statistically significant effect of
whatever it is we are studying.

If we analyze the cumulative likelihood of the data under the assumptions of the null hypothesis and find that it exceeds the α-rate, then we would not be able to rule out the null hypothesis: the data would be pretty likely if there were no effect. In such cases, we continue to assume the null hypothesis.

Please note that in the preceding paragraphs there is no mention of the terms
prove or accept or any form, derivation, or synonym of either of them. In the
Classical framework (and in the Bayesian framework, too), we never prove
anything, neither do we ever accept anything. That practice is consistent with
principles of early-20th century philosophy of science in that no scientific
theory is ever considered to be 100% proven, and it’s also consistent with the
notion that statistical findings are always, in some way, probabilistic rather
than deterministic in nature. The rejection of a null hypothesis, for example, is
not so much there’s no way the null can be true as the null is unlikely
enough given the likelihood threshold we have set.
6.1.2.1 p-values

As noted above, the decision as to whether to reject or to continue to assume the null hypothesis depends on the cumulative likelihood of the data, and the value of that cumulative likelihood given the null hypothesis is known as the p-value. p-values are the stiffest competition to confidence intervals for the title of most famously misunderstood concept in statistics: like the confidence interval, the p-value is pretty poorly named.

The p-value is a cumulative likelihood: if it weren’t a cumulative likelihood,


we would be rejecting null hypotheses in just about every scientific
investigation. The probability of any one specific event is vanishingly small,
especially for continuous variables but for discrete variables with large N as
well.13 Generally speaking, ranges produce much more meaningful
probabilities than do single values.

The range of the cumulative likelihood of a p-value is defined as the observed event plus more extreme unobserved events. That might sound odd, particularly the part about unobserved events. How can unobserved events affect our understanding of science? First, as noted in the preceding paragraph, the probability of just the observed event would be misleadingly small. Second, that reasoning isn't terribly unusual: you or someone you know might be or might have been in a situation where you or they need a certain grade on an upcoming assessment to get a certain grade in a course. You may have said or you may have heard somebody else say, "I need to score x% on this test to pass [or get a D in or an A in, etc.] this class." Let's say a student needs an 80% to get a C for the semester in a class. Would they be upset if they scored an 81%, or an 82%, or a 100%? No, they would not. In the context of classical statistics, the likelihood of interest is more like at least the observed data.
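To put numbers to that with the running ooze example (s = 16 successes in N = 20 trials under a null of π = 0.5), here's a quick sketch in R contrasting the probability of exactly the observed data with the cumulative likelihood of at least the observed data:

dbinom(16, 20, 0.5)                     #p(s = 16 exactly): 0.004621
pbinom(15, 20, 0.5, lower.tail = FALSE) #p(s >= 16): 0.005909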

That the p-value is a likelihood means that it is conditioned on something. That something is the conditions of the null hypothesis. The null hypothesis is a mathematical description of what would be expected if nothing were special at the population level. To some extent, researchers need to make choices about the null hypothesis. Sometimes these decisions are relatively straightforward: in the case of a cognitive task where participants have four options to choose from, a null hypothesis that states that the population parameter is equal to 1/4 makes sense (that would be what would happen if people guessed purely at random). Other times, the decisions are not as straightforward: in the fictional case of our mutating-turtles example, the null hypothesis might be based on a population parameter of 1/2, indicating that the ooze is just as likely to cause a mutation as not. But, we might instead define pure chance as something much smaller than 1/2 or much bigger than 1/2 based on what we know about turtle-mutating-ooze.14 What we choose as the null hypothesis parameter matters enormously to the likelihood of the data.

The misconception about p-values is usually expressed in one of two ways: 1. that the p-value is the probability of the observed data and 2. that the p-value is the probability that the results of an experiment were arrived at by pure chance. Both statements are wrong. Firstly, the p-value refers not just to the data but to at least the data because it's a cumulative likelihood. Secondly (and probably more importantly): the p-value can't be the probability of the data nor the probability that the data were a fluke because the p-value is conditioned on the choice of the null hypothesis. If you can change the null hypothesis (which includes the choice between using a point hypothesis or a directional hypothesis: the two types of hypotheses will always alter the p-value and may mean the difference between statistical significance and continuing to assume the null) and in doing so change the p-value, then the probability expressed cannot be a feature of the data themselves.
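A quick sketch of that last point in R: for the s = 16, N = 20 data, moving from a directional to a point hypothesis doubles the cumulative likelihood here (the doubling works because the π = 0.5 null distribution is symmetric):

pbinom(15, 20, 0.5, lower.tail = FALSE)     #one-tailed p-value: 0.005909
2 * pbinom(15, 20, 0.5, lower.tail = FALSE) #two-tailed p-value: 0.011818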

6.1.2.2 α-rates and Type-I Errors

The α-rate is also known as the false alarm rate or the Type I error rate. It is the rate of long-run repetitions of an experiment that will falsely reject the null hypothesis. No research study in the behavioral sciences produces definitive results: we never prove an alternative hypothesis. Thus, we have to have some standard for how small a p-value has to be to declare the data (cumulatively) unlikely enough to reject the null hypothesis. Whatever value for that threshold that we choose is also going to be the rate at which we are wrong about rejecting the null.
Let's say we choose an α rate of 0.05. Figure 6.6 is a representation of a standard normal distribution with the most extreme 5% of the distribution highlighted in the right tail. The distribution represents a null hypothesis that the population mean and standard deviation are 0 and 1, respectively, and the shaded area indicates all values that have a cumulative likelihood of 0.05 or less. Any result that shows up in the shaded region (that is, any value greater than or equal to 1.645) will lead to rejecting the null hypothesis: the shaded area is what is known as a rejection region.

Figure 6.6: Normal Distribution with an Upper-tail 5% Rejection Region

Again, anytime we have an observed value pop up somewhere in the rejection region, we will reject the null hypothesis in favor of the alternative hypothesis. Rejecting the null means that we have some reason to believe that the value was not sampled from the distribution suggested by the null hypothesis but rather that it came from some other distribution with different parameters. We might be right about that. But as illustrated in Figure 6.6, we will be wrong 5% of the time. Five percent of the values in the distribution just live in that region. If there were no other distribution, we would sample a value from that part of the curve 5% of the time.
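That 1.645 cutoff is just the 95th percentile of the standard normal distribution, which we can verify with one line of R:

qnorm(0.95, mean = 0, sd = 1) #upper-tail cutoff for alpha = 0.05: 1.644854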
So, what should we choose to be the α-rate? The most common choice is α = 0.05, which is a legacy of a decision made by R.A. Fisher (dick) that was partly informed by how easy it made calculations that had to be done using a slide rule.15 It's still common today largely because it has been handed down from stats teacher to stats teacher through the generations. Choosing α = 0.05 means that, on average, 5% of tests performed when the null hypothesis is really true will produce false alarms. It may not surprise you to learn that psychologists are increasingly tending towards choosing more stringent – i.e., smaller – α rates.

6.1.2.3 Type-II Errors

The complement to the Type-I error is the type-II error, also known as the β error, or a miss. A type-II error occurs when there is a real, population-level effect of whatever is being tested, but the test fails to reject the null hypothesis anyway. Figure 6.7 depicts a situation where the population distribution is different from the distribution represented by the null hypothesis. Whenever a value is sampled from this population distribution that is less than the values indicated by the rejection region under the null distribution, we will continue to assume the null hypothesis and be wrong about it.
Figure 6.7: Normal Distribution with an Upper-tail 5% Rejection Region and
an Alternative Distribution with Highlighted Misses

The long-term rate of misses is known as the β rate. The complement of the β rate – 1 − β – is the rate at which the null hypothesis will be correctly rejected. In the sense that it is the opposite of the miss rate we might call that the hit rate, but it is usually referred to as the power.

The table below indicates the matrix of possibilities based on whether or not there is a real, population-level effect (columns) and what the decision is regarding H0 (rows).

                                     Real Effect?
                                     Yes                        No
Decision: Reject H0                  Correct Rejection of H0    False Alarm
Decision: Continue to Assume H0      Miss                       H0 Correctly Assumed
There is a tradeoff between type-I and type-II errors: the more stringent we are about preventing false alarms (by adopting smaller α-rates), the more frequently we will miss real effects, and the more lenient we are about preventing false alarms (by adopting larger α-rates), the less frequently we will miss real effects. Which error – false alarm or miss – is worse? There are deontological prescriptions to answer that question – some believe that hits are worth the false alarms, others believe that avoiding false alarms is worth the misses – but I prefer a consequentialist viewpoint: it depends on the implications of the experiment.
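Here is a small sketch of that tradeoff in R, under the assumption (mine, for illustration, mirroring the setup of Figure 6.7) that the null distribution is standard normal and the true population distribution is normal with mean 2 and standard deviation 1:

crit05 <- qnorm(0.95) #critical value for alpha = 0.05: 1.645
crit01 <- qnorm(0.99) #critical value for alpha = 0.01: 2.326
pnorm(crit05, mean = 2, sd = 1) #beta-rate at alpha = 0.05: 0.361 (power = 0.639)
pnorm(crit01, mean = 2, sd = 1) #beta-rate at alpha = 0.01: 0.628 (power = 0.372)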

And now, with no further ado, the six-step procedure:

1. Describe the null and the alternative hypotheses

2. Set the Type-I error rate

3. Identify the statistical test you will use

4. Identify a rule for deciding between the null and the alternative
hypotheses

5. Obtain data and make calculations

6. Make a decision

In step 1, we state the null and the alternative. In the running example for this
page (turtles and ooze), we will adopt a directional hypothesis: we will reject
the null if the likelihood of the observed data or larger unobserved data is less
than our α rate (but not if the observed data indicate a rate less than what is
specified in the null). Thus, our null and alternative are:
H0: π ≤ 0.5

H1: π > 0.5

In step 2, we define our α-rate. Let’s save the breaks from tradition for a less
busy time and just say α = 0.05.
In step 3, we indicate the statistical test we are going to use. Since we have
binomial data, we will be doing a binomial test.

In step 4, we lay out the rules for whether or not we will reject the null. In this case, we are going to reject the null if the cumulative likelihood of the observed data or more extreme (in this case, larger values of s) data is less than the α-rate (which, as declared in step 2, is α = 0.05). Symbolically speaking, we can write out our step 4 as:

if p(s ≥ s_obs | π = 0.5, N = 20) ≤ 0.05: reject H0 in favor of H1
else: continue to assume H0

In step 5, we get the data and do the math. For our example, the data are:

s = 16, N = 20

And the cumulative likelihood of s ≥ 16 given π = 0.5¹⁶ and N = 20 is:

pbinom(15, 20, 0.5, lower.tail=FALSE) = 0.005909

In step 6, we make our decision. Because the cumulative likelihood – our p-value – is 0.0059 and is less than α, we reject the null hypothesis in favor of the alternative hypothesis.

There is a statistically significant effect of the ooze.
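For what it's worth, base R's binom.test() wraps steps 3 through 6 into a single call; here's a sketch replicating the decision (the alternative = "greater" argument matches our directional hypothesis):

binom.test(16, 20, p = 0.5, alternative = "greater") #reports p-value = 0.005909: reject H0 at alpha = 0.05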

6.1.3 Confidence Intervals

A 1 − α% confidence interval is an interval generated by an algorithm that, across repeated future samples, will produce intervals that capture the true population parameter 1 − α% of the time. It would make a lot more sense – especially given the name confidence interval – if the interval represented a range where we were 95% confident that the true population parameter was. But, making such a statement would invalidate the very core of the classicist philosophy: that a parameter cannot be described using probability. The distinction in the language is not just philosophical, though: the range of future interval estimates produces different values than would an interval estimate about the true parameter value. We know that because Bayesians do calculate such intervals, and they are frequently different from corresponding Classical confidence interval estimates for the same data.

The α in the term 1 − α% refers to the false alarm rate used in a statistical analysis. If the false alarm rate α = 0.05, then the Classical interval estimate is a 1 − 0.05 = 95% confidence interval. If α = 0.01, then the estimate is a 1 − 0.01 = 99% confidence interval. Just as 0.05 is the most popular α rate, 95% is the most popular value for the Classical confidence interval. For our example data, the value of the lower limit of the 95% confidence interval (0.563386) is the value of π for which p(s ≥ 16) = 2.5%. The value of the upper limit of the 95% confidence interval (0.942666) is the value of π for which p(s ≤ 16) = 2.5%.
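Those limits match the exact (Clopper-Pearson) interval that base R's binom.test() reports, which is constructed from precisely those tail conditions:

binom.test(16, 20)$conf.int

## [1] 0.5633862 0.9426660
## attr(,"conf.level")
## [1] 0.95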

Confidence intervals are useful for comparison: a narrower confidence interval indicates higher precision, and a wider confidence interval indicates lower precision. Confidence intervals that don't include values of interest – in this case, 0.5 might be considered interesting because it suggests pure chance – are indicators of statistical significance (even though the 6-step procedure hasn't been strictly followed). Confidence intervals for multiple samples that don't overlap indicate significant differences; confidence intervals that do overlap suggest a lack of significant differences (though that is only a rough guide: intervals that overlap slightly can still correspond to a significant difference).

6.2 Bayesian Inference


6.2.1 Posterior Probabilities

The Classical approach focuses on likelihoods: p(D|H), the probability of the data given the hypothesis (namely, the null hypothesis). It places all of the probability on the observation of data. Data, in the classical approach, are probabilistic events – they depend on randomness in sampling or in unexpected changes in experimental procedures – and hypotheses are fixed elements, things that either are something or are not. The Bayesian approach focuses on posterior probabilities: p(H|D), which are the conditional-probability antonyms of likelihoods. In the Bayesian approach, it is hypotheses that are described with probability based on the data, which are considered fixed facts: observations that happened with certainty.
This is how Bayesians get their name: the focus on the probability of hypotheses invokes the use of Bayes's Theorem –

p(H|D) = p(H) p(D|H) / p(D)

– to use posterior probabilities to describe hypotheses given the observed data.

As described in the page on probability theory, there are four parts to Bayes’s
Theorem: the prior probability, the likelihood, the base rate, and the posterior
probability. Using the turtle-mutating ooze experiment example, the next
section will illustrate how we use Bayes’s Theorem to investigate a scientific
hypothesis.

The scientific investigation represented by this example is centered around a proportion: given that s = 16 successes were observed in N = 20 trials, what can we say about the population-level probability parameter? That is: what is the overall probability that the ooze will turn a teenaged turtle into a teenaged mutant turtle (that may or may not practice ninjitsu)?

We will start with the prior probability. A common practice in Bayesian statistics is to choose priors that indicate that each possibility is equally likely, and that's what we'll do here. The population parameter has to be a value between 0 and 1. Let's take evenly-spaced possible values of π – generating 11 candidate values π_i = {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} – and use a uniform prior probability p(H) distribution that makes all 11 equally likely, as shown in Figure 6.8:

Figure 6.8: Prior Probability Distribution for π_i

Next, let's calculate the likelihood p(D|H) for each of the 11 possible values of π using the binomial likelihood function:

p(s = 16 | N = 20, π = π_i) = (20! / (16! 4!)) (π_i)^16 (1 − π_i)^4

The resulting likelihood distribution is shown in Figure 6.9.

Figure 6.9: Likelihood Distribution for π_i

The base rate p(D) is the value that will make all of the posterior probabilities that we calculate for each π_i sum to 1. To get that, we take the sum of the products p(H)p(D|H) for each value of π_i:

p(D) = Σ_{i=1}^{11} p(π_i) p(D|π_i) = 0.0434

Finally, for each value of π_i, we multiply the prior probability by the likelihood and divide each value by p(D) (which is the constant 0.0434) to get the posterior probability distribution shown in Figure 6.10:

Figure 6.10: Posterior Probability Distribution for π_i
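That whole grid calculation takes only a few lines of R; here is a sketch of it (the variable names are mine) that reproduces the base rate and the posterior:

pi_i  <- seq(0, 1, by = 0.1)     #the 11 candidate values of pi
prior <- rep(1/11, length(pi_i)) #uniform prior: all candidates equally likely
lik   <- dbinom(16, 20, pi_i)    #binomial likelihood of s = 16 in N = 20 for each candidate
p_D   <- sum(prior * lik)        #the base rate: 0.0434
posterior <- prior * lik / p_D   #posterior probabilities, which sum to 1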

And, wouldn't you know it? That looks an awful lot like the β distribution for s = 16 and f = 4 (see Figure 6.4 above). In fact, if we divided the [0, 1] interval more finely and took 101 values of π_i, it would look even more like the β. And if we had infinitely many values of π_i, it would be the β. The β distribution is the natural conjugate function for binomial data, meaning that it's the result of Bayes's theorem when you have binomial data. As a result, binomial data are pretty much the easiest thing to analyze with Bayesian methods.

As a natural conjugate function, if the prior distribution is a β then the posterior distribution is also a β. When the β is used as a prior, we use the symbols s′ ("s prime") and f′ ("f prime") to indicate that they are parameters for a prior distribution. The observed data from an experiment are still called s and f. The resulting posterior distribution uses the symbols s′′ ("s double prime") and f′′ ("f double prime"). In our example, we want to start with a prior probability that indicates that everything is equally likely. To do so, we will use a flat prior (the Bayesian term for a uniform prior distribution), specifically, the β distribution with s′ = 0 and f′ = 0.¹⁷
To our prior β shape parameters s′ and f′, we add our experimental data s and f to get the posterior shape parameters s′′ and f′′:

s′′ = s′ + s = 0 + 16 = 16

f′′ = f′ + f = 0 + 4 = 4

The resulting posterior β distribution is therefore based on the sufficient statistics s′′ = 16 and f′′ = 4.¹⁸ The mean of the posterior distribution is:

mean(π) = (s′′ + 1) / (s′′ + f′′ + 2) = 17/22

the variance is:

var(π) = π(1 − π) / (s′′ + f′′ + 3) = 0.0061

and the standard deviation is:

sd(π) = √(π(1 − π) / (s′′ + f′′ + 3)) = 0.078.

Using the posterior β distribution, we can make all kinds of probabilistic statements about the population parameter π. For example, we may want to know what the probability is that π > 0.5:

pbeta(0.5, 17, 5, lower.tail=FALSE) = 0.9964013

6.2.2 Bayesian Interval Estimates

We also might be interested in the Bayesian counterpart to the confidence interval, known as the highest density interval (abbreviated HDI), or the credible interval (abbreviated CI, the same as confidence interval is, which is obnoxious). The Bayesian HDI differs from the classical confidence interval in a couple of ways. The first is philosophical: the Bayesian HDI is the range in which we are x% confident that the population parameter is. The second is mathematical: the highest density interval is not necessarily symmetric (as most Classical confidence intervals are) but represents the narrowest possible interval that defines x% of the area under a posterior probability curve. That doesn't make much difference for distributions that are themselves symmetric, but for skewed distributions (like the β in our example), the HDI can be substantially narrower than a symmetric interval would be, allowing for more precise evaluation of the population parameter.

To calculate the HDI (or credible interval, if you prefer. I don’t.), we’re going
to need some software help.19 We can install the R package HDInterval to
find HDI limits:

library(HDInterval) #load package

x<-seq(0, 1, 1/1000) #create a vector of x values to help define the quantiles of the distribution
hdi(qbeta(x, 17, 5), 0.95) #get the 95% HDI associated with the beta(17, 5) posterior

## lower upper
## 0.6000603 0.9303378
## attr(,"credMass")
## [1] 0.95

Thus, the tightest possible range that defines 95% of the area under the β
curve for s = 16, f = 4 has a lower limit of 0.60 and an upper limit of 0.93.
By convention, we report Bayesian intervals as:

p(0.60 ≤ π ≤ 0.93) = 0.95

6.2.2.1 The Metropolis-Hastings Markov-Chain Monte Carlo Algorithm

The β distribution is relatively painless to use for Bayesian analyses: the shape parameters of the β come straight from the data. Unfortunately, the proper distribution isn't always so clear, especially in cases where there are multiple population parameters that we're trying to learn about.

In those cases, Bayesians often turn to Markov Chain Monte Carlo methods to
estimate posterior probability distributions on parameters. The most broadly
useful Monte Carlo method for estimating probability distributions is the
Metropolis-Hastings Algorithm.
Figure 6.11: Can’t stop, won’t stop

The Metropolis-Hastings Algorithm, like all Markov Chain Monte Carlo (MCMC) methods, is a sophisticated version of a random walk model. In this case, the steps are between parameter estimates, and the path is guided by the relative likelihoods of the data given the parameter estimate options. We start by choosing a starting value for our parameters – it doesn't super-matter what we choose, but if we make more educated guesses then the algorithm works a little more efficiently. Then, we generate another set of parameters: if the new set of parameters increases the likelihood of the data or results in the same likelihood, then we accept the new set of parameters. If the new set of parameters decreases the likelihood of the data, then we will still accept the new parameters if the ratio of the new likelihood and the old likelihood is greater than a randomly generated number from a uniform distribution between 0 and 1. The second part – sometimes accepting new parameters if they decrease the likelihood of the data – is the key to generating the target probability distribution: if we only ever accepted parameters that increased the likelihood of the data, then the estimates would converge on a single value rather than the distribution. Accepting new parameters if the likelihood ratio exceeds a random number between 0 and 1 means that parameters that decrease the likelihood of the model are more likely to be accepted if they slightly decrease the likelihood than if they decrease the likelihood by a lot.20 Then, we repeat the process again and again about a million times (deciding on how many iterations to use is more art than science, but the answer is usually: a lot).
Let’s pretend for now that we were unaware of the Bayesian magic of the β
distribution and its shape parameters and had to generate it from scratch. Here
is how we would proceed:

1. Choose a semi-random starting parameter. We'll call it π1 and start it at 0.5 (can't be too far off to start if we start in the middle of the range).

2. Randomly sample a candidate replacement parameter. We'll call that π2 and generate it by taking a random number from a uniform distribution between 0 and 1.

3. Calculate the likelihood ratio r:

r = p(D|π2) / p(D|π1)

between the probability density using the new parameter and the old parameter. For our data, the probability densities are given by:

p(D|π_i) = (20! / (16! 4!)) π_i^16 (1 − π_i)^4

4. If the ratio r is ≥ 1, we will accept the new parameter: π2 becomes the new π1. We'll write it down (with software, of course) before we move on.

a. If the ratio r is < 1, randomly sample a number we'll call u between 0 and 1 from a uniform distribution.

b. Then, if the ratio r is > u, then we will accept the new parameter: π2 is the new π1. We'll write it down.

c. If the ratio r is ≤ u, then we will reject the new parameter: π1 stays π1. We'll write that down.

5. Repeat steps 2 – 4 like a million times, writing down either π1 or π2 – whichever wins the algorithm – each time.

Here's how that looks in R code:

set.seed(77)
s<-16
N<-20
theta<-c(0.5, rep(NA, 999999)) #pre-allocate the full chain: start at 0.5, with 999,999 empty slots
for (i in 2:1000000){ #Number of MCMC Samples
theta2<-runif(1) #Sample a candidate theta from a uniform distribution
r<-dbinom(s, N, theta2)/dbinom(s, N, theta[i-1]) #calculate the likelihood ratio r
u<-runif(1) #Sample u from a uniform distribution
theta[i]<-ifelse(r>=1, theta2, #replace theta according to the Metropolis-Hastings rule
ifelse(r>u, theta2, theta[i-1]))
}

And the resulting distribution of the samples is shown in Figure 6.12.

Figure 6.12: Results of 1,000,000 Markov Chain Monte Carlo Samples Selected Using the Metropolis-Hastings Algorithm
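Because the chain is just a vector of (approximate) samples from the posterior, we can summarize it directly; a sketch reusing theta and the HDInterval package from above (the exact values will wobble slightly from run to run):

mean(theta)      #posterior mean; close to the analytic 17/22 = 0.773
hdi(theta, 0.95) #95% HDI; close to the analytic limits of 0.60 and 0.93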

6.2.3 Bayes Factor

The last major Bayesian tool we'll discuss in this introduction is the Bayes Factor. The Bayes Factor is a way of evaluating competing models.

Let's posit a hypothetical model H1. The prior odds in favor of H1 are p(H1) / (1 − p(H1)). If there are two models – H1 and H2 – then the prior probability of H2 is equal to 1 − p(H1), and the prior odds in favor of H1 are:

p(H1) / p(H2)

Likewise, the posterior odds in favor of H1 given the data are p(H1|D) / (1 − p(H1|D)), the posterior probability of H2 given the data D is equal to 1 − p(H1|D), and the posterior odds in favor of H1 are:

p(H1|D) / p(H2|D)

Using Bayes's Theorem, we can derive the following relationship between the prior odds and the posterior odds (note that p(D) cancels out in the numerator and denominator – that's why it's missing):

p(H1|D) / p(H2|D) = [p(H1) / p(H2)] × [p(D|H1) / p(D|H2)]

The Bayes Factor is the factor by which the likelihood of model H1 increases the posterior odds from the prior odds relative to the likelihood of model H2 – it's the rightmost part of the equation above.

B.F. = p(D|H1) / p(D|H2)

Generally, the Bayes Factor is calculated by integrating both likelihoods over the range of all of their possible parameters:

B.F. = ∫ p(D|H1) / ∫ p(D|H2)

All things being equal, the integrated likelihood of a more complex model will be smaller than the integrated likelihood of a simpler model – a model with more parameters will stretch its likelihood across a much larger space. That means that the Bayes Factor naturally favors simpler models.

Which model is H1 and which is H2 is arbitrary. In practice, the model with the larger likelihood goes in the numerator so that the Bayes Factor is always reported as being ≥ 1: so the larger the Bayes Factor, the greater the evidence in favor of the more likely model. There are two sets of guidelines for interpreting Bayes Factors. Both are kind of arbitrary. I don't have a preference.

Jeffreys (1961)                                       Kass & Raftery (1995)
Bayes Factor   Interpretation                         Bayes Factor   Interpretation
1 to 3.2       Not worth more than a bare mention     1 to 3         Not worth more than a bare mention
3.2 to 10      Substantial                            3 to 20        Positive
10 to 100      Strong                                 20 to 150      Strong
> 100          Decisive                               > 150          Very Strong
Happily, we don’t need to use calculus for the example that wraps up this
discussion of Bayesian inference. Let’s go to the ooze one more time!

Suppose we had two competing hypotheses for the population parameter π that describes the probability of a turtle mutating when coming into contact with radioactive ooze. The first hypothesis is H1: π = 0.8. The second hypothesis is H2: π = 0.5. We can use Bayes's theorem to compare them, and, because the likelihood functions of the two hypotheses (both binomial likelihood functions) differ only by the parameter value, a simple likelihood ratio gives the Bayes Factor:

B.F. = p(D|H1) / p(D|H2) = [(20! / (16! 4!)) (0.8)^16 (0.2)^4] / [(20! / (16! 4!)) (0.5)^16 (0.5)^4] = [(0.8)^16 (0.2)^4] / [(0.5)^16 (0.5)^4]

B.F. = 47.22366

That Bayes Factor is considered strong evidence in favor of the π = 0.8 model by both the Jeffreys (1961) guidelines and the Kass & Raftery (1995) guidelines.
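Because the binomial coefficients cancel, that Bayes Factor is a one-liner in R:

dbinom(16, 20, 0.8) / dbinom(16, 20, 0.5)

## [1] 47.22366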

1. Please note that the little hat thing (ˆ) is supposed to be right on top of π.
Html, for some reason, doesn’t render that correctly.↩

2. The scientific principle of parsimony motivates explanations of phenomena to be as simple as possible but no simpler. We'll cover the importance of parsimony and how it is emphasized in statistical analysis, but there is such a thing as too simple.↩

3. I am well aware that the most important problem associated with becoming a cashless society is worsening income inequality. But also: I would be left with pretty much no ability to explain statistics without the existence of coins.↩

4. I prefer using s + 1 and f + 1 to describe the shape parameters because it makes it easier to connect the distribution of proportions to observed successes and failures, but from my reading, α and β are more common, and also the input that software packages including R expect.↩

5. This kind of logic is formally known as modus tollens, a Latin phrase that roughly translates to "the way that denies by denying." If the statement if p then q is true, then the contrapositive of the statement if not q then not p must be true. For example, the statement if it rained, then the street will be wet implies that if the street is not wet, then it did not rain. In the case of hypothesis testing, to say if not the data we expect, then not the hypothesis is a useful tool when your philosophy prevents you from directly evaluating the hypothesis itself.↩

6. Two such methods that we will talk about in future content are the
Wilson Score Interval and the Normal Approximation (or Asymptotic)
Method.↩

7. Now is an excellent time to point out that the "confidence interval" is perhaps the worst-named thing in statistics.↩

8. The founders of Classical Statistics are famously and rightly associated with the eugenics movement: I include that here not as evidence that Frequentism is an inferior approach to statistics but because I think it merits mention as often as possible. Also, at least one prominent Bayesian was a literal fascist.↩

9. To be clear: Classical statistics are not what most would consider simple: but they are, by and large, simple relative to the Bayesian alternative.↩

10. On the relatively rare occasions in which Classical analyses can't rely on assumptions about data distributions, they can become super complex.↩

11. The term hypothesis testing is generally broad and can include any
scientific inquiry where an idea is tested with observations and/or
experimentation. In the context of statistics, however, hypothesis testing
has come to be synonymous with classical null hypothesis testing and is
used as a shorthand for that approach.↩
12. I should say not etched in stone yet but keep refreshing this link.↩

13. For example, the probability of seeing exactly 1,000,000 heads in 2,000,000 coin flips is 0.0005641895.↩

14. I have to confess that all I know about turtle mutation comes from
watching cartoons (and movies and also playing both video games) as a
child and based on that I would guess that the mutation rate of
radioactive ooze is 100%.↩

15. If you have a moment, you’re gonna wanna click on that slide rule link.↩

16. We use π = 0.5 because that is the number in the statement of the null
hypothesis.↩

17. s′ = f′ = 0 indicates that we have no information included in the prior distribution. For that reason, this particular prior is an example of a noninformative prior because it does not sway the shape of the posterior distribution in any way. This specific prior is also known as the ignorance prior, which is just rude.↩

18. Or in the other terms used to describe the shape parameters of the β,
α = s′′ + 1 = 17, β = f ′′ + 1 = 5.↩

19. There are some HDI tables, but as with all tables, they are limited to a
certain amount of possible parameter values.↩

20. For a proof of how generating a random number from a uniform distribution between 0 and 1 generates the desired probability distribution, see Chechile, R. (2020) Bayesian Statistics for Experimental Scientists. Cambridge, MA: MIT Press.↩
7 Correlation and Regression
7.1 Correlation Does Not Imply Causation…

The above chart comes from Tyler Vigen’s extraordinary Spurious Correlations
blog. It’s worth your time to go through the charts there and see what he has
found in what must be an enormous wealth of data.

Correlation does not imply causation is literally one of the oldest principles in the practice of statistics. It is parroted endlessly in introductory statistics courses by me and people like me. It is absolutely true that if we look through enough pairs of variables, we will find some pairs of things that are totally unrelated – like cheese consumption and civil engineering PhD completion – that appear to be connected.

It is also true – and probably more dangerous than coincidence – that there are some variables that correlate not because one causes the other but because a third variable causes both. Take, for example, the famous and probably apocryphal example of the correlation between ice cream sales and violent crime: the idea there is that ambient temperature is correlated with both ice cream consumption and crime rates. Now, the ice cream part of it is most likely just humorous conjecture, and there have been real studies that link heat with aggression, but it is less likely that the link between crime rates and outdoor temperatures is due to heat causing violence, and more likely that it is due to higher rates of social interaction in warmer weather that create more opportunities for interpersonal violence.

The decline of sales of digital cameras over the past decade is a well-
researched trend. Here, here’s the research:

And here’s another trend you may find intriguing:


The conclusion is clear: athleisure killed the digital camera.

Or maybe not. But something led to the decline of sales of digital cameras.

7.1.1 …but it Doesn’t Imply NOT Causation either

There is another annual trend that correlates inversely with sales of digital
cameras:
Again, correlation does not imply causation is a true statement. But it is
simultaneously true that when there is causation, there is also correlation.
Correlation techniques – more specifically, regression techniques – are the
foundation of most classical and some Bayesian statistical approaches, and as
such are widely used in controlled experimentation: the gold standard of
establishing causation. So, yes, correlation doesn’t by itself establish cause-
and-effect, but it is a good place to start looking.

7.2 Correlation
7.2.1 The Product Moment Coefficient r

Correlation is most frequently measured by the Classical statistic r: the Pearson Product Moment Coefficient. It is the coefficient of the standardized score of a predictor variable in a regression equation predicting the standardized score of an outcome variable (more on that in the discussion of regression below).

7.2.1.1 Direction and Magnitude


There are two main features of interest to the value of r: its sign (positive or
negative) and its magnitude (its absolute value). The sign of r indicates the
direction of the correlation. A positive value of r indicates that increases in one
of a pair of variables correspond generally to increases in the other member of
the pair and vice versa: this type of correlation is known as direct or simply
positive. A negative value of r indicates that increases in one of a pair of
variables correspond generally to decreases in the other member of the pair and
vice versa: that type of correlation is known as indirect, inverse, or simply
negative.

The absolute value of r is a measure of how closely the two variables are
related. Absolute values of r that are closer to 0 indicate that there is a loose
relationship between the variables; absolute values of r that are closer to 1
indicate a tighter relationship between the variables. Jacob Cohen1
recommends the following guidelines for interpreting the absolute value of r:

Magnitude of r Effect Size


r > 0.5 Strong
0.3 < r < 0.5 Moderate
0.1 < r < 0.3 Weak
Note:
Cohen considered r statistics with absolute value less than 0.1 too unlikely to
be statistically significant to include. They do exist, but if you wanted to cite
Cohen’s guidelines to describe their magnitude you’d have to go with
something like ‘smaller than small’

The magnitude of r is an example of an effect size: a statistic that indicates the degree of association (in the case of correlation) or difference (in pretty much every other case) between groups. As with all effect size interpretation guidelines, they are merely intended to be a general framework: the evaluation of any effect size should be interpreted in the context of similar scientific investigations. As Cohen himself wrote:

“there is a certain risk inherent in offering conventional operational


definitions for those terms for use in power analysis in as diverse a field
of inquiry as behavioral science”

Figure 7.1 illustrates the variety of direction and magnitude of r in scatterplots.

Figure 7.1: Scatterplots Indicating Positive and Negative; Strong, Moderate, and Weak Correlations

7.2.1.2 Hypothesis Testing and Statistical Significance

The null hypothesis in correlational analysis is based on the assumption that r2 is equal to, greater than or equal to, or less than or equal to 0. A one-tailed test can indicate that a positive correlation is expected:

H0 : r ≤ 0

H1 : r > 0

or that a negative correlation is expected:


H0 : r ≥ 0

H 1 : r < 0.

A two-tailed hypothesis suggests that either a positive or a negative correlation


may be considered significant:
H0 : r = 0

H1 : r ≠ 0

Critical values of r can be found by consulting tables like this one. The critical
values of r are the smallest values given the df – which for a correlation is
equal to n − 2 – that represents a statistically significant result given the
desired α-rate and type of test (one- or two-tailed).

7.2.1.3 Assumptions of Parametric Correlation

Like most parametric tests in the Classical framework, the results of parametric correlation are based on assumptions about the structure of the data (the assumptions of frequentist parametric tests will get their own discussion later). The main assumption to be concerned about3 is the assumption of normality.

The normality assumption is that the data are sampled from a normal distribution. The sample data themselves do not have to be normally distributed – a common misconception – but the structure of the sample data should be such that they plausibly could have come from a normal distribution. If the observed data were not sampled from a normal distribution, the correlation will not be as strong as it would be if they were. If there are any concerns about the normality assumption, the best and easiest way to proceed is to back up a parametric correlation with one of the nonparametric correlations described below: usually Spearman's ρ, sometimes Kendall's τ, or, if the data are distributed in a particular way, Goodman and Kruskal's γ.

The secondary assumption for the r statistic is that the variables being correlated have a linear relationship. As covered below, r is the coefficient for a linear function: that function does a much better job of modeling the data if the data scatter linearly about the line.

If the data are not sampled from a normal distribution and/or linearly related, then r just won't work as well. It's not that the earth beneath the analyst opens up and swallows them whole when assumptions of classical parametric tests are violated: it's just that the tests don't work as intended, leading to type-II errors (if anything).

7.2.2 Parametric Correlation Example


Let’s use the following set of n = 10 paired observations, each with an x value
and a y value. Please note that it’s pretty rare – but certainly not impossible – to
run a correlation with just 10 pairs in an actual scientific study; we’ll just limit
it to 10 here because anything more than that gets hard to follow. More
importantly, please note that n refers to pairs of observations in the context of
correlation and regression. We get n not from the number of numbers (which, in
this case, will be 20) but by the number of entities being measured, and with
correlation each pair of x and y are different measurements related to the same
thing. For example, imagine we were seeing if there were a correlation
between height and weight and we had n = 10 people in our dataset: we would
have 10 values for heights and 10 values for weights which is 20 total numbers,
but the 20 numbers still refer to 10 people.

Where were we? Right, the data we are going to use for our examples. It’s a
generic set of n = 10 observations numbered sequentially from 1 → 10, 10
observations for a generic x variable, and 10 observations for a generic y
variable.

n x y

1 4.18 1.73
2 4.70 1.86
3 6.43 3.61
4 5.04 2.09
5 5.25 2.00
6 6.27 4.07
7 5.34 2.56
8 4.21 0.33
9 4.17 1.49
10 4.67 1.46

And here is a scatterplot, with x on the x-axis and y on the y-axis:


Figure 7.2: Scatterplot of the Sample Data

7.2.2.1 Definitional Formula

The product-moment correlation is the sum of the products of the standardized scores (z-scores) of each pair of variables divided by n − 1:

r = Σ_{i=1}^{n} (z_xi · z_yi) / (n − 1)

This formula is thus known as the definitional formula: it serves as both a description of and a formula for r. To illustrate, we fill the below table out using our sample data:

        Observed Values    z-transformed Values    z product
Obs i   x_i     y_i        z_xi      z_yi          z_xi · z_yi
1       4.18    1.73       -1.03     -0.36         0.37
2       4.70    1.86       -0.40     -0.24         0.10
3       6.43    3.61        1.72      1.38         2.37
4       5.04    2.09        0.02     -0.03         0.00
5       5.25    2.00        0.27     -0.11         -0.03
6       6.27    4.07        1.52      1.81         2.75
7       5.34    2.56        0.38      0.41         0.16
8       4.21    0.33       -1.00     -1.66         1.66
9       4.17    1.49       -1.05     -0.58         0.61
10      4.67    1.46       -0.44     -0.61         0.27

Taking the sum of the rightmost column and dividing by n − 1, we get the r statistic:

r = Σ (z_xi · z_yi) / (n − 1) = 0.9163524

We can check that math using the cor.test() command in R (which we could
have just done in the first place):

cor.test(x, y)

##
## Pearson's product-moment correlation
##
## data: x and y
## t = 6.4736, df = 8, p-value = 0.0001934
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6777747 0.9803541
## sample estimates:
## cor
## 0.9163524
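As a sanity check on the definitional formula itself, here's a sketch that builds r directly from z-scores in R (scale() z-transforms a vector; x and y are the sample data above):

x <- c(4.18, 4.70, 6.43, 5.04, 5.25, 6.27, 5.34, 4.21, 4.17, 4.67)
y <- c(1.73, 1.86, 3.61, 2.09, 2.00, 4.07, 2.56, 0.33, 1.49, 1.46)
sum(scale(x) * scale(y)) / (length(x) - 1) #0.9163524, matching cor.test()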

7.2.2.2 Computational Formula

r can alternatively and equivalently be calculated using what is known as the computational formula:

r = Σ (x − x̄)(y − ȳ) / √(SS_x · SS_y)

where

SS_x = Σ (x − x̄)²

and

SS_y = Σ (y − ȳ)²

That formula is called the computational formula because apparently it's easier to calculate. I don't see it. But, here's how we would approach using the computational formula:

        Observed Values    Deviations            Product of Deviations    Squared Deviations
Obs i   x_i     y_i        (x − x̄)    (y − ȳ)    (x − x̄)(y − ȳ)           (x − x̄)²   (y − ȳ)²
1       4.18    1.73       -0.85      -0.39      0.33                     0.72       0.15
2       4.70    1.86       -0.33      -0.26      0.08                     0.11       0.07
3       6.43    3.61        1.40       1.49      2.09                     1.97       2.22
4       5.04    2.09        0.01      -0.03      0.00                     0.00       0.00
5       5.25    2.00        0.22      -0.12      -0.03                    0.05       0.01
6       6.27    4.07        1.24       1.95      2.43                     1.55       3.80
7       5.34    2.56        0.31       0.44      0.14                     0.10       0.19
8       4.21    0.33       -0.82      -1.79      1.46                     0.67       3.20
9       4.17    1.49       -0.86      -0.63      0.54                     0.73       0.40
10      4.67    1.46       -0.36      -0.66      0.23                     0.13       0.44

Σ(x − x̄)(y − ȳ) = 7.2782

SS_x = Σ(x − x̄)² = 6.01504

SS_y = Σ(y − ȳ)² = 10.4878

r = Σ(x − x̄)(y − ȳ) / √(SS_x · SS_y) = 7.2782 / √((6.01504)(10.4878)) = 0.9163524

See? Same answer.

7.2.3 Nonparametric correlation

The r statistic is the standard value used to describe the correlation between
two variables. It is based on the assumptions that the data are sampled from a
normal distribution and that the data have a linear relationship. If the data
violate those assumptions – or even if we are concerned that the data might
violate those assumptions – nonparametric correlations are a viable
alternative.

Nonparametric correlations – and nonparametric tests in general – are so-named because they do not make inferences about population parameters: they do not involve means, nor standard deviations, nor combinations of the two (as r is), in either the null or alternative hypotheses. They are, instead, based on the cumulative likelihood of the observed pattern of the data or more extreme unobserved patterns of the data.4

For example, consider the following data:

x  y
1  1
2  2
3  3
4  4
5  5
6  6

Please notice that x and y are perfectly correlated: in fact, they are exactly the
same. What is the probability of that happening? To answer that, we need to
know how many possible combinations of x and y there are, so we turn to
permutation.
To find the number of combinations of x and y, we can leave one of the variables fixed as is and then find out all possible orders of the other variable: that will give us the number of possible pairs. That is: if we leave all the x values where they are, we just need to know how many ways we could shuffle around the y values (or we could leave y fixed and shuffle x – it doesn't matter). The number of possible orders for the y variable (assuming we left x fixed) is given by:

nPr = 6P6 = 6! / 0! = 720

So, there are 720 possible patterns for x and y. The probability of this one
pattern is therefore 1/720 = 0.0014. If we have a one-tailed test where we are
expecting the relationship between x and y to be positive, then the observed
pattern of the data is the most extreme pattern possible (can’t get more positive
agreement than we have), and 0.0014 is the p-value of a nonparametric
correlation (specifically, the Spearman correlation).

cor.test(1:6, 1:6, method="spearman", alternative="greater")

##
## Spearman's rank correlation rho
##
## data: 1:6 and 1:6
## S = 0, p-value = 0.001389
## alternative hypothesis: true rho is greater than 0
## sample estimates:
## rho
## 1

If we have a two-tailed test where we do not have an expectation on the direction of the relationship, then there would be one more equally extreme possible pattern: the case where the variables went in perfectly opposite directions (x = {1, 2, 3, 4, 5, 6} and y = {6, 5, 4, 3, 2, 1} or vice versa). With two patterns being as extreme as the observed pattern, the p-value would then be 2/720 = 0.0028.

cor.test(1:6, 1:6, method="spearman", alternative="two.sided")

##
## Spearman's rank correlation rho
##
## data: 1:6 and 1:6
## S = 0, p-value = 0.002778
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 1

7.2.3.1 Spearman’s ρ

The ρ is the r of the ranks: ρ is also known as the rank correlation. There are fancy equations to derive ρ from the ranks, but I have no idea why you would need to use them instead of just correlating the ranks of the data.

Returning to our example data, we can rank each value relative to the other values of the same variable from 1 to n. Usually the ranks go from smallest to largest, but it doesn't really matter so long as the same convention is followed for both the x and the y values. In the case of ties, take the average of the ranks above and below the tie cluster. If, for example, the data were {4, 7, 7, 8}, the ranks would be {1, 2.5, 2.5, 4}.

x y rank(x) rank(y)

4.18 1.73 2 4
4.70 1.86 5 5
6.43 3.61 10 9
5.04 2.09 6 7
5.25 2.00 7 6
6.27 4.07 9 10
5.34 2.56 8 8
4.21 0.33 3 1
4.17 1.49 1 3
4.67 1.46 4 2

We then could proceed to calculate the correlation r between the ranks. Or,
even better: we can calculate the ρ correlation for the observed data:

cor.test(x, y, method="spearman")
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 20, p-value = 0.001977
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8787879

And note that the ρ is the same value as if we calculated r for the ranks5
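(Here xrank and yrank are assumed to be the rank vectors from the table above, e.g.:)

xrank <- rank(x)
yrank <- rank(y)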

cor.test(xrank, yrank, method="pearson")

##
## Pearson's product-moment correlation
##
## data: xrank and yrank
## t = 5.2086, df = 8, p-value = 0.0008139
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5577927 0.9710980
## sample estimates:
## cor
## 0.8787879

7.2.3.2 Kendall’s τ

Where ρ is based on patterns of ranks, Kendall's τ correlation is based on patterns of agreement. In the annotated scatterplot in Figure 7.3, the lines connecting four of the possible pairs between the five datapoints – the concordant pairs – move in the same direction (from southwest to northeast), while the line connecting one pair of the datapoints – the discordant pair – moves in the opposite direction.

Figure 7.3: Pairwise Comparisons of Sample Dataset with Mixed Concordance and Discordance (Example 3)

The τ for the example data (the original example data, not the data for Figure
7.3) is:

cor.test(x, y, method="kendall")

##
## Kendall's rank correlation tau
##
## data: x and y
## T = 39, p-value = 0.002213
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.7333333

7.2.3.2.1 A Bayesian Nonparametric Approach to Correlation

There is not a lot said about the differences between Classical vs. Bayesian
approaches to correlation and regression on this page, mainly because this is an
area where Bayesians don’t have many snarky things to say about it. Bayesians
may take issue with the assumptions or with the way that p-values are
calculated – and both of those potential problems are addressed in the Bayesian
regression approach discussed below – but they are pretty much down with the
concept itself.

A recently developed approach6 involves taking the Kendall τ concept of concordant pairs and discordant pairs and treating the two pair possibilities as one would successes and failures in a binomial context. Using the number of concordant pairs T_c and the number of discordant pairs T_d as s and f, we can analyze the concordance parameter ϕ_c the same way we analyze the probability π of s: with a β distribution. The ϕ_c parameter is related to Kendall's τ:

ϕ_c = (τ + 1) / 2

and Figure 7.4 shows how the β distribution of ϕ_c relates to scatterplots of data
described by various values of τ.

Figure 7.4: τ̂ Values (n = 20) and Corresponding Posterior β Distributions of ϕ_c
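
As a rough sketch of how that works in practice – assuming a flat β(1, 1) prior on ϕ_c, which is my guess at a sensible default rather than anything prescribed above – the concordant and discordant pair counts from our example data give:

nc<-39; nd<-6 ## concordant and discordant pairs in the example data
curve(dbeta(x, nc+1, nd+1), from=0, to=1,
      xlab=expression(phi[c]), ylab="Posterior density")
2*((nc+1)/(nc+nd+2))-1 ## posterior mean mapped back to the tau scale, ~0.70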
There is a slight problem with the τ statistic in the case where values are tied
(either within a variable or if multiple pairs are tied with each other). The
difference is small, but for a more accurate correction for ties, see the bonus
content at the end of this page.

7.2.3.3 Goodman and Kruskal’s γ

Imagine that we wanted to compare the letter grades received by a group of 50


high school students in two of their different classes. Here is a cross-tabulation
of their grades:

Algebra Grade
A B C D F
Social Studies Grade A 6 2 1 2 0
Social Studies Grade B 2 6 2 1 1
Social Studies Grade C 1 1 7 1 1
Social Studies Grade D 1 1 0 6 3
Social Studies Grade F 0 0 0 0 5

Goodman and Kruskal’s γ correlation allows us to make this comparison. The
essential tenet of the γ correlation is this: if there were a perfect, positive
correlation between the two categories, then all of the students would be
represented on the diagonal of the table (everybody who got an “A” in one
class would get an “A” in the other, everybody who got a “B” in one class
would get a “B” in the other, etc.). If there were a perfect, inverse correlation,
then all of the students would be represented on the reverse diagonal. The γ,
which like the other correlation coefficients discussed here ranges from −1 to
1, is a measure of the extent to which the categories agree. That means that the γ can
be used for categorical data, but also for numbers that can be placed into
categories (like the way that ranges of numerical grades can be converted into
categories of letter grades).

GoodmanKruskalGamma(table(social.studies, algebra), conf.level = 0.95) ## from the DescTools package

## gamma lwr.ci upr.ci
## 0.6769596 0.4685468 0.8853724
7.2.4 Nonlinear correlation

One of the assumptions of the r correlation is that there is a linear relationship
between the variables. If there is a non-linear relationship between the
variables, one option is to use a nonparametric correlation like ρ or τ7. If there
is a clear pattern to the non-linearity, one might also consider transforming one
of the variables until the relationship is linear, calculating r, and then
transforming the variable back.

As an example, the data represented in Figure 7.5 are the same example data
that we have been using throughout this page, except that the y variable has been
exponentiated: that is, we have raised the number e to the power of each value
of y.

Figure 7.5: Scatterplot of the Sample Data x and e^y

We can still calculate r for these data despite the relationship now looking
vaguely non-linear: the correlation between x and e^y is 0.86. However, that
obscures the relationship between the variables.8 In this case, we can take the
natural logarithm ln(e^y) to get y back, and then calculate an r of 0.92 as we did
before. We would then report the nonlinear correlation r, or report r as the
correlation between one variable and the transformation of the other.

The only thing that we can’t do is transform a variable, calculate r, and then
report r without disclosing the transformation. That’s just cheating.
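
As a quick sketch of that disclosure-friendly workflow (assuming x and y are still the example vectors from above):

ey<-exp(y)      ## the exponentiated variable from Figure 7.5
cor(x, ey)      ## ~0.86: attenuated by the non-linearity
cor(x, log(ey)) ## ~0.92: log(e^y) recovers y, and r along with it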

There are, of course, whole classes of variables for which we use
transformations, usually in the context of regression. To name a few examples
from the unit on probability distributions: when the y variable is distributed as a
logistic distribution, we transform the y variable in the procedure known as
logistic regression; when the y variable is distributed as a Poisson distribution,
we use Poisson regression; and when the y variable is distributed as a β
distribution, we use beta regression. That’s one area of statistics where the
names really are pretty helpful.

A couple of caveats before we move on: some patterns aren’t as easy to spot as
logarithms (or, squares or sine waves), and most nonlinear correlations are
better handled by multiple regression (sadly, beyond the scope of this course).

7.3 Regression
Correlation is an indicator of a relationship between variables. Regression
builds on correlation to create predictive models. If x is our predictor variable
and y is our predicted variable, then we can use the correlation between the
observed values of x and y to make a model that predicts unobserved values of y
based on new values of x.

7.3.1 The Least-Squares Regression Line

Our regression model is summarized by the least-squares regression line. The
least-squares line is the function out of all possible functions of its type that
results in the smallest squared distance between each observed (x, y) pair and
the function itself. In visual terms, it is the line that minimizes the sum of the
squared distances between each point and the line: move the line in any
direction and the sum of the squared distances would increase. There are two
principal ways of expressing the least-squares regression line: the standardized
regression equation and the raw-score regression equation.
7.3.1.1 Standardized Regression Equation

The standardized regression equation describes the relationship between the
standardized values of x and the standardized values of y. The standardized
values of the variables are the z-scores of the variables: thus, the standardized
regression equation is also called the z-score form of the regression equation.
The standardized regression equation indicates that the estimated z-score of the
y variable is equal to the product of r and the z-score of the x variable:

ẑ_y = r·z_x

Figure 7.6 is a scatterplot of the z-scores for the x and y variables in the
example introduced above. The blue line is the least-squares regression line in
standardized (z-score) form. Please note that when z_x = 0, then r·z_x = ẑ_y = 0,
so the line – as do all lines made by standardized regression equations – passes
through the origin (0, 0).

Figure 7.6: Scatterplot of z_x and z_y

7.3.1.2 Raw-score regression equation

Technically, the standardized regression equation would be all you would need
to model the relationship between two variables – if you knew the mean and
standard deviation of each variable and had the time to derive x and y values
from z_x and z_y values. But, the combination of those statistics and the time
and/or patience to convert back-and-forth from standardized scores are rarely
available, especially when consuming reports on somebody else’s research. It
is far more meaningful (and convenient) to put the standardized equation into a
form where changes in the plain old raw x variable correspond to changes in
the plain old raw y variable. That format is known as the raw-score form of the
regression equation, and it’s almost always the kind of regression formula one
would actually encounter in published work.

The form of the raw-score equation is:

ŷ = ax + b

where a is the slope of the raw-score form of the least-squares regression line
and b is the y-intercept of the line (that is, the value of the line when x = 0 and
the line intersects with the y-axis).

We get the raw-score equation by converting the z-score form of the equation.
Wait, actually, 99.9% of the time we get the raw-score equation using software
in the first place, but if we did need to calculate the raw-score equation by
hand9, we would first find r, which would give us the standardized equation
ẑ_y = r·z_x, which we would then convert to the raw-score form via the
following steps:

1. Find the intercept b

The intercept (of any equation in the Cartesian Plane) is the point at which a
line intersects with the y-axis, meaning that it is also the point on a line where
x = 0. Thus, our first job is to find what the value of y is when x = 0. We don’t
yet have a way to relate x = 0 to any particular y value, but we do have a way
to connect z_x to z_y: the standardized regression equation ẑ_y = r·z_x.

First, we can find the value of z_x for x = 0 using the z-score equation
z_x = (x − x̄)/sd_x and plugging 0 in for x. Using our sample data, x̄ = 5.026 and
sd_x = 0.8175193, thus:

z_x = (0 − 5.026)/0.8175193 = −6.147867

From the standardized regression equation, we know that ẑ_y = r·z_x, so the value
of ẑ_y when x = 0 is:

ẑ_y = r·z_x = (0.9163524)(−6.147867) = −5.633613

And then, using the z-score formula again and knowing that the mean of y is
2.12 and the sd of y is 1.0794958, we can solve for y:

−5.633613 = (y − 2.12)/1.079496

y = −3.961463

We know then that the raw-score least-squares regression line passes through
the point (0, −3.961463), so the intercept b = −3.961463.

2. Find the slope a

It takes two points to find the slope of a line. We already have one – the y-
intercept (0, −3.961463) – and any other one will do. I prefer finding the point
of the line where x = 1, just because it makes one step of the math marginally
easier down the line.

Using the same procedure as above, when x = 1, z_x = −4.924654,
ẑ_y = −4.512719, and y = −2.751461. The slope of the line is the change in y
divided by the change in x (Δy/Δx):

a = Δy/Δx = (−2.751461 − (−3.961463))/(1 − 0) = 1.210002

Thus, our least-squares regression equation in raw-score form is (after a bit of
rounding):

ŷ = 1.21x − 3.9615

Figure 7.7: Scatterplot of x and y Featuring Our Least-Squares Regression Line
in Raw-Score Form

But, if all of that algebra isn’t quite your cup of tea, this method is quicker:

model<-lm(y~x) ## "lm" stands for Linear Model


summary(model)

##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80264 -0.22414 0.00656 0.33794 0.63366
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.9615 0.9505 -4.168 0.003133 **
## x 1.2100 0.1869 6.474 0.000193 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' '
1
##
## Residual standard error: 0.4584 on 8 degrees of freedom
## Multiple R-squared: 0.8397, Adjusted R-squared: 0.8197
## F-statistic: 41.91 on 1 and 8 DF, p-value: 0.0001934

7.3.1.2.1 A note on the hat that y is wearing

The jaunty cap sported by ŷ and ẑ_y in this section indicates that they are
estimates. r·z_x does not equal exactly z_y, nor does ax + b equal exactly y. If
they did, then all of the points on the scatterplot would be right on the regression
line. There is always error in prediction, hence, the hat ˆ.

7.3.2 The Proportionate Reduction in Error (R²)

Once you have r, R² is pretty easy to calculate:

R² = r²

R² as a statistic has special value. It is known – in stats classes only, but known
– as the proportionate reduction in error. That title means that R² represents
the predictive advantage conferred by using the model to predict values of the y
variable over using the mean of y (ȳ) to make predictions.

In model prediction terms, R² is the proportion of the variance in y that is
explained by the model. Recall that the least-squares regression line minimizes
the squared distance between the points in a scatterplot and the line. The line
represents predictions and the distance represents the prediction error (usually
just called the error). The smaller the total squared distance between the points
and the line, the lower the error, and the greater the value of R².

Mathematically – in addition to being the square of the r statistic – R² is given
by:

R² = (SS_total − SS_error)/SS_total

where SS_total is the sum of the squared deviations in the y variable:

∑(y − ȳ)²

and SS_error is the sum of the squared distances between the y value predicted
by the model at each point x and the y value observed at each x:

∑(y_pred − y_obs)²

The table below lists the squared deviations of y, the predictions
y_pred = ax + b, and the squared prediction errors10 (y_pred − y_obs)²:

Obs x y (y − ȳ)² y_pred (y_pred − y_obs)²

1 4.18 1.73 0.1521 1.0963 0.4015757


2 4.70 1.86 0.0676 1.7255 0.0180902
3 6.43 3.61 2.2201 3.8188 0.0435974
4 5.04 2.09 0.0009 2.1369 0.0021996
5 5.25 2.00 0.0144 2.3910 0.1528810
6 6.27 4.07 3.8025 3.6252 0.1978470
7 5.34 2.56 0.1936 2.4999 0.0036120
8 4.21 0.33 3.2041 1.1326 0.6441668
9 4.17 1.49 0.3969 1.0842 0.1646736
10 4.67 1.46 0.4356 1.6892 0.0525326

The sum of the column (y − ȳ)² gives the SS_total, which is 10.4878. The sum
of the column (y_pred − y_obs)² gives us the SS_error, which is 1.6811761. Thus,
the proportionate reduction of error – better known to like everybody on Earth
as R² – is:

R² = (SS_total − SS_error)/SS_total = 0.8397017

And if we take the square root of 0.8397017, we get the now-hopefully-familiar
r value of 0.9163524.
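
If you’d rather let R do the bookkeeping, the same quantities fall out of the fitted model object (a quick sketch, assuming model is the lm() fit from above):

ss.total<-sum((y-mean(y))^2)      ## 10.4878
ss.error<-sum(residuals(model)^2) ## 1.6811761
(ss.total-ss.error)/ss.total      ## 0.8397017, matching summary(model)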

7.4 Bonus Content

7.4.1 Correcting the Kendall’s τ Software Estimate

As a general mathematical rule, there are n(n − 1)/2 possible comparisons
between n pairs of numbers.

A comparison between any two pairs of observations is considered
concordant-order if and only if the difference in the ranks of the corresponding
variables in each pair is of the same sign.

A comparison between any two pairs of observations is considered discordant-
order if and only if the difference in the ranks of the corresponding variables in
each pair is of the opposite sign.

We will call the number of concordant-order comparisons of a given dataset n_c
and the number of discordant-order comparisons n_d.

7.4.1.0.1 Calculation of τ̂ Without Ties

If there are no ties among the x values nor among the y values in the data, we
can calculate the sample Kendall τ̂ statistic using the following formula:

τ̂ = 2(n_c − n_d)/(n(n − 1))

7.4.1.0.2 With Ties

When making pairwise rank comparisons, ties in the data reduce the number of
valid comparisons that can be made. For the purposes of calculating τ̂ when
there are tied data, there are three types of ties to consider:

1. x ties: tied values for the x variable

2. y ties: tied values for the y variable

3. xy ties: pairs where the x observation and the y observation are each
members of a tie cluster (that is, the pair is doubly tied)

For example, please consider the following data (note: this dataset has been
created so that the x values, the y values, and the alphabetical pair labels are
all in generally ascending order for simplicity of presentation: that does not
need to be the case with observed data):
pair x y rank x rank y

A 1 11 1.0 1.0
B 2 12 3.0 2.0
C 2 13 3.0 3.5
D 2 13 3.0 3.5
E 3 14 5.0 5.0
F 4 15 6.0 6.0
G 5 16 7.5 7.0
H 5 17 7.5 8.0
I 6 18 9.0 9.5
J 7 18 10.0 9.5

Let us consider, in turn, the x, the y, and the xy ties:

1. x ties: In the example data, there are two tied-x clusters. One tied-x
cluster comprises the x observations of pair B, pair C, and pair D: each
pair has an x value of 2, and those three x values are tied for 3rd lowest of
the x set. The other tie cluster comprises the x observations of pair G and
pair H, and those x values are tied for 7.5th lowest. Thus, there are two x
tie clusters: one of 3 members and one of 2 members. Let t_xi denote the
number of tied-x members of each cluster i and let m_x denote the
number of tied-x clusters. We will then define a variable T_X to denote the
number of comparisons that are removed from the total possible
comparisons used to calculate the τ̂ statistic as such:

T_X = Σ_{i=1}^{m_x} t_xi(t_xi − 1)/2

For the example data:

T_X = 3(3 − 1)/2 + 2(2 − 1)/2 = 4

2. y ties: In the example data, there are also two tied-y clusters. One tied-y
cluster comprises the y observations of pair C and pair D: each has a y
value of 13, and those two y values are tied for 3.5th lowest of the y set.
The other tie cluster comprises the y observations of pair I and pair J, each
with an observed y value of 18 and tied for 9.5th lowest position. Thus,
there are two y tie clusters, each with 2 members. Let t_yi denote the
number of tied-y members of each cluster i and let m_y denote the
number of tied-y clusters. We will then define a variable T_Y to denote the
number of comparisons that are removed from the total possible
comparisons used to calculate the τ̂ statistic as such:

T_Y = Σ_{i=1}^{m_y} t_yi(t_yi − 1)/2

For the example data:

T_Y = 2(2 − 1)/2 + 2(2 − 1)/2 = 2

3. xy ties: There are two data pairs in the example data that share both the
same x rank and the same y rank – pair C and pair D – and those pairs are
said to be doubly tied, or to constitute an xy cluster. Therefore, for this dataset
there is one xy cluster with two members – the x and y values of pair C
and the x and y values of pair D. Let t_xyi denote the number of members
of each xy cluster i and let m_xy denote the number of xy clusters. We
will then define a variable T_XY to denote the number of comparisons that
are removed from the total possible comparisons used to calculate the τ̂
statistic as such:

T_XY = Σ_{i=1}^{m_xy} t_xyi(t_xyi − 1)/2

For the example data:

T_XY = 2(2 − 1)/2 = 1

Earlier, we noted that the maximum number of comparisons between pairs of
data – which we can denote n_max – is equal to n(n − 1)/2 (where n equals the
number of pairs). In the cases of Example 1, Example 2, and Example 3, there
were no x, y, or xy ties, and so that relationship gave the value for the
maximum number of comparisons that was used to calculate τ̂. Generally, the
equation to find n_max – whether or not T_X = 0, T_Y = 0, and/or T_XY = 0 – is the
following:

n_max = n(n − 1)/2 − T_X − T_Y + T_XY

The above equation removes the number of possible comparisons lost to ties by
subtracting T_X and T_Y from the total possible number n(n − 1)/2, and then adjusts
for those ties that were double-counted in the process by re-adding T_XY.

With the number of available comparisons now adjusted for the possible
presence of ties, we may now re-present the equation for the τ̂ statistic as:

τ̂ = (n_c − n_d)/n_max

7.4.1.0.3 τ̂ for larger data sets (τ-b correction)

To this point, the example datasets we have examined have been so small that
identifying and counting concordant-order comparisons, discordant-order
comparisons, tie clusters, and the numbers of members of tie clusters has been
relatively easy to do. Calculating the Kendall τ̂ with larger datasets invites a
software-based solution, but there is a problem specific to the τ̂ calculation that
can lead to inaccurate estimates when there are tied data in a set.

That problem stems from the specific correction for ties proposed by Kendall
(1945), following Daniels (1944), in an equation for a correlation known as the
Kendall tau-b correlation τ̂_b:

τ̂_b = (n_c − n_d)/√((n(n − 1)/2 − T_X)(n(n − 1)/2 − T_Y))

In addition to being endorsed by Kendall himself, this equation has been
adopted, for example, in R (as well as SAS, Stata, and SPSS), which brings us
to our software problem. If there are no ties in a dataset, then the τ̂_b
denominator simplifies to √((n(n − 1)/2)(n(n − 1)/2)) = n(n − 1)/2, which is the same
denominator as in the calculation of τ̂ when there are no ties. However, when
working in R, a correlation test (cor.test()) with the Kendall option
(method="kendall") will return τ̂_b instead of the desired τ̂.

If the Kendall correlation statistic is calculated using a program – such as R –
that is programmed to return the τ̂_b statistic rather than the τ̂ statistic, then the τ̂
statistic can be recovered by multiplying by the denominator of the τ̂_b
calculation and dividing by the denominator of the proper τ̂ calculation (in other
words: multiply by the wrong denominator to get rid of it and then divide by the
correct denominator):

τ̂ = τ̂_b·√((n(n − 1)/2 − T_X)(n(n − 1)/2 − T_Y)) / (n(n − 1)/2 − T_X − T_Y + T_XY)

Given that we may be working with larger datasets in using this correction, it
may seem that we have traded one problem for another: the above equation may
give us the desired estimate of τ̂, but how do we find T_X, T_Y, T_XY, n_c, and n_d
without using complex, tedious, and likely error-prone counting procedures?

The process of counting ties can be handled with coding conditional logic.
Alternatively, the R language has several commands in its base package that can
be exploited for that purpose. One way to do this is using the table()
command. The table() command in R, applied to an array, will return a table
with each of the observed values as the header (the “names”) and the frequency
of each of those values. For example, given an array x with values [5, 5, 5, 8, 8,
13], table(x) will return:

x<-c(5, 5, 5, 8, 8, 13)
table(x)

## x
## 5 8 13
## 3 2 1

We can restrict the table to a subset with only those members of the array that
appear more than once:

table(x)[table(x)>1]

## x
## 5 8
## 3 2
And then, we can remove the header row with the unname() command, leaving
us with an array of the sizes of each tie cluster in the original array x:

unname(table(x)[table(x)>1])

## [1] 3 2

We then can use the count of values in that resulting array to represent the
number of tie clusters in the original array and the values in that resulting array
to represent the number of members in each tie cluster in the original array, a
procedure we can use to calculate T_X and T_Y.

To find doubly-tied values in a pair array – as we need to calculate T_XY – only
a minor modification is needed. We can take two arrays (for example, x and y),
pair them into a dataframe, and then the rest of the procedure is the same. Thus,
given the data from Example 4:

x<-c(1, 2, 2, 2, 3, 4, 5, 5, 6, 7)
y<-c(11, 12, 13, 13, 14, 15, 16, 17, 18, 18)
xy<-data.frame(x,y)
unname(table(xy)[table(xy)>1])

## [1] 2

we reach the same conclusion as we did when counting by hand: there is one xy
cluster with two members in this dataset.

However we count the numbers of tie clusters and the count of members in each
tie cluster, once T_X, T_Y, and T_XY are determined, there are algebraic means
for calculating n_c and n_d.

The value of n_c follows from algebraic rearrangement of the above equations:

n_c = n(n − 1)/4 − T_X/2 − T_Y/2 + T_XY/2 + (τ̂_b/2)·√((n(n − 1)/2 − T_X)(n(n − 1)/2 − T_Y))

The value of n_d is then found as the difference between n_max and n_c:

n_d = n_max − n_c
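
Putting all of the pieces together for the Example 4 data – a sketch, assuming (as described above) that cor.test() returns τ̂_b when there are ties:

x<-c(1, 2, 2, 2, 3, 4, 5, 5, 6, 7)
y<-c(11, 12, 13, 13, 14, 15, 16, 17, 18, 18)
n<-length(x)
npairs<-n*(n-1)/2
Tcount<-function(v){ ## sum of t(t-1)/2 over the tie clusters of v
  t<-unname(table(v)[table(v)>1])
  sum(t*(t-1)/2)
}
TX<-Tcount(x)            ## 4
TY<-Tcount(y)            ## 2
TXY<-Tcount(paste(x, y)) ## 1
n.max<-npairs-TX-TY+TXY
tau.b<-unname(cor.test(x, y, method="kendall")$estimate)
nc<-npairs/2-TX/2-TY/2+TXY/2+tau.b*sqrt((npairs-TX)*(npairs-TY))/2
nd<-n.max-nc
(nc-nd)/n.max ## the corrected tau-hat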
1. Cohen, J. (2013). Statistical power analysis for the behavioral sciences.
Academic press.↩

2. Normally we would use a Greek letter to indicate a population parameter


when discussing null and alternative hypotheses, but the letter ρ is going to
be used for something else later on.↩

3. And, honestly, not that concerned about↩

4. The binomial test used as the Classical example of Classical and Bayesian
Inference is an example of a nonparametric test. In that test, the patterns
being analyzed were the number of observed successes or more in the
number of trials given the assumption that success and failure were equally
likely. In the Classical approach to that example, we were not concerned
with parameters like the mean number of successes or the standard
deviation of the number of successes in the population.↩

5. The p-values are different, though, as they are based on permutations. For
significance, use the value given by software or feel free to consult this
old-ass table↩

6. Full disclosure: I’m an author on a manuscript under review as of 10/2020


describing this approach. Also, by “recently developed,” I mean like
“working on it right now.”↩

7. Nonparametric alternatives are usually a good choice when the


assumptions of classical parametric tests are violated.↩

8. This is a case where violating assumptions makes type-II errors more


likely: the r value is smaller than it would be otherwise, increasing the
chances of missing a real effect.↩

9. Calculating a raw-score regression equation by hand isn’t quite as


ridiculous as I’m making it seem. If you’re ever in a situation where you
need to work with models that aren’t part of software packages (for
example, this one), it’s really helpful to know these steps.↩

10. Note: prediction errors are commonly referred to as residuals.↩


8 Signal Detection Theory
8.1 Detecting Signals
Have you ever taken a hearing test? If you haven’t – or if it’s been a while
since you have – here’s basically how it goes: you wear big over-the-ear
headphones and you hear some fuzzy noise. Every once in a while, a tone is
played amidst the noise. Your instructions are to indicate when you hear the
tone and/or in which ear you hear it.

Here’s a related example: radar operators have to look at radar screens and
be able to identify little lights that represent things like airplanes among other
little lights that represent things like rainstorms.

Figure 8.1: Signal Detection in Space

The tones in a hearing test and the lights that indicate aircraft on a radar
screen are known as signals. The fuzzy sounds in a hearing test and the
atmospheric events on a radar screen are known as noise.

It would be perfectly reasonable to think that understanding when people
would be able to hear a tone or see a light in noisy situations is pretty easy: if
the tone is louder than the noise, people should be able to hear and identify it;
if the light is brighter than the noise, people should be able to see and identify
it. But, that would be wrong.1 Human psychology – the way we perceive
things, understand them, evaluate them, and make decisions based on them – is
much more than a series of direct comparisons of the relative amplitudes of
sound or light or other stimuli.

Here’s an example of how factors other than relative strengths of signal and
noise can influence whether or not we see – or at least report seeing – signals
in noise. Let’s say I am an experimenter and you are a signal-detecting
participant and we run two experiments with monetary rewards associated
with them. In each experiment, I give you a signal detection task: it could be
hearing tones, seeing lights, feeling temperature changes, or any other kind of
perceptual thing of your choosing (it doesn’t matter to me – it can be whatever
helps you understand the example). In Experiment 1, I will give you 1 US
Dollar for every time you correctly identify a signal with no penalty for being
wrong, and you could earn up to $20 if you get them all. In Experiment 2, I
will give you $20 at the start and take away $1 every time you incorrectly
identify a signal with no bonus for being right. Far be it from me to assume
how you would behave in each of those experiments, but if I had to guess, I
would imagine that you might be more likely to identify signals in Experiment
1 than you would be in Experiment 2. It would only be natural to risk being
wrong more easily in Experiment 1 and to be more cautious about being
wrong in Experiment 2. From a strictly monetary-reward-maximizing
perspective, the best strategy would be to say that you are seeing signals all
the time in Experiment 1 and to say that you are never seeing signals in
Experiment 2. Of course, those are two extreme examples, but we could – and
as we’ll explore later in this chapter, do – tweak the schedule of rewards and
penalties so that the decisions to make are more difficult.

Signal Detection Theory is a framework for understanding how people make


decisions about perceiving signals amid noise.

8.1.1 Hits, Misses, False Alarms, and Correct Rejections


In a signal-detection framework – in literal signal-detection tasks like hearing
tests and radar, and in metaphorical signal-detection tasks – there are two
possible decisions that a person performing a test – known as an operator –
can make at any point in the task:
1. The signal is present
2. The signal is absent

The contingency table below describes the possible outcomes of a signal-
detection test. If the signal truly is present, an operator can either correctly
identify it – a hit – or not identify it – a miss. If the signal truly is absent, an
operator can incorrectly say it is there – a false alarm – or correctly not say it
is there (or say it is not there) – a correct rejection.

Decision<-rep("Operator Response", 2)
rejectaccept<-c("Signal Present", "Signal Absent")
Yes<-c("Hit", "Miss")
No<-c("False Alarm", "Correct Rejection")

kable(data.frame(Decision, rejectaccept, Yes, No), "html", esc


kable_styling() %>%
add_header_above(c(" "=2, "Is the Target There?"=2)) %>%
collapse_rows(1)

Is the Target There?


Yes No
Operator Response Signal Present Hit False Alarm
Operator Response Signal Absent Miss Correct Rejection

8.2 The Signal Detection Metaphor

Since signal detection theory emerged in the psychophysics literature in the
years following World War II2, the framework has been used metaphorically
to model choices under different conditions. Medical diagnosis is a natural fit
for the framework: a medical condition can be either present or absent and a
diagnostician can either make a diagnosis or not. We can adapt the SDT
contingency table for medical diagnosis thusly:

Decision<-rep("Diagnosed", 2)
rejectaccept<-c("Yes", "No")
Yes<-c("True Positive", "Miss")
No<-c("False Alarm", "True Negative")

kable(data.frame(Decision, rejectaccept, Yes, No), "html", escape=FALSE) %>%
  kable_styling() %>%
  add_header_above(c(" "=2, "Do They Have the Condition?"=2)) %>%
  collapse_rows(1)

Do They Have the Condition?

Yes No
Diagnosed Yes True Positive False Alarm
Diagnosed No Miss True Negative

Another application of the SDT framework to a decision process is one we


have already encountered: classical null hypothesis testing. In null hypothesis
testing, an effect (either a relationship as in correlation or a difference as in a
difference between condition means) can be either present or absent at the
population level. The classical analyst must decide, based on the cumulative
likelihood of the data given the null hypothesis, whether to reject the null or to
continue to assume the null:

8.2.0.1 Classical Hypothesis Testing

Decision<-rep("Decision", 2)
rejectaccept<-c("Reject $H_0$", "Continue to Assume $H_0$")
Yes<-c("Correct Rejection of $H_0$", "Miss")
No<-c("False Alarm", "$H_0$ Correctly Assumed")

kable(data.frame(Decision, rejectaccept, Yes, No), "html", escape=FALSE) %>%
  kable_styling() %>%
  add_header_above(c(" "=2, "Real Effect?"=2)) %>%
  collapse_rows(1)

Real Effect?

Yes No
Decision Reject H0 Correct Rejection of H0 False Alarm
Decision Continue to Assume H0 Miss H0 Correctly Assumed

Null hypothesis testing is a particularly apt example for understanding signal


detection because it presents a case where we know there either is a signal –
in this case a population-level effect – or there is not, and we know that we
won’t catch the signal all the time (the type-II error) and that sometimes we
will commit false alarms (the type-I error).

The SDT metaphor as applied to null hypothesis testing – and statistical
analysis in general – has been extended so that any real pattern in data can be
considered signal and anything in the data that prevents seeing those patterns
can be considered noise.3 Please consider these two scatterplots (reproduced
from correlation and regression), each representing a sample of n = 100 pairs of
(x, y) observations:

set.seed(123)
n <- 100
mu <- c(x = 0, y = 0)
Rstrong <- matrix(c(1, 0.95,
                    0.95, 1),
                  nrow = 2, ncol = 2)

Rweak <- matrix(c(1, 0.15,
                  0.15, 1),
                nrow = 2, ncol = 2)

strong.df<-data.frame(mvrnorm(n, mu=mu, Sigma=Rstrong)) ## mvrnorm() is from the MASS package

weak.df<-data.frame(mvrnorm(n, mu=mu, Sigma=Rweak))

rstrongpos<-round(cor.test(strong.df$x, strong.df$y, method="pearson")$estimate, 2)

rweakpos<-round(cor.test(weak.df$x, weak.df$y, method="pearson")$estimate, 2)

strongpos<-ggplot(strong.df, aes(x, y))+
  geom_point()+
  theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+ ## theme_tufte() is from ggthemes
  theme(axis.text=element_blank())+
  labs(x=expression(italic(x)), y=expression(italic(y)), title="A")+
  geom_smooth(method="lm", se=FALSE)

weakpos<-ggplot(weak.df, aes(x, y))+
  geom_point()+
  theme_tufte(base_size=12, base_family="sans", ticks=FALSE)+
  theme(axis.text=element_blank())+
  labs(x=expression(italic(x)), y=expression(italic(y)), title="B")+
  geom_smooth(method="lm", se=FALSE)

plot_grid(strongpos, weakpos, nrow=1) ## plot_grid() is from cowplot

Figure 8.2: Scatterplots of Data Sampled from Bivariate Normals with


r = 0.95 (left) and with r = 0.15 (right)

The (x, y) pairs with population-level correlations of 0.95 and 0.15 led to
samples that have correlations of 0.94 and 0.06, respectively. The former
correlation is statistically significant at the α = 0.05 level (p < 0.001); the
latter is not (p = 0.5423). In the signal-detection metaphor, the models
indicated by the least-squares regression lines represent the signal and the
distances between the lines and the dots represent the noise.

Please recall from correlation and regression that R² is the proportion of
variance explained by the model. The proportion of the variance that is not
explained by the model – that is, 1 − R² – is the error. In the SDT framework,
the error is noise.

Since R² = 0.944² = 0.891, the model in part A of Figure 8.2 explains
89.1% of the variance in the y data. That’s an enormous effect. In SDT
terms, it is a signal so strong that it would be visible despite the strength of
any noise (or: when the correlation is that strong on the population level, it is
nearly impossible to sample 100 observations at random that wouldn’t lead to
rejecting the null hypothesis). The error associated with the model is
1 − R² = 0.109. If we take the error to be the noise, then the signal-to-noise
ratio is4 nearly 9:1. The model in part B of Figure 8.2, by contrast, represents
an R² of 0.062² = 0.0038, therefore explaining less than four-tenths of one
percent of the variance. Thus, 99.6% of the observed variance is explained by
error – or noise – and the signal-to-noise ratio is about 4:996. In SDT terms,
that is a signal that – while real on the population level – is swallowed by the
noise.

8.2.1 Limits of the Signal-Detection Metaphor

Finding signals among noise is a convenient metaphor for a lot of processes


involving decisions. For some applications, it may be a little too convenient.
Take, for example, please, the application to memory research. Memory
would seem to make sense as a candidate to be described by signal detection:
memory experiments often include presenting stimuli and then, some time
later, asking participants if a stimulus has been shown before or if it is new. In
such a paradigm, recognizing an old stimulus is analogous to a hit, not
recognizing an old stimulus is analogous to a miss, declaring a new stimulus
to be old is like a false alarm, and calling a new stimulus new is like a
correct rejection. The SDT metaphor is commonly found in memory theories.
However, actual, non-metaphorical signal detection is based on the
perception of the relative strength of lights and sounds; human memories are
much more complex. When we encode memories, we encode far more than
we attend to: contexts, sources, ambient stimuli, and more. Remembering is
also not a binary decision: we can remember parts of events and
misremember events in part or in total. We frequently report remembering
events that never happened, both as individuals and in our collective
memories.

So where in a signal-detection-style contingency table would a false memory,
or a partially-remembered item, or the source of a memory be placed? That’s
not clear to me, and I think it’s because memory and other phenomena – a
much longer screed could be written about applying the signal-detection
framework to interpersonal interactions, for example – are too complex to be
explained by a framework of decisions based on the one dimension of relative
stimulus strength. Signal detection theory is great for what it does. In the case
of more complex phenomena, the signal-detection model is oversimplistic and
based on reductive assumptions but gives the impression of scientific rigor
and is well-liked by a shocking number of people, just like reruns of The Big
Bang Theory.

8.2.2 Signal + Noise and Noise Distributions

The central tenet of signal detection theory is that the decisions that are made
by operators under different conditions are all products of underlying strength
distributions of the signal and of the noise. The strength of a signal and the
strength of noise occupy more than single points in an operator’s mind: if they
did, detecting signals would be deterministic, and operators would always
choose based on the stronger of the signal and the noise. We know that
operators don’t behave like that, and so the basic theory is that the strength of
the signal and the strength of the noise are represented by distributions. We
assume that both distributions are normal distributions, as depicted in Figure
8.3. And, we assume that noise is always present, so the distribution usually
referred to as the signal distribution is sometimes (and technically more
theory-aligned) referred to as the signal plus noise distribution.

Figure 8.3: Illustration of Noise and Signal + Noise Distributions

We don’t really know what the x and y values are for the signal distribution
and the noise distribution and it honestly doesn’t matter. All we are
interested in is the shapes of the curves relative to each other so that we can
learn about how people make decisions based on their relative perceived
strength. Since the placement doesn’t matter, we can set one of those curves
wherever we want and then measure the other curve relative to the set one.
And since we can set either one of the curves wherever we want, to make our
mathematical lives easier, it is a very good idea to set the noise distribution to
have a mean of 0 and a standard deviation of 1. In other words, we assume
that the perception of noise follows a standard normal distribution.5
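
A quick sketch of those two assumed curves (the d′ of 2 here is an arbitrary choice for illustration, not a value from any dataset in this chapter):

curve(dnorm(x, mean=0, sd=1), from=-4, to=6,
      xlab="Perceived strength", ylab="Density") ## noise: standard normal
curve(dnorm(x, mean=2, sd=1), add=TRUE, lty=2)   ## signal + noise, d-prime = 2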
8.3 Distinguishing Signal from Noise
The point of signal detection theory is to understand the underlying perception
of signal strength with respect to noise and how operators make decisions
given that perception. Of all of the measurements that come out of signal
detection frameworks, three statistics are most frequently used: d′, which
measures the discriminability between signal and noise – in terms of the
curves, it’s the distance between the peak of the noise curve and the peak of
the signal (plus noise) curve; β, which measures the response bias – the tendency
to say yes or no at any point; and the C-statistic (the “C” in “C-statistic” is
the only statistical symbol that is not abbreviated in APA format and I have
never found a good explanation for why that is), which measures the
predictive ability of an operator by taking into account both true positives and
false alarms.

Of those three, the β statistic is the least frequently used – it’s more in the
bailiwick of hardcore SDT devotees – but we’ll talk about it anyway. It will
give us perhaps our only opportunity in this course to use probability density
as a meaningful measurement, just like the last sentence gave me perhaps my
only opportunity to use the word “bailiwick” in a stats reading.

8.3.1 d′

As noted above, d′ is a measure of the discriminability between noise and
signal, and represents the difference between the peaks of the noise and the
signal plus noise distributions. Figure 8.4 illustrates the d′ statistic in the
context of the two curves.

Figure 8.4: Illustration of d′

It’s relatively easy to pick out a signal when it is on average much stronger
than the noise. In a hearing test, it’s easy to pick out the tones if those tones are
consistently much louder than the background noise; on a radar screen, it’s
easier to pick out the planes when those lights are consistently much brighter
than the atmospheric noise. In terms of the visual display of the underlying
signal and noise distributions, that sort of situation would be represented by a
signal distribution curve with its center further to the right on the strength axis
than the center of the noise distribution.
Figure 8.5: Relatively Large and Small Values of d′

The d′ statistic has no theoretical upper limit and is theoretically always
positive. However, it’s pretty much impossible to see a d′ value much greater
than 3: since the standard deviation of the noise curve is assumed to be 1, a d′
of about 3 indicates as much of a difference as can be determined (whether the
peaks of the distributions are 3 standard deviations apart or 30, no overlap is
no overlap). It’s also possible but really unlikely to observe a negative d′. If
the perception of noise and the perception of signal plus noise are
approximately equal – meaning that the contribution of signal to the signal
plus noise curve is approximately nothing – then responses are essentially
random, and it is possible that sample data by chance could indicate that the
peak of the signal plus noise curve lives to the left of the noise curve.

There is one scenario that would consistently produce a negative value of d′.
In the 2018 film Spider-Man: Into the Spider-Verse, high school student (and
budding Ultimate Spider-Man) Miles Morales intentionally answers every
question on a 100 true-false question test incorrectly.
Figure 8.6: Absolute Masterpiece

As Miles’s teacher points out, somebody with no knowledge of the test


material would have an expected score of 50%, which she takes as evidence
that Miles had to know all of the answers in order to get each one wrong.
Anyway, this is the main, non-chance reason for a negative d′: basically,
somebody is either intentionally messing with the experiment or they have
the buttons mixed up.

8.3.2 β

The β statistic (not to be confused with the β distribution or any of the other
uses of the letter β in this course) is a measure of response bias: whether an
operator’s response is more likely to indicate signal or to indicate noise at
any point. The β statistic – of which there can be one or there can be many
depending on the experiment – is the ratio of the probability density of the
signal plus noise curve to the density of the noise curve at a criterion point. If
a criterion is relatively strict, then an operator is likely to declare that they
perceive a signal only when they have strong evidence to believe so: there
will be few hits at a strict criterion point and few false alarms as well. If a
criterion is relatively lenient, then an operator is generally more likely to
declare that they perceive signals: there will be relatively many hits at a
lenient criterion point and many false alarms as well. Stricter criteria, as
illustrated in Figure 8.7, are further to the right on the strength axis, where the
probability density of the signal distribution is high relative to the probability
density of the noise distribution (that is, the signal curve is higher than the
noise curve); more lenient criteria sit further to the left, where the probability
density of the noise distribution is high relative to the probability density of
the signal distribution.

Figure 8.7: Criterion Points and β Values

The term response bias may imply that it describes a feature of a given
operator, but that is not necessarily the case. The response bias is largely a
feature of the criterion that the operator has adopted, which in turn can vary
based on circumstances. For example, in a signal-detection experiment where
an operator receives a reward (monetary or otherwise) for each hit that they
register with no penalty for false alarms, the operator has motivation to adopt
a more lenient criterion – they really should say that the signal is present all the
time. Conversely, in a situation where an operator is penalized for false
alarms and not rewarded for hits, then the operator may be motivated to adopt
a more stringent criterion – they might say that the signal is never present.
Signal-detection experiments often take advantage of the variability of criteria
in order to measure more of the underlying noise and signal distributions by
observing more data points.
8.3.3 C-statistic

The C-statistic is a measure of the predictive power of an operator. It has a
hard upper limit of 1 (indicating perfect predictions) and a soft lower limit of
0.5 (indicating indifference between predicting correctly and incorrectly).
The C-statistic is related to d′: larger values of d′ are accompanied by larger
C-statistics.6

The C-statistic is equal to the area under the Receiver Operator


Characteristic (ROC) Curve. It is, equivalently, for that reason also known as
the area under the ROC, the AUC (Area Under Curve), or the AUROC (Area
Under ROC Curve). So now is probably a pretty good time to talk about the
Receiver Operator Characteristic (ROC) Curve.

8.3.3.1 The Receiver Operator Characteristic (ROC) Curve

The Receiver Operator Characteristic (ROC) is a description of the responses


made by the operator in a signal detection context. The ROC curve is a plot of
the hit rate (on the y-axis) against the false alarm rate (on the x-axis). Figure
8.8 is an illustration of what an empirical ROC curve looks like: at different
measured points (for example, different decision criteria), the hit rate and the
false-alarm rate are plotted. Also included (as is conventional) is the
indifference line: a model of what would happen if an operator were
precisely as likely to make correct as to make incorrect decisions; an
illustration of the pure-chance expectation.
Figure 8.8: A Typical Empirical ROC Curve

Figure 8.9 is an illustration of theoretical ROC curves that have been


smoothed according to models (sort of like curvy versions of the least-squares
regression line, as if this page needed more analogies). Three curves are
included in the figure, representing the expected ROC for when d′ = 3, when
d′ = 2, and when d′ = 1, respectively: as d′ gets larger, the ROC curve

bends further away from the indifference line.


Figure 8.9: Typical Smoothed ROC Curve for Three Values of d′

The area under the ROC curve for d′ = 1 in Figure 8.9 – that is, the C-
statistic – is 0.76, the C-statistic for d′ = 2 is 0.921, and the C-statistic for
d′ = 3 is 0.983 (see Figure 8.10).

Figure 8.10: C-statistics for Three ROC Curves with d′ = 3, d′ = 2, and
d′ = 1
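
Those values follow from a handy closed form – under the equal-variance model, the area under the smoothed ROC is Φ(d′/√2). I’m supplying that formula here as background; it isn’t derived on this page:

pnorm(1/sqrt(2)) ## 0.760, the C-statistic for d-prime = 1
pnorm(2/sqrt(2)) ## 0.921, for d-prime = 2
pnorm(3/sqrt(2)) ## 0.983, for d-prime = 3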

8.4 Doing the Math


Having covered the conceptual bases of SDT and the measurements it
produces, we turn now to the actual calculations. We will spend most of our
time with the following dataset from Green and Swets (1966)7, which comes
from an SDT experiment with 5 conditions, each condition having 200 target-
present trials (from which we get the hit rate) and 200 target-absent trials
(from which we get the false alarm rate):

Condition Hit Rate False Alarm Rate


1 0.245 0.040
2 0.300 0.130
3 0.695 0.335
4 0.780 0.535
5 0.975 0.935
Knowing the hit rates and false alarm rates, we can draw the ROC curve for
these data (Figure 8.11):

Figure 8.11: Empirical ROC Curve for the Green & Swets (1966) Data.

The hit rate at each point is the frequency at which the individual correctly
identified a signal. In terms of the Noise/Signal+Noise Distribution
representation, that rate is taken to be the area under the signal+noise
distribution curve to the right of a given point on the strength axis: everything
to the right of the point (indicating greater strength) will be identified by the
operator as a signal.

The false alarm rate at each point is the frequency at which the individual
misidentifies noise as a signal. In terms of the Noise/Signal+Noise
Distribution representation, that rate is the area under the noise distribution
curve to the right of a given point on the strength axis: as with the hit rate,
everything to the right of that point will be (mis)identified by the operator as a
signal.
Thus, the hit rate and the false alarm rate give us the probabilities that an
operator is responding to their perception of signal and noise, respectively,
under each condition. Figure 8.12 depicts these probabilities for the signal
and noise distributions plotted separately for condition 3 (nothing super-
special about that condition, I just had to pick one).

Figure 8.12: Areas Under the Noise and Signal + Noise Curves Indicating
Probability of False Alarm and Hit, Respectively

Because both the signal and the noise distributions are normal distributions,
based on the area in the upper part of the curve, we can calculate z-scores that
mark the point on those curves that define those areas. Additionally, because
the noise distribution is assumed to be a standard normal distribution, the
values of z_FA are also x-values on the strength axis. The signal distribution
lives on the same axis, but its z-scores are based on its own mean and
standard deviation, and so for the points on the strength axis defined by the
criteria – which, in turn, vary according to experimental condition (that is, the
motivation for adopting more stringent or more lenient criteria is
manipulated experimentally) – the same x points will represent different z
values for the signal and for the noise distribution.

Thus, we take the experimentally-observed proportions of hits and false
alarms, we consider those to be the probabilities of hits and false alarms,
translate those probabilities into upper-tail areas under normal
distributions, and find the z-scores that define those upper-tail probabilities:
those will be our z_Hit values (for the signal distribution probabilities) and our
z_FA values (for the noise distribution probabilities):

Condition Hit Rate False Alarm Rate z_Hit z_FA

1 0.245 0.040 0.69 1.75
2 0.300 0.130 0.52 1.13
3 0.695 0.335 -0.51 0.43
4 0.780 0.535 -0.77 -0.09
5 0.975 0.935 -1.96 -1.51
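
In R, those z-scores are just qnorm() applied to the upper-tail areas (a quick sketch using the rates above; the vector names are mine):

hit<-c(0.245, 0.300, 0.695, 0.780, 0.975)
fa<-c(0.040, 0.130, 0.335, 0.535, 0.935)
z.hit<-qnorm(1-hit) ## 0.69 0.52 -0.51 -0.77 -1.96
z.fa<-qnorm(1-fa)   ## 1.75 1.13 0.43 -0.09 -1.51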

At this point, our analyses hit a fork in the road. The observed ROC curve is
based on data: it does not change based on how we choose to analyze the
data, and the C-statistic does not change either. The d′ and β statistics,
however, will change (not enormously, but noticeably) based on what we
believe about the shape of the signal distribution.

8.4.1 Assumptions of Variances

There are two assumptions that we can make about the signal distribution that
will alter our evaluation of d′ and β.

1. Equal Variance: The variance of the signal distribution is equal to the


variance of the noise distribution.

2. Unequal Variance: The signal and the noise distributions can have
different variances.

There is domain-specific debate – as in this recent example paper regarding
recognition memory – over which model is more appropriate and why. I have
no strong opinions on the matter with regard to psychological processes.
From an analytic standpoint, it seems to make more sense to start from the
unequal variance assumption because it includes the possibility of equal
variance (it may be another case of a poorly-named stats thing: the unequal
variance assumption might be more aptly called the not assuming the
variances are equal but they might be assumption, but that’s really not as
catchy). That is, if we start off assuming unequal variance and the variances
end up being exactly equal, then that’s ok. Conversely, though, the analyses
that come from the equal variance assumption do not even allow for the
possibility that the underlying variances could be very different.8

This page will cover both. We will start by analyzing the sample data based
on the unequal variance assumption, then circle back to analyze the same data
using the equal variance assumption.

8.4.2 SDT with Unequal Variances


We know the mean and the standard deviation of the noise distribution
because we decided what they would be: 0 and 1, respectively. That leaves
the mean and standard deviation of the signal+noise distribution to find out.
Since we decided that the noise would be represented by the standard normal,
the mean and standard deviation of the signal+noise distribution become
easier to find.

Our sample data come from an experiment with five different conditions, each
one eliciting a different decision criterion (and, in turn, a different response
bias). To find the overall d′ (and β, which will depend on first finding d′), we
will use a tool known as the linearized ROC: a transformation of the ROC
curve mapped on the Cartesian (x, y) plane.9 The linearized ROC plots z_Hit
on the x-axis and z_FA on the y-axis (see Figure 8.13 below). That’s
potentially confusing since the ROC curve has the hit rate on the y-axis and
the FA rate on the x-axis. But, putting z_Hit on the x-axis and z_FA on the y-axis
makes the math much more straightforward, and we only need the linearized
ROC to calculate d′, so it’s likely worth a little (temporary) confusion.

Thus, for the linearized ROC and only for the linearized ROC, z_FA = y and
z_Hit = x.

The slope of the linearized ROC in the form y = âx + b̂ is the ratio of the
standard deviation of the y variable to the standard deviation of the x variable:

â = sd_y/sd_x = 1.25/1.07 = 1.17.

The intercept of the linearized ROC is an estimate of d′. We know that d′ is
the distance between the mean of the noise distribution and the mean of the
signal+noise distribution in terms of the standard deviation of the noise
distribution. Translating that into the observed data, that means:

ŷ = âx + b̂

ŷ = 1.17x + b̂

Plugging in the mean of z_FA for y and the mean of z_Hit for x:

0.342 = 1.17(−0.406) + b̂

0.342 − 1.17(−0.406) = b̂ = 0.817 = d′
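
The same arithmetic in R, continuing the sketch from above:

a.hat<-sd(z.fa)/sd(z.hit)             ## ~1.17
d.prime<-mean(z.fa)-a.hat*mean(z.hit) ## ~0.817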

The slope of the linearized ROC is also an estimate of the relationship
between the standard deviations of the signal and of the noise distributions:

â ≈ σ_s/σ_n

Thus, the following is a visual representation of the Linearized ROC:


Figure 8.13: Linearized ROC for the Green & Swets (1966) Data.

Because we have assumed that the noise distribution is represented by a


standard normal distribution, which by definition has a standard deviation of
1 (again, we could have picked any normal distribution, and now we are glad
that we picked the one with a mean of 0 and a standard deviation of 1), our
estimate of the standard deviation of the signal+noise distribution is also 1.17.
We also know that since d′ is a measure of the distance between the mean of
the noise distribution and the mean of the signal+noise distribution in terms of
the standard deviation of the noise distribution, that the distance between the
means is equal to d′/1 = d′ and thus that the mean of the signal+noise
distribution is also equal to d′. Our underlying distributions of noise and of
signal+noise therefore look like this:
Figure 8.14: The Signal plus Noise and Noise Curves for the Green & Swets
(1966) Data Under the Assumption of Unequal Variances.

The β estimates at each criterion point – as noted above – are the ratios of the
probability density of the signal plus noise distribution to the probability
density of the noise distribution at each point.

We can use the noise distribution to locate points on the strength axis: we
assumed the noise distribution is a standard normal, so it is centered at 0 and
its standard deviation is 1, so the z-values for the noise distribution are also
x values. Those values are listed on the x-axis in Figure 8.13. We will use
those values to find the probability density at each point for the noise
distribution.

Our next step is to find what those x values represent in terms of the signal
plus noise distribution. We will calculate z-scores for the signal distribution
that correspond to each of the criterion points based on the estimates of the
mean and standard deviation of the signal distribution we got from the
linearized ROC, which are separate from the z_Hit values we got from the hit
rates in the experimental data. The mean of the signal distribution is equal to
d′: μ_Signal = 0.817, and the standard deviation of the signal distribution is
estimated by the slope of the linearized ROC: σ_Signal = 1.17. Plugging each
x value, μ_Signal, and σ_Signal into the z-score formula z = (x − μ)/σ, we arrive at
the following values:

ResponseBias<-c("Most Stringent", "$\\downarrow$", "$\\downarr


Criterion<-1:5
xcriterion<-c(1.75, 1.13, 0.43, -0.09, -1.51)
znoise<-xcriterion
zsignal<-(xcriterion-0.817)/1.17

beta.df<-data.frame(ResponseBias, Criterion, xcriterion, znois

kable(beta.df, "html", booktabs=TRUE, align="c", col.names = c


kable_styling(full_width = TRUE) %>%
collapse_rows(1)

Response Bias Experimental Condition Strength (x) z_noise z_signal

Most Stringent 1 1.75 1.75 0.80


↓ 2 1.13 1.13 0.27
↓ 3 0.43 0.43 -0.33
↓ 4 -0.09 -0.09 -0.78
Most Lenient 5 -1.51 -1.51 -1.99

To find the β values at each criterion point, we next find the probability
density for each curve given the respective z values and the mean and
standard deviation of each curve. To do so, we can simply use the dnorm()
command: to find the densities for the noise distribution – which, again, is a
standard normal distribution – we use the values of znoise as the x variable in
the command dnorm(x, mean=0, sd=1) (as a reminder: mean=0, sd=1 are
the defaults for dnorm(), so in this specific case you can leave those out if
you prefer), and to find the densities for the signal distribution, we use the
values of zsignal as the x variable in the command dnorm(x, mean=0.817,
sd=1.17), indicating the mean and standard deviation of the signal
distribution. Then, β_c for each criterion point c is simply the ratio of the
signal density to the noise density.


beta.df$noisedensity<-dnorm(znoise)
beta.df$signaldensity<-dnorm(zsignal, 0.817, 1.17)
beta.df$beta<-beta.df$signaldensity/beta.df$noisedensity

kable(beta.df, "html", booktabs=TRUE, align="c",
      col.names = c("Response Bias", "Experimental Condition", "Strength (x)", "z noise", "z signal", "Noise Density", "Signal Density", "$\\beta_c$")) %>%
  kable_styling(full_width = TRUE) %>%
  collapse_rows(1)

| Response Bias | Experimental Condition | Strength (x) | z noise | z signal | Noise Density | Signal Density | $\beta_c$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Most Stringent | 1 | 1.75 | 1.75 | 0.80 | 0.09 | 0.34 | 3.95 |
| ↓ | 2 | 1.13 | 1.13 | 0.27 | 0.21 | 0.31 | 1.45 |
| ↓ | 3 | 0.43 | 0.43 | -0.33 | 0.36 | 0.21 | 0.58 |
| ↓ | 4 | -0.09 | -0.09 | -0.78 | 0.40 | 0.14 | 0.34 |
| Most Lenient | 5 | -1.51 | -1.51 | -1.99 | 0.13 | 0.02 | 0.15 |

8.4.3 Equal Variance Assumption


Figure 8.15: Equal Variance Assumption

Under the equal variance assumption, $\frac{\sigma_{signal}}{\sigma_{noise}} = 1$. We can therefore dispense with finding the slope of the linearized ROC and call it 1. Thus:

$$y = x + d' \;\Rightarrow\; z_{FA} = z_{Hit} + d'$$

$$d' = z_{FA} - z_{Hit}$$

Again plugging in the mean of the $z_{FA}$ values and the mean of the $z_{Hit}$ values:

$$d' = 0.342 - (-0.406) = 0.748$$
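Here is a quick sketch of that calculation in R. The $z_{FA}$ values are the criterion z-values from the table above; the hit rates are inferred from the trial counts that appear in the logistic-regression code later in this section, so treat them as an assumption of this sketch rather than as part of the original analysis:

zFA<-c(1.75, 1.13, 0.43, -0.09, -1.51)      ## criterion z-values from the table above
hits<-c(0.245, 0.300, 0.695, 0.780, 0.975)  ## assumed hit rates for the five conditions
zHit<-qnorm(1-hits)                         ## z-score corresponding to each hit rate

mean(zFA)-mean(zHit)                        ## equal-variance d-prime: approximately 0.748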

The value of $d'$ found without the assumption of equal variance was $d' = 0.817$, so it's not an enormous difference.

Because $d'$ has changed with the change in assumptions, the estimate of the mean of the signal distribution $\mu_{Signal}$ has also changed: both are now $d' = \mu_{Signal} = 0.748$. Also, since we are now assuming equal variance between the noise and the signal plus noise distributions, the variance and the standard deviation of the signal plus noise distribution are the same as those for the noise distribution; that is to say, $\sigma^2_{Noise} = 1$ and $\sqrt{\sigma^2_{Noise}} = \sigma_{Noise} = 1$ (because we still assume that the noise distribution is a standard normal), and thus $\sigma^2_{Signal} = 1$ and $\sqrt{\sigma^2_{Signal}} = \sigma_{Signal} = 1$. Our estimates of $\beta_c$ for each criterion point will also change – the z-scores and the corresponding probability densities for the noise distribution don't change, but the z-scores and the densities for the signal plus noise distribution do. Replacing the mean and standard deviation of the signal plus noise distribution derived using the unequal-variance assumption – namely, $\mu_{Signal} = 0.817$ and $\sigma_{Signal} = 1.17$ – with those derived using the equal-variance assumption – $\mu_{Signal} = 0.748$ and $\sigma_{Signal} = 1$ – we can update the table of z-scores, probability densities, and $\beta_c$:

ResponseBias<-c("Most Stringent", "$\\downarrow$", "$\\downarrow$", "$\\downarrow$", "Most Lenient")
Criterion<-1:5
xcriterion<-c(1.75, 1.13, 0.43, -0.09, -1.51)
znoise<-xcriterion
zsignal<-(xcriterion-0.748)

betaequal.df<-data.frame(ResponseBias, Criterion, xcriterion, znoise, zsignal)
betaequal.df$noisedensity<-dnorm(znoise)
betaequal.df$signaldensity<-dnorm(zsignal, 0.748, 1)
betaequal.df$beta<-betaequal.df$signaldensity/betaequal.df$noisedensity

kable(betaequal.df, "html", booktabs=TRUE, align="c",
      col.names = c("Response Bias", "Experimental Condition", "Strength (x)", "z noise", "z signal", "Noise Density", "Signal Density", "$\\beta_c$")) %>%
  kable_styling(full_width = TRUE) %>%
  collapse_rows(1)

| Response Bias | Experimental Condition | Strength (x) | z noise | z signal | Noise Density | Signal Density | $\beta_c$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Most Stringent | 1 | 1.75 | 1.75 | 1.00 | 0.09 | 0.39 | 4.48 |
| ↓ | 2 | 1.13 | 1.13 | 0.38 | 0.21 | 0.37 | 1.77 |
| ↓ | 3 | 0.43 | 0.43 | -0.32 | 0.36 | 0.23 | 0.62 |
| ↓ | 4 | -0.09 | -0.09 | -0.84 | 0.40 | 0.11 | 0.29 |
| Most Lenient | 5 | -1.51 | -1.51 | -2.26 | 0.13 | 0.00 | 0.03 |

8.4.3.1 The Area Under the ROC Curve: AUC, or, the C-statistic

The C-statistic is also known as the area under the ROC curve (AUC), and it is literally the area under the curve. For the experimental data from Green & Swets, should we want to calculate the C-statistic, we merely need to treat the area under the observed curve as a series of triangles and rectangles, as shown in Figure 8.16:

Figure 8.16: Breaking Down The Empirical ROC from the Green & Swets (1966) Data.

If we add up the areas of all those triangles ($A = \frac{1}{2}bh$) and all those rectangles ($A = bh$), we get an area under the curve – the C-statistic – of 0.694.
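Equivalently, each slice under the empirical curve is a trapezoid – a rectangle plus a triangle – so the whole calculation is a few lines of R. The hit and false-alarm rates below are inferred from the trial counts used in the logistic-regression code later on, so treat them as an assumption of this sketch; the 0s and 1s anchor the curve at its endpoints:

FA<-c(0, 0.040, 0.130, 0.335, 0.535, 0.935, 1)   ## false-alarm rates plus the (0, 0) and (1, 1) endpoints
Hit<-c(0, 0.245, 0.300, 0.695, 0.780, 0.975, 1)  ## hit rates plus the endpoints

## each trapezoid's area is its width times the average of its two heights
sum(diff(FA)*(head(Hit, -1)+tail(Hit, -1))/2)

## [1] 0.6941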
8.4.4 Logistic Regression and the C-stat

However, we really don't want to be in the business of calculating C-statistics by hand, especially when models become more complicated. Instead, we will employ a regression method known as logistic regression. Logistic regression enjoys a host of applications and has many moving parts that we will revisit later, but we'll only need a brief introduction to it for our signal-detecting purposes here.

Logistic regression is used when the dependent variable – or, predicted variable – is binary. It is frequently used in diagnostic contexts: how well do independent – or predictor – variables predict medical conditions? Just as medical diagnoses can be applied to signal detection theory in terms of correct diagnoses, misses, false alarms, and correct rejections, so can the predictions made by logistic regression.

For the Green & Swets data, we can translate the hit and false alarm rates to
binary responses because we know that there were n = 200 target-present
trials and n = 200 target-absent trials for each condition: each hit and each
false alarm gets a value of 1 for the response, and each miss and each correct
rejection gets a value of 0 for the response. Likewise, we can assign a value
of 1 for the binary condition of the target being present or absent for each of
the target-present trials and a value of 0 for each of the target-absent trials.
Pooling the data across all of the trials, here is a contingency table for the
operator responses vs. whether the target was present or absent:

Decision<-rep("Is the Target There?", 2)
rejectaccept<-c("Signal Present", "Signal Absent")
Yes<-c(599, 395)
No<-c(401, 605)

kable(data.frame(Decision, rejectaccept, Yes, No), "html", escape=FALSE,
      col.names=c("", "", "Yes", "No")) %>%
  kable_styling() %>%
  add_header_above(c(" "=2, "Operator Response"=2)) %>%
  collapse_rows(1)

| Is the Target There? | Operator Says Yes | Operator Says No |
| --- | --- | --- |
| Signal Present | 599 | 401 |
| Signal Absent | 395 | 605 |

We can use the target-present variable to predict responses using logistic regression. The mathematical form of the logistic regression may be daunting:

$$p(y = 1) = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$

where $b_0$ is the intercept and $b_1$, the coefficient, is the logarithm of the odds ratio associated with the predictor variable.
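If that formula looks scary, remember that it's just a recipe for turning a linear predictor into a probability. Here's a toy example with made-up coefficients (b0 and b1 below are hypothetical values, not estimates from the Green & Swets data):

b0<- -0.5  ## hypothetical intercept
b1<- 1.2   ## hypothetical coefficient
x<-1       ## a value of the predictor variable

1/(1+exp(-(b0+b1*x)))  ## predicted p(y = 1): about 0.67, and identical to plogis(b0+b1*x)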

But for our purposes, all we need to know is that we will use almost exactly the same commands to predict the binary response outcome from the target-presence variable as we did to predict y from x in linear regression. The only changes are that we will call glm() instead of lm() – generalized linear model instead of linear model – and we will indicate that this is a logistic regression – with a binary outcome – by including the argument family = "binomial" in the glm() parentheses.

signal.present<-c(rep(1, 1000), rep(0, 1000))


response<-c(rep(1, 49), rep(0, 151),
rep(1, 60), rep(0, 140),
rep(1, 139), rep(0, 61),
rep(1, 156), rep(0, 44),
rep(1, 195), rep(0, 5),
rep(1, 8), rep(0, 192),
rep(1, 26), rep(0, 174),
rep(1, 67), rep(0, 133),
rep(1, 107), rep(0, 93),
rep(1, 187), rep(0, 13))

logistic.model<-glm(response~signal.present, family="binomial")

Then, we will use the Cstat() command from the DescTools package to get the C-statistic:

library(DescTools)
Cstat(glm(response~signal.present, family="binomial"))

## [1] 0.6020037

The value we get from the logistic regression – 0.602 – is noticeably lower than the 0.694 we got by calculating the area under the empirical ROC curve by hand. That's not a software glitch: pooling all of the trials collapses the five criterion points into a single binary predictor, and the C-statistic for a model with one binary predictor works out to (essentially) the average of the hit rate and the correct rejection rate – $(0.599 + 0.605)/2 = 0.602$ – so the shape of the curve between the criterion points is lost.

In the foregoing example from Green & Swets (1966), the binary presence or
absence of the signal was used to predict operator response. Using logistic
regression allows us to include much more data in the prediction. If, say, we
were not interested in predicting operator response but in predicting the
binary presence or absence of a medical condition, we could include multiple
health-related predictor variables to help inform the prediction.

For example, we can use systolic blood pressure data to predict high blood pressure from some of the other variables in the dataset. First, we need to define a binary variable for the prediction. We can (rather arbitrarily) define "high" blood pressure as "above the median SBP level for this dataset":

sbp<-data.frame(read.csv("data/hd.csv"))
sbp$High<-ifelse(sbp$SBP>median(sbp$SBP), 1, 0)

If we wanted to predict whether a given person would be in the high-SBP category based on their age, their body mass index, and their smoking status, we could define a model called sbpmodel using the following model statement:

sbpmodel<-glm(High~Age+BMI+Smoke, data=sbp, family="binomial")

Then, using the Cstat command:

Cstat(sbpmodel)

## [1] 0.8554688

we find that the C-statistic for the model is 0.86. For context: if the model were no better than chance, we would expect a C-statistic around 0.5; C-statistics greater than 0.7 are usually considered to show reasonably good predictive power for the model, and C-statistics greater than 0.8 are considered to indicate strong predictive power.10

1. If you’re wondering something like “if humans can’t make decisions


based on objective differences in strength of signals vs. noise, why not
let computers do the really important jobs for us,” the short answers are
1. we do, to an extent, and 2. keeping humans involved in signal-
detection tasks is a very good idea.↩

2. Fun fact: have you ever heard that eating carrots improves eyesight
and/or confers the ability to see in the dark? That’s the result of a WWII
British intelligence campaign to prevent Germany from figuring out that
Royal Air Force pilots had on-board radar. Carrots are still good for
you, though, and can help in maintaining (but not super-powering)
eyesight.↩

3. It’s the central conceit of the book The Signal and the Noise by the
increasingly insufferable Nate Silver.↩

4. Invoking the idea of a signal-to-noise ratio is mixing metaphors a bit, but


it is sometimes used in statistical analyses in the context of effect size.↩

5. The term assumption is commonly used, but it’s not so much an


assumption in the sense that we assume that noise is shaped like that, but
that we assume it’s true because we can make the shape and position of
the noise distribution anything we want and that’s the shape and position
we decided on because it makes things like math and interpretation way
easier.↩

6. C-statistics can be less than 0.5 for the exact same, Ultimate-Spider-
Man-explained reasons that d′ can be negative: it’s an artifact of chance
responding or it’s a result of systematic misunderstanding of responses.↩

7. David Green and John Swets’s book is one of the seminal texts on
statistical analysis of psychophysical measures. The full citation is:
Green, D. M., & Swets, J. A. (1966). Signal detection theory and
psychophysics (Vol. 1). New York: Wiley.↩

8. If there is only one experimental condition and thus one criterion point,
then the point is moot: you would have to assume equal variances
because you wouldn’t have any way of assessing different variances.↩

9. Similar approaches involve creating least-squares regression lines based


on condition-specific d′ estimates.↩

10. The usual caveats about interpreting effect sizes apply, and the citation for those figures is: Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. New York: Wiley.↩
9 Markov Chain Monte Carlo Methods
9.1 Let’s Make a Deal
In the September 9, 1990 issue of Parade Magazine, the following question
appeared in Marilyn vos Savant’s “Ask Marilyn” column:

"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to switch your choice?"

This is the most famous statement of what came to be known as The Monty Hall
Problem due to the similarity of its setup to the game show Let’s Make a Deal,
hosted in the 60’s, 70’s, 80’s, and 90’s by Monty Hall. This article lays out the
history of the problem and the predictably awful reason why vos Savant in
particular drew so voluminous and vitriolic a response for her correct answer
(hint: it rhymes with “mecause she’s a woman”).

The Monty Hall Problem is set up nicely for solving with Bayes’s Theorem. In
this case, we are interested in comparing:

1. the posterior probability that the prize is behind the door originally picked
by the contestant and
2. the posterior probability that the prize is behind the door that the contestant
can switch to.

For simplicity of explanation, let's say that the contestant originally chooses Door #1, and the host shows them that the prize is not behind Door #2 (the answer is the same regardless of which specific doors we choose to calculate the probabilities for). Thus, we are going to compare the probability that the prize is behind Door 1 given that the host has shown that the prize is not behind Door 2 – we'll call that $p(Door_1|Show_2)$ – with the probability that the prize is behind Door 3 given that the host has shown Door 2 – $p(Door_3|Show_2)$.

The prior probability that the prize is behind Door #1 is $p(Door_1) = 1/3$, and the prior probability that the prize is behind Door #3, $p(Door_3)$, is also 1/3 – without any additional information, each door is equally likely to be the winner.

Since the contestant has chosen Door #1, the host can only show either Door #2 or Door #3. If the prize is in fact behind Door #1, then each of the host's choices is equally likely, so $p(Show_2|Door_1) = 1/2$. However, if the prize is behind Door #3, then the host can only show Door #2, so $p(Show_2|Door_3) = 1$. Thus, the likelihood associated with keeping a choice is half as great as the likelihood associated with switching.

The base rate is going to be the same for both doors, so it's not 100% necessary to calculate, but here it is anyway (note that the Door #2 term drops out because the host will never show the winning door, so $p(Show_2|Door_2) = 0$):

$$p(Show_2) = p(Door_1)p(Show_2|Door_1) + p(Door_2)p(Show_2|Door_2) + p(Door_3)p(Show_2|Door_3) = \frac{1}{6} + 0 + \frac{1}{3} = \frac{1}{2}$$

Thus, the probability of Door #1 having the prize given that Door #2 is shown – and thus the probability of winning by staying with the original choice of Door #1 – is:

$$p(Door_1|Show_2) = \frac{(1/3)(1/2)}{1/2} = \frac{1}{3}$$

and the probability of the prize being behind Door #3, and thus the probability of winning by switching, is:

$$p(Door_3|Show_2) = \frac{(1/3)(1)}{1/2} = \frac{2}{3}$$
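And here's that arithmetic in a few lines of R, just to keep the numbers in one place:

prior<-1/3                                  ## prior probability for each door
like.stay<-1/2                              ## p(Show2|Door1)
like.switch<-1                              ## p(Show2|Door3)
evidence<-prior*like.stay+prior*like.switch ## p(Show2): the Door 2 term is 0

prior*like.stay/evidence    ## p(Door1|Show2) = 1/3
prior*like.switch/evidence  ## p(Door3|Show2) = 2/3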

What does this have to do with Monte Carlo simulations? Well, the prominent mathematician Paul Erdős famously refused to believe that switching was the correct strategy until he saw a Monte Carlo simulation of it, so that's what we'll do.

9.2 Let's Make a Simulation, or, the Monte Carlo Problem
9.2.1 Single Game Code

Let’s start by simulating a single game.

First, let’s randomly put the prize behind a door. To simulate three equally likely
options, we’ll draw a random number between 0 and 1: if the random number is
between 0 and 1/3, that will represent the prize being behind Door 1, if the
random number is between 1/3 and 2/3, that will represent the prize being
behind Door 2, and if the random number is between 2/3 and 1, that will
represent the prize being behind Door 3.

The default parameter settings in the base R distribution commands with the root unif() define a continuous uniform distribution that ranges from 0 to 1. Using the command runif(1) will take one sample from that uniform distribution: it's a random number between 0 and 1. We can use the ifelse() command to record a value of 1, 2, or 3 for the simulated prize location:

## Putting the prize behind a random door

set.seed(123) ## Setting a specific seed makes the random draws reproducible

x1<-runif(1) ## Select a random number between 0 and 1
Prize<-ifelse(x1<=1/3, 1, ## If the number is between 0 and 1/3, the prize is behind Door 1
       ifelse(x1<=2/3, 2, ## Otherwise, if the number is less than 2/3, it's behind Door 2
              3)) ## Otherwise, the prize is behind Door 3

Prize

## [1] 1

Next, we can simulate the contestant’s choice. Since the contestant is free to
choose any of the three doors and their choice is not informed by anything but
their own internal life, we can use the same algorithm that we used to place the
prize:

## Simulating contestant's choice

x2<-runif(1) ## Select a random number between 0 and 1

Choice<-ifelse(x2<=1/3, 1, ## If the number is between 0 and 1/3, the contestant chooses Door 1
        ifelse(x2<=2/3, 2, ## Otherwise, if the number is less than 2/3, they choose Door 2
               3)) ## Otherwise, the contestant chooses Door 3

Choice

## [1] 3

Now, here’s the key step: we simulate which door the host shows the contestant.
It’s going to be a combination of rules and randomness:

1. If the contestant has chosen the prize door, then the host shows either of the
two other doors with p = 1/2 per door.

a. We can draw another random number between 0 and 1: if it is less than 1/2
then one door is chosen, if it is greater than 1/2 then the other is chosen.

2. If the contestant has not chosen the prize door, then the host must show the
door that is neither the prize door nor the contestant’s choice:

x3<-runif(1)
if(Prize==1){
if(Choice==1){
ShowDoor<-ifelse(x3<=0.5, 2, 3)
}
else if(Choice==2){
ShowDoor<-3
}
else{
ShowDoor<-2
}
} else if(Prize==2){
if(Choice==1){
ShowDoor<-3
}
else if(Choice==2){
ShowDoor<-ifelse(x3<=0.5, 1, 3)
}
else{
ShowDoor<-1
}
} else if(Prize==3){
if(Choice==1){
ShowDoor<-2
}
else if(Choice==2){
ShowDoor<-1
}
else{
ShowDoor<-ifelse(x3<=0.5, 1, 2)
}
}

ShowDoor

## [1] 2

Given the variables Prize, Choice, and ShowDoor, we can make another
variable StayWin1 to indicate if the contestant would win (StayWin = 1) or
lose (StayWin = 0) by keeping their original choice:

StayWin<-ifelse(Prize==Choice, 1, 0)
StayWin
## [1] 0

And we can also create a variable SwitchWin2 to indicate if the contestant would win (SwitchWin = 1) or lose (SwitchWin = 0) by switching their choice:

if (Choice==1){
if (ShowDoor==2){
Switch<-3
} else if (ShowDoor==3){
Switch<-2
}
} else if (Choice==2){
if (ShowDoor==1){
Switch<-3
} else if (ShowDoor==3){
Switch<-1
}
} else if (Choice==3){
if (ShowDoor==1){
Switch<-2
} else if (ShowDoor==2){
Switch<-1
}
}

SwitchWin<-ifelse(Prize==Switch, 1, 0)
SwitchWin

## [1] 1

Thus, to recap, in this specific random game: the prize was behind Door #1, the contestant chose Door #3, and the host showed Door #2: the contestant would win by switching and lose by staying. But that's just one specific game: in games where stochasticity is involved, good strategies often lose and bad strategies often win.^["My shit doesn't work in the play-offs. My job is to get us to the play-offs. What happens after that is fucking luck." – Billy Beane, characterizing the difference between stochastic strategy in large vs. small sample sizes, from Lewis, M. (2003). Moneyball. New York: W.W. Norton & Company, p. 275.]

9.2.2 Repeated Games


Now, let’s pull the whole simulation together and repeat it 1,000 times. If our
simulation works (hopefully!) and Bayes’s Theorem works, too (hahaha it
better!), then we should see about 1/3 of the 1,000 games won by staying and
about 2/3 won by switching.

The following code takes all the lines of code for the single-repetition game and
wraps it in a for loop with 1,000 iterations.

## Putting the prize behind a random door

StayWin<-rep(NA, 1000) ## When making an array, R prefers if you pre-allocate its length
SwitchWin<-rep(NA, 1000) ## We start with NA's - R's version of missing values

for (i in 1:1000){
x1<-runif(1) ## Select a random number between 0 and 1
Prize<-ifelse(x1<=1/3, 1, ## If the number is between 0 and 1/3, the prize is behind Door 1
       ifelse(x1<=2/3, 2, ## Otherwise, if the number is less than 2/3, it's behind Door 2
              3)) ## Otherwise, the prize is behind Door 3
## Simulating contestant's choice

x2<-runif(1) ## Select a random number between 0 and 1

Choice<-ifelse(x2<=1/3, 1, ## If the number is between 0 and 1/3, the contestant chooses Door 1
        ifelse(x2<=2/3, 2, ## Otherwise, if the number is less than 2/3, they choose Door 2
               3)) ## Otherwise, the contestant chooses Door 3

x3<-runif(1)

if(Prize==1){
if(Choice==1){
ShowDoor<-ifelse(x3<=0.5, 2, 3)
}
else if(Choice==2){
ShowDoor<-3
}
else{
ShowDoor<-2
}
} else if(Prize==2){
if(Choice==1){
ShowDoor<-3
}
else if(Choice==2){
ShowDoor<-ifelse(x3<=0.5, 1, 3)
}
else{
ShowDoor<-1
}
} else if(Prize==3){
if(Choice==1){
ShowDoor<-2
}
else if(Choice==2){
ShowDoor<-1
}
else{
ShowDoor<-ifelse(x3<=0.5, 1, 2)
}
}

StayWin[i]<-ifelse(Prize==Choice, 1, 0)

if (Choice==1){
if (ShowDoor==2){
Switch<-3
} else if (ShowDoor==3){
Switch<-2
}
} else if (Choice==2){
if (ShowDoor==1){
Switch<-3
} else if (ShowDoor==3){
Switch<-1
}
} else if (Choice==3){
if (ShowDoor==1){
Switch<-2
} else if (ShowDoor==2){
Switch<-1
}
}

SwitchWin[i]<-ifelse(Prize==Switch, 1, 0)
}
MontyHallResults<-data.frame(c("Staying", "Switching"), c(sum(StayWin), sum(SwitchWin)))
colnames(MontyHallResults)<-c("Strategy", "Wins")

library(ggplot2)
library(ggthemes)
ggplot(MontyHallResults, aes(Strategy, Wins))+
  geom_bar(stat="identity")+
  theme_tufte(base_size=16, base_family="sans", ticks=FALSE)+
  labs(x="Strategy", y="Games Won")

Figure 9.1: Frequency of Wins in 1,000 Monty Hall Games Following ‘Stay’
Strategies and ‘Switch’ Strategies

In our simulation, a player who stayed every time would have won 331 games
for a winning percentage of 0.331; a player who switched every time would have
won 669 games for a winning percentage of 0.669. So, Marilyn vos Savant and
Thomas Bayes are both vindicated by our simulation.

Please note that we arrived at our simulation result without including anything in
the code about Bayes’s Theorem or any other mathematical description of the
problem. Instead, we arrived at estimates of the expected value of the
distribution of wins by describing the logic of the game. That’s a valuable feature
of MCMC methods: in situations where we might not have firm expectations for
the posterior distribution of a complex process, we can arrive at estimates based
on taking those processes step-by-step.
Another matter of interest (possibly only to me): the game described in the Monty
Hall Problem was never really played on Let’s Make a Deal.

9.3 Random Walks


Random Walk models are an application of Monte Carlo Methods. Random Walks were introduced in the page on probability theory, but are worth revisiting. The random walk uses randomly-generated numbers to make models of stochastic3 processes. The model depicted in Figure 9.2 simulates the cash endowment of a gambler making repeated $10 bets on a 50-50 proposition: at each step, she can either win $10 or lose $10. In this series of gambles, the gambler starts with $100. Her cash endowment peaks at $170 after winning 8 out of her first 11 bets, she hits her minimum of $60 on four separate occasions (after losing her 65th, 67th, 125th, and 199th bets), and finishes with $70.

Figure 9.2: A Random Walk


The value of using a Random Walk model in this example is that it provides a
tangible description of a set of behaviors that result from a stochastic process –
this is a case where the behavior is of a hypothetical bankroll rather than of a
human or an animal – that might not be as easily understood if described by
statistics like the expected value and variance of a gamble.

In this example, the expected value of each gamble is 0 – one is just as likely to win $10 as to lose $10 – and the variance of each gamble is the probability-weighted square of the expectations: $(\frac{1}{2})(\$100) + (\frac{1}{2})(\$100) = \$100$, so the standard deviation is $\sqrt{\$100} = \$10$. Thus, in the long run, the expectation of the gamble over $N$ trials is $0 \times N$, the variance is $\$100 \times N$, and the standard deviation is $\$10 \times \sqrt{N}$.4 But that doesn't really give us an accurate description of the gambling experience, nor, by extension, of how people make decisions while gambling. If somebody went to a casino with $100, they might be less interested in knowing what would happen if I went to the casino an infinite amount of times and played an infinite amount of games? and more interested in knowing will I win? With repeated Random Walk models – that is, running models like the one above 1,000 or 1,000,000 times or whatever we like – we can model the rate at which a gambler could run out of money, the rate at which they could hit a target amount of winnings at which they decide to walk away, or the rate of anything in-between.
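Here's a rough sketch of what that kind of repeated modeling looks like in code: 1,000 simulated gamblers, each starting with $100 and making up to 200 $10 bets, with the walk stopping early for anybody who goes broke (the 200-bet horizon is an arbitrary choice for illustration):

set.seed(77)
busted<-rep(NA, 1000)
for (r in 1:1000){
  cash<-100
  for (bet in 1:200){
    cash<-cash+sample(c(-10, 10), 1) ## win or lose $10 with equal probability
    if (cash==0) break               ## the gambler is broke: stop this walk
  }
  busted[r]<-(cash==0)
}
mean(busted) ## proportion of simulated gamblers who went broke within 200 bets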

The above is a relatively simple Random Walk, as it only involves one dimension, and we have ways of understanding long-term expected values of each step. Random Walks can also be two-dimensional, as in Figure 9.3, a demonstration of a two-dimensional random walk model from this paper on cellular biology:
Figure 9.3: Diagram of Two-Dimensional Random Walk Model

9.4 Markov Chain Monte Carlo Methods


Random Walk models are, specifically, examples of Markov Chain Monte Carlo methods (MCMC). The key quality of Markov Chains is the condition of being memoryless: a Markov Chain is one in which the next step in a chain depends only on the current step. That is, each event is a step from the current state – a step in one dimension as in Figure 9.2, in two dimensions as in Figure 9.3, or more – that is independent of whatever steps came before the current state.5

We also can use MCMC models as blunt-force tools for understanding stochastic processes where analytic methods come up short. Famously, mathematicians have been unable to use analytic methods to determine the probability of winning a game of Klondike Solitaire (see Figure 9.4). (Klondike is itself memoryless, by the way: every time you start a new game, the result of the prior game has no bearing on the next one.)
Figure 9.4: Solitaire is a Lonely Man’s Game.

By simulating games using Monte Carlo methods, computer scientists have found solutions to the Solitaire problem where mathematicians haven't been able to.

Computer simulations of games more complex than Klondike – namely, athletic competitions – are often used in popular media to predict outcomes. For example: ESPN has used the Madden series of video games to simulate the outcome of Super Bowls. Letting computer-generated opponents play each other in a video game repeatedly is equivalent to a Monte Carlo simulation with lots of parameters, including simulated physical skills of the simulated competitors and the physics of the simulated environment. It's also the basic plot generator of the 2006 film Rocky Balboa.

On the remote chance that you, dear reader, may be interested in statistical
analysis regarding things other than game shows, or gamblers, or solitaire, or
boxing matches, we turn now to the main point:

9.5 MCMC in Bayesian Analysis


MCMC methods are a popular way to estimate posterior distributions of parameters in Bayesian frameworks.^[MCMC methods are not very popular in frequentist statistics for several reasons, including: 1. posterior distributions aren't a thing in the frequentist framework, and 2. frequentist statistics are based around assumed, rather than estimated, probability distributions. There are classical methods that make use of repeated sampling, most prominently the calculation of bootstrapped confidence intervals, but these also rely on the assumption of probability distributions.] As described in Classical and Bayesian Inference, once the posterior distribution on a parameter (or set of parameters) is determined, then various inferential questions can be answered, such as: what is the highest density interval (HDI)? or what is the probability that the parameter is greater than/less than a variable of interest?

Here we will focus on the most flexible – although not always the most efficient – MCMC method for use in Bayesian inference: the Metropolis-Hastings (MH) algorithm. We will discuss how it works, how to get estimates of a posterior distribution, why MH gives us samples from the posterior distribution, and applications to Psychological Science.

9.5.1 The Metropolis-Hastings Algorithm


Here’s a really super-basic explanation of how the MH algorithm works: you
start with a random number for your model, then you pick another random
number. If the new random number makes your model better, you keep it, if it
doesn’t, then sometimes you keep it anyway but otherwise you stick with the old
number. You repeat that process a bunch of times, keeping track of all the
numbers that you kept, and the numbers you kept are the distribution.

The italicized step (sometimes you keep it anyway) is the key to generating a
distribution. If the MH algorithm only picked the better number, then the
procedure would converge on a single value rather than a distribution. What’s
worse, that number might not be the best number, and the algorithm could get
stuck for a long time until a better number is found. The sometimes is key to
getting a distribution with both high-density probabilities and lower-density
probabilities (a common feature of probability distributions).

9.5.1.1 Sampling from the Posterior Distribution


The random numbers that we plug into the MH algorithm are candidate values for parameters. We sample them at random from a prior distribution: in Bayesian terms, they are samples from $p(H)$. The determination of whether the new candidate values make the model better or not is based on the likelihood of the observed data given the hypothesized values of the parameters: in Bayesian terms, it is the likelihood $p(D|H)$. We decide which candidate value is better based on the ratio of the prior times the likelihood of the new candidate to the prior times the likelihood of the old candidate. If we call the current parameter value $H_1$, the proposed new candidate parameter $H_2$, and the ratio of the products of the priors and the likelihoods $r$, then:

$$r = \frac{p(H_2)p(D|H_2)}{p(H_1)p(D|H_1)}$$

Since Bayes's Theorem tells us that the posterior probability of $H$ is $\frac{p(H)p(D|H)}{p(D)}$, $r$ is the ratio of the two posterior probabilities $p(H_2|D)$ and $p(H_1|D)$ – $p(D)$ would be the same for both hypotheses, and so cancels out. If we sample all of the candidate values from the same prior distribution – as we will for our examples in this page – $p(H)$ cancels out as well. For example, if each candidate value is sampled randomly from a uniform distribution, then the probability of choosing each value is the same. Thus, it is more common to present the ratio $r$ as:

$$r = \frac{p(D|H_2)}{p(D|H_1)}$$

Sampling parameter values from a prior distribution and evaluating them based on the product of the (often canceled) prior probability and the likelihood given the prior, divided by the (always canceled) probability of the data, ensures that the MH algorithm produces samples from the posterior distribution.
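For a concrete feel for $r$: with the binomial data analyzed later in this section ($s = 16$ successes in $N = 20$ trials), the ratio comparing a proposed parameter value of 0.8 to a current value of 0.7 is:

dbinom(16, 20, 0.8)/dbinom(16, 20, 0.7) ## about 1.67: r >= 1, so the proposal would be accepted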

9.5.1.2 The Acceptance sub-Algorithm

The MH algorithm always accepts the proposed candidate parameter value if the ratio of the likelihood of the data given the proposed parameter to the likelihood of the data given the current parameter is greater than or equal to 1:

$$\text{1. Accept } H_2 \text{ if } r = \frac{p(D|H_2)}{p(D|H_1)} \geq 1$$

Thus the posterior distribution will be loaded up with high-likelihood values of the parameter. The sometimes is also not nearly as random as my super-basic description has made it out to be.

As alluded to above, the MH algorithm sometimes accepts the proposed candidate parameter value even when the ratio $r$ is less than 1. The new parameter will be accepted with a probability equal to the likelihood ratio: if $r = 0.8$, then the probability of accepting $H_2$ is 0.8; if $r = 0.2$, then the probability of accepting $H_2$ is 0.2. That is how the MH algorithm produces the target distribution based on the likelihood function: it doesn't randomly accept parameters that don't increase the likelihood of the data given the parameter, it accepts them at a rate equal to how close they get the model to peak likelihood.

The step of accepting parameters that don't increase the likelihood, at rates equal to the amount that they do change the likelihood, is accomplished by generating a random value from a uniform distribution, which is usually called $u$. Similarly to how we simulated a choice between three doors by generating a random number from the uniform distribution and calling it (for example) "Door #1" if the random number were $\leq 1/3$: if $r > u$, then we accept the proposed candidate value for the parameter:

$$\text{2. Accept } H_2 \text{ if } r = \frac{p(D|H_2)}{p(D|H_1)} > u \text{, else accept } H_1$$

If, for example, $r = 0.8$: because there is an 80% chance that the random number $u < 0.8$, there is an 80% chance that $H_2$ will be accepted; if $r = 0.2$, then there is a 20% chance that $H_2$ will be accepted because there is a 20% chance that a randomly-generated $u$ will be less than 0.2.
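In code, the acceptance step is only a couple of lines. Here's a sketch in which r stands in for a likelihood ratio that has already been calculated:

r<-0.8        ## a likelihood ratio less than 1, for illustration
u<-runif(1)   ## a random number between 0 and 1
(r>=1)||(r>u) ## TRUE with probability min(r, 1) -- here, an 80% chance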

Here, in words, is the full Metropolis-Hastings algorithm:

1. Choose a starting parameter value. Call this the current value.

In theory, it doesn’t really matter which starting value you use – it can be a
complete guess – but it helps if you have an educated guess because the
algorithm will get to the target distribution faster if the starting parameter is
closer to the middle of what the target distribution will be. For a parameter in the
[0, 1] range, 0.5 is a reasonable place to start.
2. Generate a proposed parameter value from the prior distribution.

For example, given an uninformative prior, a sample from a uniform distribution


will do nicely.

3. Calculate the ratio $r$ between the likelihood of the observed data given the proposed value and the likelihood of the observed data given the current value.

This part can be trickier than it appears. For the MH algorithm to generate the actual target distribution, it must use a good likelihood function (formally, this requirement is often stated as: the likelihood function must be proportional to the target distribution). In a case where the data are binomial, then the likelihood function is known (it's the binomial likelihood function $\frac{N!}{s!f!}\pi^s(1-\pi)^f$). For more complex models, the likelihood function can be derived in much the same way the binomial likelihood function is derived. Other investigations can make assumptions about the likelihood function: if a normal distribution is expected to be the target distribution, for example, then the likelihood function is the density function for the normal ($f(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$). But once the likelihood function is decided upon, $r$ is just a simple ratio of the likelihood given the proposed value to the likelihood given the current value.

4. If $r \geq 1$, accept the proposed parameter. Otherwise:

a. If $r > u(0, 1)$, accept the proposed parameter ($u(0, 1)$ refers to a random number from a uniform distribution bounded by 0 and 1; when coding the MH algorithm, we can just call it u);

b. else, keep the current parameter.

5. The accepted parameter becomes the current parameter. Store the accepted parameter.

6. Repeat steps 2 – 5 as many times as you like.

How many iterations are used is somewhat a matter of preference, but there are a
couple of considerations:
9.5.1.2.1 Model complexity

A model with a single parameter will generally require fewer samples for the
MH algorithm to converge to the target distribution than models with two or
more. Whereas 1,000 iterations might be enough to generate a smooth target
distribution for a single parameter, 1,000,000 or more might be required in
situations where two or more parameters have to be estimated simultaneously.

9.5.1.2.2 Burn-in Period

As noted above, the starting parameter value or values for the MH algorithm are guesses (educated or otherwise). The algorithm is dumb – it doesn't know where it's going, and if you happen to start it far away from the center of the target distribution, there may be some meaningless wandering at the beginning. This problem is more pronounced when there are multiple parameters. Thus, the first few iterations are often referred to as the burn-in period and are thrown out of estimates of the posterior distribution. As with the overall number of trials, there's no a priori way of knowing how many iterations the burn-in period should last. For what it's worth: I learned that the burn-in period should be about the first 3% of trials (presumably based on my stats teacher's MCMC experience). Basically, you want to take out any skew caused by the first bunch of trials: if you drop them and the mean and standard deviation of the distribution change, they probably should be dropped; if you drop more and the mean and standard deviation stay the same, then you probably don't need to drop the additional trials.
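That check is easy to sketch in code. Here, chain is a hypothetical stored vector of accepted parameter values – substitute whatever vector your sampler actually produced:

burnin<-3000
c(mean(chain), sd(chain))                           ## with the burn-in samples included
c(mean(chain[-(1:burnin)]), sd(chain[-(1:burnin)])) ## with the burn-in samples dropped

If the two pairs of numbers are essentially the same, the burn-in period you dropped was probably long enough.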

9.5.1.2.3 Auto-correlation

Getting stuck in place is a concern for MH models when there are multiple parameters involved: when a combination of multiple parameters needs to increase the posterior likelihood to have a high chance of being accepted, there can be long periods of continuing to accept the current parameters. The concern there is that some parameter values could inappropriately be overrepresented by virtue of the search algorithm getting stuck. That concern is easily alleviated by keeping only every nth iteration (say, every 100th value of the parameters), but doing so might require increasing the overall number of iterations and more computing time.6
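The thinning itself is a one-liner (again, chain is a hypothetical stored vector of samples):

thinned<-chain[seq(from=1, to=length(chain), by=100)] ## keep every 100th sample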

The following code performs the MH algorithm on data from a binomial experiment where $s = 16$ and $f = 4$ (note: these are the same data from the Classical and Bayesian Inference page). Figure 9.5 illustrates the selection process of the algorithm for 1,000 iterations. For this example, we know that the target distribution is a $\beta$ distribution with a mean of $\frac{s+1}{s+f+2} \approx 0.77$ and a standard deviation of $\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{s+f+3}} \approx 0.088$. Please note that the accepted values of the parameter (which is called theta in the code) accumulate near the center of the target distribution, but occasionally less likely values are accepted.

set.seed(77)
s<-16
N<-20
theta<-c(0.5, rep(NA, 999))
for (i in 2:1000){ #Number of MCMC Samples
theta2<-runif(1) #Sample theta from a uniform distribution
ltheta2<-dbinom(s, N, theta2) ## p(D|H) for new theta
ltheta1<-dbinom(s, N, theta[i-1]) #p(D|H) for old theta
r<-ltheta2/ltheta1 #calculate likelihood ratio

u<-runif(1) #Sample u from a uniform distribution


theta[i]<-ifelse(r>=1, theta2, #replace theta according to MH a
ifelse(r>u, theta2, theta[i-1]))
}
x<-1:1000
theta<-data.frame(theta, ltheta2, x)
Figure 9.5: Results of 1,000 Markov Chain Monte Carlo Samples Selected Using
the Metropolis-Hastings Algorithm

The mean of the posterior distribution generated by the MH algorithm is 0.78 and
the standard deviation is 0.086 – very close to the analytic solution from the
posterior β distribution that has a mean of 0.77 and a standard deviation of
0.088.

Now that we have some test cases to indicate that the algorithm works and how it
works, let’s review a couple of useful applications for estimating posterior
distributions on parameters: multinomial models and regression.

9.6 Multinomial Models


Multinomial Models are tools for describing the relationships between latent
processes and observable outcomes. For example, multinomial models are often
used in genetics research: genotypes are generally latent (they were way more
latent before modern microscopy), while phenotypes are observable. If Gregor
Mendel had computers and knew of Markov Chain Monte Carlo methods (both of
which were invented after he died), he might have used the Metropolis-Hastings
algorithm (also posthumous for our monk friend) to model the distribution of
dominant and recessive genes in pea plants.

Psychology, of course, encompasses the study of many unobservable latent processes. We can't directly measure how much a person stores in their memory nor how likely they are to be able to access what they have stored, but we can measure what stimuli they can recall or recognize. We can't directly measure how people perceive different features of a stimulus, but we can measure what features they can identify.

Multinomial models in psychology presuppose that multiple processes can occur


in the mind with different probabilities and map out the combination or
combinations of processes that need to occur to produce an observable outcome
(like a behavior). For example, a multinomial memory model might posit that to
remember somebody’s name, one would have to A. have stored that person’s
name in their memory and B. be able to retrieve it from memory. In that model, if
it is assumed that a stimulus is either stored or is not, then there is a probability
of storage in memory and the probability of non-storage is 1 minus that
probability.7 Then, there is a probability of retrieval and a probability of non-
retrieval (1 − p(retrieval)). The goal of multinomial modeling is to recover
those probabilities based on observable behaviors. Those probabilities can be
used to describe latent processes under different conditions. To continue with the memory example: if our research question is how do memory processes change as a function of time?, we may compare the posterior distribution of $p(storage)$ at time $t$ with the posterior distribution of $p(storage)$ at time $t + 1$, the posterior distribution of $p(retrieval)$ at time $t$ with the posterior distribution of $p(retrieval)$ at time $t + 1$, etc.

Multinomial models are naturally described using probability trees. Figure 9.6 is
an example of a relatively basic generic multinomial model.
Figure 9.6: Generic Tree Representation of a Multinomial Model (adapted from
Batchelder & Reifer, 1999)

In this model, for the behaviors noted in Cell 18 to occur, two events need to happen: one with probability $\theta_a$ and another with probability $\theta_b$. Thus, behavior 1 (recorded in Cell 1) happens with probability $\theta_a\theta_b$. Behavior 2 happens with probability $\theta_a(1-\theta_b)$, behavior 3 happens with probability $(1-\theta_a)\theta_b$, and behavior 4 happens with probability $(1-\theta_a)(1-\theta_b)$.

The blessing and the curse of multinomial models is that they are flexible enough
to handle a wide range of theories of latent factors. If there are, say, four possible
processes that could occur in the mind at a given stage of the process, one can
construct a multinomial model with four branches coming off of the same node. If
different paths can lead to the same behavior (for example, remembering a fact
can help you on a multiple choice test but guessing can get you there – less
reliably – too), one can construct a multinomial model where multiple paths end
in the same outcome. A multinomial model can take on all kinds of forms based
on the theory of the latent processes that give rise to observable phenomena like
behaviors and physiological changes: that’s the blessing. The curse is that we
have to come up with likelihood functions that fit the models, which can be a
chore.
For the model in Figure 9.6, the likelihood function is structurally similar to the binomial likelihood function: we take the probabilities involved in getting to each end of the paths in the model and multiply them by the number of ways we can go down those paths. That is, the likelihood function is the product of a combinatorial term and a kernel probability term.

Let's say we have an experiment with $N$ trials, where $n_1$, $n_2$, $n_3$, and $n_4$ are the number of observations of behaviors 1, 2, 3, and 4, respectively. The combinatorial term – $N$ things combined $n_1$, $n_2$, and $n_3$ at a time (with $n_4$ technically being leftover) – is:

$$\frac{N!}{n_1!\,n_2!\,n_3!\,n_4!}$$

The kernel probability is given by the product of each probability – $\theta_a$, $\theta_b$, $1-\theta_a$, and $1-\theta_b$ – raised to the power of the counts of the cells in which they play a part in getting to:

$$\theta_a^{n_1+n_2}\theta_b^{n_1+n_3}(1-\theta_a)^{n_3+n_4}(1-\theta_b)^{n_2+n_4}$$

Thus, our entire likelihood function for the model – where the cells are represented by the letter $\phi_i$ – is:

$${\cal L}(\theta_a, \theta_b) = p(D|H) = \frac{N!}{n_1!n_2!n_3!n_4!}\theta_a^{n_1+n_2}\theta_b^{n_1+n_3}(1-\theta_a)^{n_3+n_4}(1-\theta_b)^{n_2+n_4}$$

We can make a user-defined function in R to calculate ${\cal L}(\theta_a, \theta_b)$:

likely<-function(n1, n2, n3, n4, theta.a, theta.b){
  L<-(factorial(n1+n2+n3+n4)/(factorial(n1)*factorial(n2)*factorial(n3)*factorial(n4)))*
    (theta.a^(n1+n2))*
    (theta.b^(n1+n3))*
    ((1-theta.a)^(n3+n4))*
    ((1-theta.b)^(n2+n4))
  return(L)
}

Now, we will run an experiment.


Actually, we don’t have time for that. Let’s suppose that we ran an experiment
with N = 100 and that these are the observed data for the experiment:

Observation n
Cell 1 28
Cell 2 12
Cell 3 42
Cell 4 18

And let's estimate posterior distributions for $\theta_a$ and $\theta_b$ using the MH algorithm. Note that in this algorithm, we have two starting/current parameter values and two proposed parameter values, corresponding to $\theta_a$ and $\theta_b$. At each iteration, we plug values for both parameters into the likelihood function and either accept both proposals or keep both current values. This tends to slow things down a bit, but is more flexible than algorithms that change one parameter at a time.9 Given that we have two parameters to estimate, for this analysis we will use 1,000,000 iterations.

iterations<-1000000

theta.a<-c(0.5, rep(NA, iterations-1))


theta.b<-c(0.5, rep(NA, iterations-1))

n1=28
n2=12
n3=42
n4=18

for (i in 2:iterations){
theta.a2<-runif(1)
theta.b2<-runif(1)
r<-likely(n1, n2, n3, n4, theta.a2, theta.b2)/likely(n1, n2, n3, n4, theta.a[i-1], theta.b[i-1])
u<-runif(1)
if (r>=1){
theta.a[i]<-theta.a2
theta.b[i]<-theta.b2
} else if (r>u){
theta.a[i]<-theta.a2
theta.b[i]<-theta.b2
} else {
theta.a[i]<-theta.a[i-1]
theta.b[i]<-theta.b[i-1]
}
}

Assuming a burn-in period of 3,000 iterations, here are the estimated posterior distributions of $\theta_a$ and $\theta_b$ given the observed data:

Figure 9.7: Posterior Distributions of $\theta_a$ and $\theta_b$ Given the Observed Data
As a check, let's plug the mean values of $\theta_a$ and $\theta_b$ back into our multinomial model to see if they result in predicted data similar to our observed data:

| Observation | $n_{obs}$ | $p(\phi_i)$ | $n_{pred}$ |
| --- | --- | --- | --- |
| Cell 1 | 28 | $\theta_a\theta_b$ | 27.997 |
| Cell 2 | 12 | $\theta_a(1-\theta_b)$ | 12.201 |
| Cell 3 | 42 | $(1-\theta_a)\theta_b$ | 41.651 |
| Cell 4 | 18 | $(1-\theta_a)(1-\theta_b)$ | 18.151 |
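Here's a sketch of that check in code, using the post-burn-in means of the chains generated above (the predicted counts should land close to the $n_{pred}$ column):

post.a<-mean(theta.a[-(1:3000)]) ## posterior mean of theta_a, burn-in dropped
post.b<-mean(theta.b[-(1:3000)]) ## posterior mean of theta_b, burn-in dropped

## predicted counts for Cells 1-4 out of N = 100 trials
100*c(post.a*post.b, post.a*(1-post.b), (1-post.a)*post.b, (1-post.a)*(1-post.b))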

9.7 Bayesian Regression Models


Regression models are the last application of MCMC we will cover here. In the
context of regression, the likelihood function is related to the fit of the model to
the data. For example, in classical linear regression, the likelihood function is the
product of the normal probability density of the prediction errors: the least-
squares regression line is the model with parameters that maximize that
likelihood function.10

Bayesian regression confers a couple of advantages over its classical counterpart. Because Bayesian models don't assume normality, there is less concern about using them with variables sampled from different distributions, and since the results are not based on the normal assumption, Bayesian regression outputs can include more informative interval estimates on model statistics like the slope and the intercept (for example: non-symmetric interval estimates where appropriate). Another advantage is that Bayesian interval estimates are often narrower than classical estimates in cases where MCMC methods give more precise measurements than classical methods, which, again, assume that model statistics are sampled from normal distributions. Finally (there are more I can think of, but I will stop here out of a sense of mercy), Bayesian point and interval estimates have more intuitive interpretations than classical estimates.

Bayesian regression using MCMC is complex in theory, but easily handled with
software. Using the BAS package, for example, we can use many of the same
commands that we use for classical regression to obtain Bayesian models.

As an example, let’s use the data represented in Figure 9.8.

Figure 9.8: Sample Scatterplot

First, let’s run a classical simple linear regression with the base R lm()
commands. Instead of the full model summary() command, we will limit our
output by using just the coef() command to get the model statistics and the
confint() command to get interval estimates for the model statistics.

coef(lm(y~x))

## (Intercept) x
## 0.2589419 3.8024346

confint(lm(y~x))
## 2.5 % 97.5 %
## (Intercept) -1.042028 1.559912
## x 3.127986 4.476884

Next, we will use the Bayesian counterpart bas.lm() from the BAS package,
with 500,000 MCMC iterations11:

library(BAS)
set.seed(77)
confint(coef(bas.lm(y~x,
MCMC.iterations = 500000)))

## 2.5% 97.5% beta


## Intercept 3.260162 5.387505 4.363655
## x 3.106006 4.435586 3.795684
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The results are very similar! The Bayesian interval estimate for the coefficient of
x (that is, the slope) is slightly narrower than the classical confidence interval,

but for the Bayesian interval, you’re allowed to say “there’s a 95% chance that
the parameter is between 3.1 and 4.4.” The intercept is slightly different, too, but
that has little to do with the differences between Bayesian and classical models
and much more to do with the fact that the BAS package uses an approach called
centering which forces the intercept to be the mean of the y variable – it’s a
choice that is also available in classical models and one that isn’t going to mean
a whole lot to us right now.

And that’s basically it. There are lots of different options one can apply
regarding different prior probabilities and Bayesian Model Averaging (which, as
the name implies, gives averaged results that are weighted by the posterior
probabilities of different models, but is only relevant for multiple regression),
but for basic applications, modern software makes regression a really easy entry
point to applying Bayesian principles to your data analysis.

1. At this point, I feel compelled to admit that I’m not very good at naming
variables.↩

2. Like, I’m really bad at naming variables↩


3. As a reminder: stochastic means probabilistic in the sense that is the
opposite of deterministic – that is: the outcomes in a stochastic process are
not determined by a series of causes leading to certain effects but by a
series of causes leading to possible effects.↩

4. Variances add linearly, standard deviations do not.↩

5. Recall from Probability Theory that the Random Walk is also known as the Drunkard's Walk: none of the steps that have gotten somebody who is blackout drunk to where they are predict the next step they will take.↩

6. Your experiences may vary with regard to how much additional computing
time may be incurred by increasing the number of MCMC iterations. If a lot
of MCMC investigations are being run, the time can add up. In the case of a
single analysis of a dataset, it’s probably more like an excuse to get up from
your computer and have a coffee break while the program is running. ↩

7. A more complicated memory model might posit that there is also a chance
of partial storage, as in the phenomenon when you can remember the first
letter or sound of somebody’s name.↩

8. “Cell” is used in this case in the same sense as “cells” in a spreadsheet


where data are recorded.↩

9. There is a specific subtype of the Metropolis-Hastings algorithm known as


the Gibbs Sampler that does hold one parameter constant while changing the
other at each iteration. The upside is that the Gibbs Sampler can be more
computationally efficient, the downside is that the Gibbs Sampler requires
assumptions about the conditional distributions of each parameter given the
other and thus is less broadly useful than the generic MH algorithm. It’s
fairly popular among Bayesian analysts and it’s not unlikely that you will
encounter it, but the MH algorithm can always get you – eventually – where
you need to go.↩

10. Not to get too into what I meant to be a brief illustration, but it might be
confusing that we want to maximize a function based on the errors when the
goal of regression modeling is to minimize the errors generally. The link
there is that the normal probability density is highest for values near the
mean of the distribution; and if the mean of the errors is 0, then minimizing
the errors maximizes the closeness of the errors to the mean – and thus the
peak – of the normal density.↩

11. Without getting too far into the mechanics of the BAS package, it uses a
subset of MH-generated samples to estimate statistics and intervals.↩
10 Assumptions of Parametric Tests
10.1 Probability Distributions and Parametric Tests
Imagine you’ve just done an experiment involving two separate groups of
participants – one group is assigned to a control condition and the other is
assigned to an experimental condition – and these are the observed data:

Figure 10.1: Hypothetical Experimental Data

Naturally, you want to know if there is a statistically significant difference


between the mean of the control group and the mean of the experimental group. To
test that difference, you decide to run an independent samples t-test on the data:

t.test(Experimental, Control, paired=FALSE, var.equal = TRUE)

##
## Two Sample t-test
##
## data: Experimental and Control
## t = 3.6995, df = 58, p-value = 0.000482
## alternative hypothesis: true difference in means is not equal
to 0
## 95 percent confidence interval:
## 0.4801216 1.6122688
## sample estimates:
## mean of x mean of y
## 2.296369 1.250174

The t-test says that there is a significant difference between the Experimental and
Control groups: t(df = 58) = 3.7, p < 0.01. That’s a pretty cool result, and you
may feel free to call it a day after observing it.

That one line of code – t.test(Experimental, Control, paired=FALSE,


var.equal = TRUE) – will get you where you need to go. There may be other
ways to analyze the observed data (for example: a nonparametric test or a
Bayesian t-test), but that is a perfectly reasonable way to analyze the results of an
experiment involving two separate groups of participants.

Using statistical software to calculate statistics like t and p is a little like riding a
bicycle in that it takes some skill and it will get you to your destination fairly
efficiently, but doing so successfully requires some assumptions. For the bicycle,
it is assumed that there is air in the tires and the chain is on the gears and the seat
is present: if those assumptions are not met, you still could get there, but the trip
would be less than ideal. For the t-test – and for other classical parametric tests –
the assumptions are about the source(s) of the data.

The t-test evaluates sample means using the t-distribution. It is based largely on the Central Limit Theorem, which tells us that sample means are distributed as normals when $n$ is sufficiently large…

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{N}\right)$$

…but it doesn't tell us how large is sufficiently large. $n = 1,000,000$ would probably do it. The old rule of thumb is that $n = 30$ does it, but that is garbage.1

The only way to guarantee that the sample means – regardless of sample size – will be distributed normally is to sample from a normal distribution. We can see that rather intuitively: imagine you had samples of size $n = 1$. The sample means for an $n$ of 1 from a normal distribution would naturally form a normal distribution: it would be just like taking a normal distribution apart one piece at a time and rebuilding it (see Figure 10.2).

Figure 10.2: Normal Parent Distribution and Sampling Distribution with n = 1

In the case of taking samples of n = 1 from a normal distribution, the distribution


of sample means would be the same distribution we started with.2 This is the only
distribution for which we can make such a claim. Hence: we assume that the data
are sampled from a normal distribution.
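If you'd like to see for yourself how unhelpful blanket rules of thumb are, a quick simulation works nicely: draw repeated small samples from a decidedly non-normal parent distribution and look at the distribution of their means. Here's a sketch using an exponential parent, chosen simply because it's conveniently skewed:

set.seed(77)
means<-replicate(10000, mean(rexp(10, rate=1))) ## 10,000 sample means, each from n = 10
hist(means, breaks=50)                          ## still visibly right-skewed: not normal yet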

The other feature that the use of the t-distribution to evaluate means depends on is the variance: as noted in Probability Distributions, the t-distribution models the distribution of sample means standardized by the sampling error. The standard deviation of a t-distribution is the standard error, and, like the immortals in Highlander…

Figure 10.3: This reference is almost too old even for me.

Thus, we assume that the data in the groups were drawn from populations with
equal variances. This is known as either the homoscedasticity assumption or the
homogeneity of variance assumption: take your pick (I prefer homoscedasticity
because it’s more fun to say). Another way to view the importance of
homoscedasticity is to consider what we would make of the results shown in
Figure 10.1 if we knew that they came from the distributions shown in Figure 10.4:

Figure 10.4: Populations with Means of 1.25 and 2.23, Respectively, and Very
Different Variances

Were the populations to vary as in Figure 10.4, we might not be so sure about the
difference between groups: it would appear that the range of possibilities in
sampling from the Experimental population comfortably includes the entire
Control population and that the difference is really of variances rather than means
(which is not what we want to test with a t-test).

Maybe the simplest reason for assuming homoscedasticity is a research-methods
reason: we want to assume that the only thing different about our populations is the
experimental conditions. That is, all of our participants come in with their usual
participant-to-participant differences with regard to the dependent measure
(whatever it may be) and it is only the experimental manipulation that moves the
means of the groups apart.

These assumptions undergird the successful use of the t-test. Like the bicycle –
remember the bicycle analogy? – we can go ahead and use the t-test (or whatever
other parametric test we are using) assuming that everything is structurally sound.
If it gets us where we’re going, then violation of our assumptions probably didn’t
hurt us too much. But the responsible thing to do before riding your bicycle is to at
least check your tire pressure, and the responsible thing to do before running
classical parametric tests is to check your assumptions.

10.2 The General Assumptions


10.2.1 Scale Data

The classical parametric tests make inferences on parameters – means, variances,
and the like – which are, by definition, numbers. To make inferences about
numbers, we need data that are also numbers. Continuous data – interval- or
ratio-level data – will work just fine. Ordinal data can also be used, but as
discussed in Categorizing and Summarizing Information, it makes more logical
sense to use nonparametric tests for ordinal data, as means and variances aren’t
likely to be as meaningful for ordinal data as they are for continuous data.

10.2.1.1 Testing the Scale Data Assumption

If your data are numbers, they are scale data. End of test.

10.2.2 Random Assignment

The classical parametric tests assume that in experimental studies with different
conditions, individual observations – be they observations of human
participants, animal subjects, tissue samples, etc. – arise from random assignment
to each of those experimental conditions. The key to random assignment is that
individuals in a study should be, on average, the same with regard to the dependent
variables before the experiment begins.

For example, please imagine an educational psychology experiment involving
undergraduate volunteers that tests the effect of two kinds of mathematics training
– Program A and Program B – on performance on a problem set. If an
experimenter were to find all of the math majors among the volunteers and to
assign them all to the Program A group, we might not be surprised to see that
Program A appears to have a more positive effect on problem set performance
than Program B. That would not be random assignment. That would be cheating. If
the volunteers were instead randomly assigned to groups, then, at least in theory,
the levels of prior math instruction among participants would average out, and the
condition means could be more meaningfully assessed.

10.2.2.1 Testing the Random Assignment Assumption

No test for this one: we just have to rely on scientific ethics.

10.2.3 Normality

The normality assumption is that the data are sampled from a normal
distribution. This does not mean that the observed data themselves are normally
distributed!

That makes testing the normality assumption theoretically tricky. While data that
come from a normal distribution are not – and I can’t emphasize this enough –
necessarily themselves normally distributed, we test the normality assumption by
seeing whether the observed data could plausibly have come from a normal
distribution. We do so with a classical hypothesis test, with the null hypothesis
being that the data were sampled from a normal distribution, and the alternative
hypothesis being that the data were not sampled from a normal distribution. So, it’s
actually more accurate to say that we are testing the cumulative likelihood that the
observed data or more extremely un-normal data were sampled from a normal
distribution.

To summarize the above paragraph of nonsense: if the result of normality testing is
not significant, then we are good with our normality assumption.

If the result is significant, then we have evidence that the data violate the normality
assumption, in which case, we have options as to how to proceed.

Tests of normality are all based around what would be expected if the data were
sampled from a normal distribution. The differences between different tests arise
from the different sets of expectations they use. For example, on this page there are
three tests of the normality assumption: the $\chi^2$ goodness-of-fit test, the
Kolmogorov-Smirnov test, and the Shapiro-Wilk test. The $\chi^2$ test is based on how
many data points fall into different places based on what we would expect from
a sample from a normal. The Kolmogorov-Smirnov test is based on how the
observed cumulative distribution compares to a normal cumulative distribution.
The Shapiro-Wilk test is based on correlations between observed data and data
that would be expected from a normal distribution with the same dimensions.
But the general idea of all these tests is the same: they compare the observed data
to what would be expected if they were sampled from a normal: if those things are
close, then we continue to assume our null hypothesis that the normality
assumption is met, and if those things are wildly different, then we reject that null
hypothesis.

10.2.3.1 Testing the Normality Assumption

10.2.3.1.1 The $\chi^2$ Goodness-of-fit Test

The $\chi^2$ goodness-of-fit test, as noted above, assesses where observed data points
fall relative to each other. Let’s unpack that concept. One thing we know about the
normal distribution is that 50% of the distribution is less than the mean and 50% of
the distribution is greater than the mean:

Figure 10.5: A Bisected Normal Distribution


Logically, it follows that if data are sampled from a normal distribution, roughly
half should come from the area less than the mean and roughly half should come
from the area greater than the mean. The distribution pictured in Figure 10.5, for
example, is a standard normal distribution: were we to sample from that
distribution, we would expect roughly half of our data to be less than 0 and
roughly half to be greater than 0.

Likewise, if we divided a normal distribution into four parts, as in Figure 10.6…

Figure 10.6: A Tetrasected Normal Distribution

… we would expect roughly a quarter of our data set to fall into each quartile of a
normal distribution:
Figure 10.7: Four Parts of a Standard Normal Distribution Overlaid with a
Histogram of 10,000 Samples from a Standard Normal Distribution

Whereas if we drew samples from a non-normal distribution like the $\chi^2$
distribution (just because it’s known to be skewed – the fact that it’s in the $\chi^2$
test section is just a nice coincidence), we would see that the samples do not line
up with the normal distribution with the same mean and standard deviation:

Figure 10.8: Four Parts of a Normal Distribution with $\mu = 1$ and $\sigma^2 = 2$ Overlaid
with a Histogram of 10,000 Samples from a $\chi^2$ Distribution with $df = 1$
($\mu = 1$, $\sigma^2 = 2$)

Thus, the $\chi^2$ goodness-of-fit test assesses the difference between how many values
of a dataset we would expect in different ranges based on quantiles of a normal
distribution and how many values of a dataset we observe in those ranges. We
don’t know which normal distribution the data came from3, but our best guess
is that it is the normal with a mean equal to the mean of the observed data
($\mu = \bar{x}_{observed}$) and with a standard deviation equal to the standard deviation of the
observed data ($\sigma = s_{observed}$).

To perform the $\chi^2$ goodness-of-fit test, we divide the assumed distribution – in
this case, the normal distribution with a mean equal to the observed sample mean
and a standard deviation equal to the observed sample standard deviation – into k
equal parts, divided by quantiles (if k = 4, then the quantiles are quartiles; if
k = 10, then the quantiles are deciles, etc.). As an example, consider the
following made-up observed data:

Made-up Data
1.6  3.7  5.1  6.7
2.1  4.5  5.6  7.3
2.2  4.6  5.7  7.8
2.3  4.8  6.3  8.1
2.8  4.9  6.6  9.0

The mean of the made-up observed data is 5.08 and the standard deviation of the
made-up observed data is 2.16. If we decide that k = 4 for our $\chi^2$ test, then we
want to find the quartiles for a normal distribution with a mean of 5.08 and a
standard deviation of 2.16.

qnorm(c(0.25, 0.5, 0.75), 5.08, 2.16)

## [1] 3.623102 5.080000 6.536898

For k groups – which we usually call cells – we expect about n/k observed
values in each cell. The $\chi^2$ test is based on the difference between the observed
frequency $f_o$ in each cell and the expected frequency $f_e$ in each cell. The statistic
known as the observed chi-squared statistic ($\chi^2_{obs}$) is the sum of the ratios of the
squared difference between $f_o$ and $f_e$ to $f_e$ – the squared deviation from the
expectation normalized by the size of the expectation:

$$\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e}$$

The larger the difference between the observation and the expectation, the larger
the $\chi^2_{obs}$.

For the observed data, 20/4 = 5 values are expected in each quartile as defined
by the normal distribution: 5 are expected to be less than 3.623102, 5 are expected
to be between 3.623102 and 5.080000, 5 are expected to be between 5.080000
and 6.536898, and 5 are expected to be greater than 6.536898. The observed
values are:

−∞ → 3.623 3.623 → 5.080 5.080 → 6.537 6.537 → ∞

5 5 4 6

The observed $\chi^2$ value is:

$$\chi^2_{obs} = \sum \frac{(f_o - f_e)^2}{f_e} = \frac{(5-5)^2}{5} + \frac{(5-5)^2}{5} + \frac{(4-5)^2}{5} + \frac{(6-5)^2}{5} = 0.4$$

The $\chi^2$ test uses as its p-value the cumulative likelihood of $\chi^2_{obs}$ given the df of
the $\chi^2$ distribution. The df is given by $k - 1$.4 Thus, we find the cumulative
likelihood of $\chi^2_{obs}$ – the area under the $\chi^2$ curve to the right of $\chi^2_{obs}$ – given
$df = k - 1$.

pchisq(0.4, df=3, lower.tail=FALSE)

## [1] 0.9402425

The p-value is 0.94 – nowhere close to any commonly-used threshold for α
(including 0.05, 0.01, and 0.001) – so we continue to assume the null hypothesis
that the observed data were sampled from a normal distribution.
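To see all of those steps in one place, here is a compact sketch (my own assembly, not a built-in normality test) that reproduces the whole $\chi^2$ goodness-of-fit calculation using only qnorm(), table(), and pchisq(). It uses the unrounded sample mean and standard deviation, so the boundaries differ trivially from the rounded ones above, but the cell counts are the same:

data <- c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)
k <- 4
cuts <- qnorm(seq(0, 1, length.out = k + 1), mean(data), sd(data))  # quantile boundaries, -Inf to Inf
f.o <- table(cut(data, cuts))          # observed frequency in each cell
f.e <- length(data) / k                # expected frequency per cell: n/k
chisq.obs <- sum((f.o - f.e)^2 / f.e)  # 0.4 for these data
pchisq(chisq.obs, df = k - 1, lower.tail = FALSE)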

There is one point that hasn’t been addressed – why k = 4, or why any particular
k? There are two things to keep in mind. The first is that a general rule for the $\chi^2$
test is that it breaks down when $f_e < 5$: when the denominator of $\frac{(f_o - f_e)^2}{f_e}$ is
smaller than 5, the size of the observed $\chi^2$ statistic can be overestimated. The
second is that the larger the value of k – i.e., the more cells in the $\chi^2$ test – the
greater the power of the $\chi^2$ test and the better the ability to accurately identify
samples that were not drawn from normal distributions. Thus, the guidance for the
size of k is to use as many cells as possible while keeping $f_e \ge 5$ in each cell.

10.2.3.1.2 The Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov Test5 is a goodness-of-fit test that compares an
observed cumulative distribution with a theoretical cumulative distribution. As
illustrated in the page on probability distributions, the cumulative normal
distribution is an s-shaped curve:

Figure 10.9: The Cumulative Standard Normal Distribution

To get the empirical cumulative distribution for a dataset, we order the data from
smallest to largest. The empirical cumulative probability at the smallest value in
the ordered dataset is 1/n; at the second-smallest value it is 2/n; at the
third-smallest value it is 3/n; and so on, until the largest value in the ordered
dataset, where the empirical cumulative probability is n/n = 1. For example, here
is the empirical cumulative curve of the sample data:
Figure 10.10: Empirical Cumulative Curve of the Made-up Data

The Kolmogorov-Smirnov test of normality is based on the differences between
the observed cumulative distribution and the theoretical cumulative distribution
with a mean equal to the mean of the sample data and a standard deviation equal to
the standard deviation of the sample data. More specifically, the test statistic is the
maximum difference between the observed and theoretical cumulative
distributions (with a couple of rules to determine what to do about ties in the
data); this statistic is called D. How the p-value is determined is a bit too
complicated to be worth covering here, especially because the Kolmogorov-Smirnov test is
applied pretty easily using software. As shown below, the ks.test() command in
the base R package takes as arguments the observed data, the cumulative
probability distribution that the data are to be compared to – for a test of
normality, this is the cumulative normal, known to R as “pnorm” – and the
sufficient statistics for the cumulative probability distribution – again, for
normality: the mean and the standard deviation, which are given by the mean and
standard deviation of the observed data.

data<-c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)

ks.test(data, "pnorm", mean(data), sd(data))


##
## One-sample Kolmogorov-Smirnov test
##
## data: data
## D = 0.10489, p-value = 0.9639
## alternative hypothesis: two-sided

As with the $\chi^2$ test, the p-value is large by any standard, and thus we continue to
assume that the data were sampled from a normal distribution.
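If you’re curious where D comes from, here is a sketch of the calculation (an assumption-laden reconstruction of mine, which ignores the tie-handling rules; that doesn’t matter for these data because there are no ties):

data <- c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)
x <- sort(data)
n <- length(x)
theory <- pnorm(x, mean(x), sd(x))  # fitted normal CDF at each ordered value
# The empirical CDF steps from (i-1)/n up to i/n at each ordered value;
# D is the largest gap between the step function and the fitted CDF.
D <- max(pmax(abs((1:n)/n - theory), abs((0:(n - 1))/n - theory)))
D  # should reproduce the D = 0.10489 reported by ks.test() above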

10.2.3.1.3 The Shapiro-Wilk Test

The Shapiro-Wilk test is based on correlations between the observed data and
datapoints that would be expected from a normal distribution with a mean equal to
the mean of the observed data and a standard deviation equal to the standard
deviation of the observed data.6

Unlike the $\chi^2$ and Kolmogorov-Smirnov tests, the Shapiro-Wilk test is specifically
made for testing normality, and its implementation in R is the simplest of the three
tests:

data<-c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)

shapiro.test(data)

##
## Shapiro-Wilk normality test
##
## data: data
## W = 0.96466, p-value = 0.6406

10.2.3.2 Testing Normality for Multiple Groups

A lot of the parametric tests that we use – correlation, regression, independent-samples
t-tests, ANOVA, and more – test the associations or differences between
multiple samples. However, for each classical parametric analysis, there is only
one normality assumption: that all the data are sampled from a normal. For
example: were we interested in an analysis of an experiment with three conditions,
we would not run three normality tests. We would only run one.

Of course, if there are multiple groups, each with different means, the data will not
look like they came from a single normal, as in the pooled data from samples from
three different normal distributions depicted in Figure 10.11.

set.seed(77)
group1<-rnorm(100, 1, 1)
group2<-rnorm(100, 11, 1)
group3<-rnorm(100, 22, 1)

Figure 10.11: Distribution of the Combination of Three Groups of Data, Each
Sampled From a Different Normal Distribution

When we run the Shapiro-Wilk test on all the observed data, we reject the null
hypothesis: the pooled data violate the normality assumption.

data<-c(group1, group2, group3)


shapiro.test(data)

##
## Shapiro-Wilk normality test
##
## data: data
## W = 0.86785, p-value = 2.349e-15
The solution to this problem is to not assess the observed data, but rather to assess
the residuals of the data. The residuals of any group of data are the differences
between each value in the group and the group mean ($x - \bar{x}$). Because the mean
is the balance point of the data, the mean of the residuals of any data set will be
equal to 0. When we calculate and then put together the residuals of the 3 groups
depicted in Figure 10.11:

residuals<-c(group1-mean(group1), group2-mean(group2), group3-mean(group3))

then we have a single group of data with a mean of 0, as depicted in Figure 10.12.

Figure 10.12: Distribution of the Combined Residuals of Three Groups of Data,
Each Sampled From a Different Normal Distribution

And running a goodness-of-fit test on the residuals, we find that the assumption of
normality holds:

shapiro.test(residuals)

##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.99431, p-value = 0.3251

It should be noted that we need to run normality tests on the residuals – rather than
the observed data – when we are assessing the normality assumption before
running parametric tests with multiple groups. We don’t have to calculate
residuals when testing the normality assumption for a single data set, but we could
without changing the results. Subtracting the mean of a dataset from each value in
that set only changes the mean of the dataset from whatever it is to zero
(unless the mean of the dataset is already exactly zero, in which rare case it would
remain the same). Changing the mean of a normal distribution does not change the
shape of that normal distribution; it just moves it up or down the x-axis. When we
change the mean of an observed dataset, we also change the mean of the
comparison normal distribution for the purposes of normality testing, and both the
shape of the data and the shape of the comparison distribution remain the same. So,
for a single set of data, since the residuals will have the same shape as the
observed data, we can run tests of normality on either the residuals or the
observed data and get the same result – it’s only for situations with multiple
groups of data that it becomes an issue.

10.2.3.3 Comparing Goodness-of-fit Tests

Of the three goodness-of-fit tests, the $\chi^2$ test is the most intuitive (admittedly a
relative achievement). It is also extremely flexible: it can be used not just to assess
goodness-of-fit to a normal distribution, but to any distribution of data, so long as
expected values can be produced to compare with observed values. It is also the
least powerful of the three tests when it comes to testing normality – regardless of
how many cells are used – and as such is the most likely to miss samples that
violate the assumption that the data are drawn from a normal distribution.

The Kolmogorov-Smirnov test, like the $\chi^2$ test, is not limited to use with the
normal distribution. It can be used for any distribution for which a cumulative
theoretical distribution can be produced. It is also more powerful than the $\chi^2$ test
and therefore less likely to miss violations of normality. Put those facts together
and the Kolmogorov-Smirnov test is the most generally useful goodness-of-fit test
of the three.

The Shapiro-Wilk test is the most powerful of the three and the easiest to
implement in software. The only drawback to the Shapiro-Wilk test is that it is
specifically and exclusively designed to test normality: it is useless if you want to
assess the fit of data to any other distribution. Still, if you want to test normality,
and you have a computer to do it with, the Shapiro-Wilk test is the way to go.

10.2.3.4 Visualizing Normality Tests: the Quantile-Quantile (Q-Q) Plot

A quantile-quantile or Q-Q plot is a visual representation of the comparison
between an observed dataset and the theoretical probability distribution (such as
the normal) against which it is being assessed. In a Q-Q plot such as the one in Figure
10.13, which depicts the goodness-of-fit between the made-up example data and a
normal distribution, quantiles of the observed data (such as 1/n, as used in Figure
10.13, to depict each observed data point) are plotted against the counterpart
quantiles of the comparison distribution. If the two distributions are identical, then
the points on the Q-Q plot precisely follow the 45° diagonal line, indicating that
the quantiles are perfectly aligned. Departures from the diagonal can provide
information about poor fits to the distribution being studied (the normal, for one,
but a Q-Q plot can be drawn for any distribution one wants to model): systematic
departures – like all of the points being above or below the diagonal for a certain
region, or curvature in the plot – can give information about how the observed data
depart from the selected model and/or clues to what probability distributions
might better model the observed data.

In the case of the sample data, though, the observed data are well-modeled by a
normal distribution: there is good goodness-of-fit, and we continue to assume that
the data are sampled from a normal distribution.

data<-c(1.6, 2.1, 2.2, 2.3, 2.8, 3.7, 4.5, 4.6, 4.8, 4.9, 5.1, 5.6, 5.7, 6.3, 6.6, 6.7, 7.3, 7.8, 8.1, 9.0)
qqnorm(data)
qqline(data, col="#9452ff")
Figure 10.13: Quantile-Quantile Plot of the Goodness-of-Fit Between the Made-up
Data and a Normal Distribution

10.2.4 Homoscedasticity (a.k.a. Homogeneity of Variance)

Parametric tests involving more than one group (including the independent-groups
t-test and ANOVA) assume that the data in those groups are all sampled
from distributions with equal variance. In an experimental context, this assumption
implies that even though group means might be different in different
conditions, everybody was roughly equivalent before the experiment started; in a
comparison of two different populations, we assume that each population has
equal variance (but not necessarily equal means).

10.2.4.1 Testing Homoscedasticity

10.2.4.1.1 Hartley’s $F_{max}$ Test

Hartley’s $F_{max}$ test assesses homoscedasticity using the ratio of observed sample
variances, specifically: the ratio of the largest variance to the smallest variance.7 If
all observed sample variances are precisely equal, then all evidence would point
to homoscedasticity, and the ratio of the largest variance to the smallest variance
(or technically, of any pair of variances if they are all identical) is 1. Of course, it
is extraordinarily unlikely that any two sample variances would be exactly
identical – even if two samples came from the same distribution, they would
probably have at least slightly different variances.

Hartley’s test examines the departure from a variance ratio of 1, with greater
allowances for departures given for combinations of small sample sizes and higher
numbers of variances. The test statistic is the observed $F_{max}$, and it is compared to
a critical $F_{max}$ value given the df of the smallest group and the number of
groups.8

For example, assume the following data from two groups are observed:

Placebo Drug
3.5 3.4
3.9 3.6
4.0 4.3
4.0 4.5
4.7 4.8
4.9 4.8
4.9 4.9
4.9 5.0
5.1 5.1
5.1 5.2
5.3 5.2
5.4 5.4
5.4 5.5
5.6 5.5
5.6 5.6
5.7 5.7
6.0 5.7
6.0 5.7
6.6 6.3
7.8 8.1
The variance of the placebo group is 0.98, and the df is $n - 1 = 19$. The
variance of the drug group is 0.96, and the df is also $n - 1 = 19$. The observed
$F_{max}$ statistic – the ratio of the larger variance to the smaller – is:

$$F_{max(obs)} = \frac{0.98}{0.96} = 1.02$$

We can consult an $F_{max}$ table to find that the critical value of $F_{max}$ for $df = 19$
and $k = 2$ groups being compared is 2.53. Equivalently, we can use the
qmaxFratio() command from the SuppDists package, entering our desired α
level, the df, the number of groups k, and indicating lower.tail=FALSE, which is
marginally faster than consulting a table:

library(SuppDists)
qmaxFratio(0.05, 19, 2, lower.tail=FALSE)

## [1] 2.526451

The critical $F_{max}$ is the largest value of $F_{max}$ we would expect – given the df of the
smallest group, the desired α level, and the number of groups k – if the population
variances were equal. Since the critical $F_{max}$ for $df = 19$, $k = 2$, and $\alpha = 0.05$
is 2.526451, and the observed $F_{max}$ value – 1.02 – is less than the critical $F_{max}$,
we continue to assume that the data were sampled from populations with equal
variance.
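As a sketch, the observed $F_{max}$ is trivial to compute directly; the vectors here are transcribed from the table above (they also reappear in the Levene-test example below):

Placebo <- c(3.5, 3.9, 4.0, 4.0, 4.7, 4.9, 4.9, 4.9, 5.1, 5.1, 5.3, 5.4, 5.4, 5.6, 5.6, 5.7, 6.0, 6.0, 6.6, 7.8)
Drug <- c(3.4, 3.6, 4.3, 4.5, 4.8, 4.8, 4.9, 5.0, 5.1, 5.2, 5.2, 5.4, 5.5, 5.5, 5.6, 5.7, 5.7, 5.7, 6.3, 8.1)
F.max <- max(var(Placebo), var(Drug)) / min(var(Placebo), var(Drug))
F.max  # about 1.02, well under the critical value of 2.53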

The only potential drawback to Hartley’s $F_{max}$ test is that it assumes normality. If
we’re testing normality anyway, that’s not a big deal! But, if you want to test
homoscedasticity without being tied to assuming normality as well, might I
recommend…

10.2.4.2 The Brown-Forsythe Test and Levene’s Test

The Brown-Forsythe test of homogeneity is an analysis of the variance of the
absolute deviations – the absolute value of the difference between each value and
its group median $|x - \tilde{x}|$ – of different groups. Literally! The Brown-Forsythe test is
an ANOVA of the absolute deviations. We reject the null hypothesis – and
therefore the assumption of homoscedasticity – if the variability of the absolute
deviations between groups is significantly greater than the variability of the
absolute deviations within each group.

Levene’s test is based on the same concept as the Brown-Forsythe test, but can use
deviations either from the median or the mean of each group. The leveneTest()
command from the car (Companion to Applied Regression) package, by default,
uses the median. The bf.test() command from the onewaytests package is
related to the Brown-Forsythe test of homoscedasticity, but returns misleading
results when group medians differ. Therefore, use the leveneTest() command to
test homoscedasticity!

The Levene test command leveneTest() accepts data arranged in a long data
frame format. In a wide data frame, the data from different groups are arranged in
different columns; in a long data frame, group membership for each data point in a
column is indicated by a grouping variable in a different column. For example, the
following pair of tables show the same data in wide format (left) and in long
format (right):

Wide Format:

Group A   Group B
1         4
2         5
3         6

Long Format:

Group     Data
Group A   1
Group A   2
Group A   3
Group B   4
Group B   5
Group B   6
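As a minimal sketch of the conversion (the column names here are hypothetical), base R’s stack() turns a wide data frame into a long one:

wide <- data.frame(Group.A = c(1, 2, 3), Group.B = c(4, 5, 6))
long <- stack(wide)            # returns a values column and an ind (group) column
names(long) <- c("Data", "Group")
long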

To put the two-group example data into long format, and then to run the Levene test
on the variances of those data, takes just a few lines of R code:

library(car)

## Loading required package: carData

##
## Attaching package: 'car'

## The following object is masked from 'package:ROCit':


##
## logit

## The following object is masked from 'package:gtools':


##
## logit

## The following object is masked from 'package:dplyr':


##
## recode

## The following object is masked from 'package:purrr':


##
## some

## The following object is masked from 'package:DescTools':


##
## Recode

Placebo<-c(3.5, 3.9, 4.0, 4.0, 4.7, 4.9, 4.9, 4.9, 5.1, 5.1, 5.3, 5.4, 5.4, 5.6, 5.6, 5.7, 6.0, 6.0, 6.6, 7.8)

Drug<-c(3.4, 3.6, 4.3, 4.5, 4.8, 4.8, 4.9, 5.0, 5.1, 5.2, 5.2, 5.4, 5.5, 5.5, 5.6, 5.7, 5.7, 5.7, 6.3, 8.1)

lt.example<-data.frame(Group=c(rep("Placebo", 20), rep("Drug", 20)), Data=c(Placebo, Drug))


leveneTest(Data~Group, data=lt.example)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.0891  0.767
##       38

10.3 What to Do When Assumptions Are Violated

If we test the assumptions regarding our data prior to running classical parametric
tests, and we find that the assumptions have been violated, there are a few options
for how to proceed.

First, we can collect more data. Although the Central Limit Theorem does not
support prescribing any specific sample size to alleviate our assumption woes, it
does imply that sample means based on larger sample sizes are more likely to be
normally distributed, so the normality assumption becomes less of a problem.

Second, we may transform the data, a practice encountered in the page on
Correlation and Regression. Transformations can be theory-based, as with variables
like sound amplitude and seismic intensity, where the nature of the variables
indicates that transformed data are better suited to parametric tests than
the raw data themselves. In other cases, transformations of data can reveal things
about the variables. More theory-agnostic methods are available, such as the Box-Cox
family of transformations9, but transformations without theoretical underpinnings
are harder to justify.

Third, and I think this is the best option, is to use a nonparametric test. We’ll talk
about lots of those.

Finally, as noted above: parametric tests tend to be robust against violations of the
assumptions. A type-II error tends to be more likely than a type-I error, so if the
result is significant, there’s probably no harm done. The danger is more in not
observing an effect that you would have observed were the data more in line with
the classical assumptions.

1. I had learned that the n = 30 guideline came from the work of Gosset and
Fisher – and printed that in a book – but after re-researching there’s not a lot
of hard evidence that either Gosset or Fisher explicitly told anybody to
design experiments with a minimum n of 30. The only source I can find that
cited a possible source for the origin of the claim linked to either Gosset or
(noted dick) Fisher is the Stats with Cats blog.↩

2. The resulting distribution from repeatedly sampling single values from a
normal distribution is the same as the original distribution mathematically
speaking. I don’t know if it would be the same philosophically speaking or if
that would be some kind of Ship of Theseus deal.↩

3. One of the central tenets of classical statistics is that we never know
precisely what distribution observed data come from.↩

4. Why $k - 1$? The $\chi^2$ test measures membership of observations in k cells.
Given the overall n of values, if we know the frequency of values in $k - 1$
cells, then we know how many values belong to the kth cell.↩

5. Just in case you encounter the term Lilliefors test, the Lilliefors test isn’t
quite the same as the Kolmogorov-Smirnov test, but close enough that the
tests are often considered interchangeable.↩

6. Saying that the Shapiro-Wilk test is based on “correlations” is understating
the complexity of the test. A full description of the mechanics of the test – one
that is much better than I could provide – is here.↩
7. The ratios of two variances are distributed as $F$-distributions. The maximum
possible ratio between any two sample variances is the ratio of the largest
variance to the smallest variance (the ratio of any other pair of variances
would necessarily be smaller). Hence, the term $F_{max}$.↩

8. The df in this case is the n per sample minus 1. When assessing a sample
variance, the df is based on how the sample mean (which is a part of the
sample variance calculation $s^2 = \frac{\sum(x - \bar{x})^2}{n - 1}$) is calculated. When determining
cell membership as for the $\chi^2$ test, if we know how many of n values are in
$k - 1$ cells, we then know how many values are in the kth cell; when
determining group means, if we know the sample mean and $n - 1$ of the
numbers in the group, then we know the value of the nth number, so
$df = n - 1$. The underlying idea is the same, but the equations are different
based on the needs of the test.↩

9. If you think it’s funny that Box and Cox collaborated on papers, you’re not
alone: George Box and David Cox thought the same thing and that’s why they
decided to work together.↩
11 Differences Between Two Things
11.1 Classical Parametric Tests of the Differences Between Two
Things: t-tests
In the simplest terms I can think of, the t-test helps us analyze the difference between two things
that are measured with numbers. There are three main types of t-test –

1. The One-sample t-test: differences between a sample mean and a single numeric value

2. The Repeated-measures t-test: differences between two measurements of the same (or
similar) entities

3. The Independent-groups t-test: differences between two sample means

– but they are all based on the same idea:

What is the difference between two things in terms of sampling error?

The first part of that question – what is the difference – is relatively straightforward. When we
are comparing two things, the most natural question to ask is how different they are. The
numeric difference between two things is half of the t-test.

The other half – in terms of sampling error – is a little trickier. Please think briefly about a
point made on the page on categorizing and summarizing information: if we could measure the
entire population, we would hardly need statistical testing at all. Because that is – for all
intents and purposes – impossible, we instead base analyses on subsets of the population –
samples – and contextualize differences based on a combination of the variation in the
measurements and the size of the samples.

That combination of variation in measurements and sample size is captured in the sampling
error or, equivalently, the standard error (the terms are interchangeable). The standard error is
the standard deviation of sample means – it’s how we expect the means of our samples drawn
from the same population-level distribution to differ from each other – more on that to come
below. The t-statistic – regardless of which of the three types of t-test is being applied to the
data – is the ratio of the difference between two things and the expected differences between
them:

$$t = \frac{\text{a difference}}{\text{sampling error}}$$

The distinctions between the three main types of t-test come down to how we calculate the
difference and how we calculate the sampling error.
The difference in the numerator of the t formula is always a matter of subtraction. To
understand the sampling error, we must go to the central limit theorem, first introduced in the
page on probability distributions.

11.1.1 The Central Limit Theorem

The central limit theorem (CLT) describes the distribution of means taken from a distribution
(any distribution, although we will be focusing on normal distributions with regard to t-tests).
It tells us that sample means ($\bar{x}$) are distributed as ($\sim$) a normal distribution N with a mean
equal to the mean of the distribution from which they were sampled (μ) and a variance equal to
the variance of the population distribution ($\sigma^2$) divided by the number of observations in each
sample n:

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

Figure ?? illustrates an example of the CLT in action: it’s a histogram of the means of one
million samples of n = 49 each taken from a standard normal distribution. The mean of the
sample means is approximately 0 – matching the mean of the standard normal distribution. The
standard deviation of the sample means is approximately 1/7 – the ratio of the standard
deviation of the standard normal distribution (1) and the square root of the size of each sample
(√49 = 7).
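Here is a sketch of how a simulation like the one in that figure might be run (the seed is arbitrary, and one million replications take a little while):

set.seed(77)
means <- replicate(1e6, mean(rnorm(49)))  # one million means of n = 49 standard-normal samples
mean(means)  # approximately 0
sd(means)    # approximately 1/7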

As annotated in Figure ??, the majority of sample means are going to fall within 1 standard
error of the mean of the sample means. Relatively few sample means are going to fall 3 or
more standard errors from the mean. To help describe what happens when sample means are
drawn, please consider the two following examples that describe the process of using the
smallest possible sample size – 1 – and the process of using the largest possible sample size –
the size of the entire population, represented by approximately infinity – respectively.

If n = 1, then the distribution of the sample means – which, since n = 1, wouldn’t really be
means so much as they would be the values of the samples themselves – would have a mean
$\mu_{\bar{x}} = \mu$ and a standard deviation $\sigma_{\bar{x}} = \sqrt{\sigma^2/1} = \sigma$. Thus, each observation would be a sample from
a normal distribution. As covered in the page on probability distributions, the probability of
any one of those observations falling into a range is determined by the area under the normal
curve. We know that the probability that a value sampled from a normal distribution is greater
than the mean of the distribution is 50%. We know that the probability that a value sampled
from a normal distribution is within one standard deviation of the mean is approximately 68%.
And – this is important and we will come back to it several times – we know that the
probability of sampling a value that is more than 1.645 standard deviations greater than the
mean is approximately 5%:

pnorm(1.645, lower.tail=FALSE)

## [1] 0.04998491
the probability of sampling a value that is more than 1.645 standard deviations less than the
mean is also approximately 5%:

pnorm(-1.645, lower.tail=TRUE)

## [1] 0.04998491

and that the probability of sampling a value that is more than 1.96 standard deviations either
greater than or less than the mean is also approximately 5%:

pnorm(-1.96, lower.tail=TRUE)+pnorm(1.96, lower.tail=FALSE)

## [1] 0.04999579

In classical statistical tests, the null hypothesis implies that the sample being studied is an
ordinary sample from a given distribution defined by certain parameters. The alternative
hypothesis is that the sample being studied was taken from different distribution, such as a
distribution of the same type but defined by different parameters (e.g., if the null hypothesis
implies sampling from a standard normal distribution, the alternative hypothesis may be that the
sample comes from another normal distribution with a different mean). We reject the null
hypothesis if the cumulative likelihood of observing the observed data given that the null
hypothesis is true is extraordinarily small. So, if we knew the mean and variance of a
population and happened to observe a single value (or the mean of a single value, which is the
value itself) that was several standard deviations away from the mean of the population that we
assume under the null that we are sampling from, then we may conclude that we should reject
the null hypothesis that that observation came from the null-hypothesis distribution in favor of
the alternative hypothesis that the observation came from some other distribution.

Figure 11.1: Buddy (Will Ferrell) learning to reject the null hypothesis that he is sampled from
a population of elves is a major plot point in the 2003 holiday classic Elf
If n = ∞, then the distribution of the sample means would have a mean equal to μ and a
variance of $\sigma^2/\infty = 0$.¹ That means that if you could sample the entire population and take
the mean, the mean you take would always be exactly the population mean. Thus, there would
be no possible difference between the mean of the sample you take and the population mean –
because the mean you calculated would necessarily be the population mean – and thus there
would be no sampling error. In statistical terms, the end result would be a sampling error or
standard error of 0.

With a standard error of 0, each mean we calculate would be expected to be the population
mean. If we sampled an entire population and we did not calculate the expected population
mean, then one of two things is true: either the expectation about the mean is wrong, or, more
interestingly, we have sampled a different population than the one we expected.

11.1.2 One-sample t-test


In theory, the one-sample t-test assesses the difference between a sample mean and a
hypothesized population mean. The idea is that a sample is taken, and the mean is measured.
The variance of the sample data is used as an estimate of the population variance, so that the
sample mean is considered to be – in the null hypothesis – one of the many possible means
sampled from a normal distribution with the hypothesized population mean μ and a variance
estimated from the sample variance ($\sigma^2 \approx s^2_{obs}$). The central limit theorem tells us that the
distribution of the sample means – of which the mean of our observed data is assumed to be
one in the null hypothesis – has a mean equal to the population mean ($\mu_{\bar{x}} = \mu$), a variance
equal to the population variance divided by the size of the sample ($\sigma^2_{\bar{x}} = \sigma^2/n$), and therefore
a standard deviation equal to the population standard deviation divided by the square root of
the size of the sample ($\sigma_{\bar{x}} = \sqrt{\sigma^2_{\bar{x}}} = \sigma/\sqrt{n}$, which is the standard error).

The t-statistic (or just t, if you’re into the whole brevity thing) measures how unusual the
observed sample mean is relative to the hypothesized population mean. It does so by measuring how far
away the observed sample mean is from the hypothesized population mean in terms of standard
errors (see the illustration in Figure ??).

The t-distribution models the distances of sample means from the hypothesized population
mean in terms of standard errors as a function of the degrees of freedom (df ) used in the
calculation of those sample means (i.e., n − 1). Smaller (in terms of n) samples from normal
distributions are naturally more likely to be far away from the mean – there is greater sampling
error for smaller samples – and the t-distribution reflects that by being more kurtotic when df
is small. That is, relative to t-distributions with large df , the tails of t-distributions for small
df are thicker and thus there are greater areas under the curve associated with extreme values

of t (see Figure ??, which is reproduced from the probability distributions page).

Of course, it is possible to observe any mean value for a finite sample2 taken from a normal
distribution. But, at some point, an observed sample mean can be so far from the mean and
therefore the probability of observing a sample mean at least that far away is so unlikely that
we reject the null hypothesis that the observed sample mean came from the population
described in the null hypothesis. We reject the null hypothesis in favor of the alternative
hypothesis that the sample came from some other population. We use the t-statistic to help us
make that decision.

The observed t that will help us make our decision is given by the formula:

$$t_{obs} = \frac{\bar{x} - \mu}{se}$$

where $\bar{x}$ is the observed sample mean, μ is the population mean posited in the null hypothesis,
and se is the standard error of the sample mean, $s_{obs}/\sqrt{n}$, which approximates the standard
deviation $\sigma/\sqrt{n}$ of all sample means of size n taken from the hypothesized normal distribution
(because we don’t really know what σ is in the population).

As noted above, μ is the mean of a hypothetical population, but in practice it can be any
number of interest. For example, if we were interested in whether the mean of a sample were
significantly different from zero ($> 0$, $< 0$, or $\ne 0$), we could put some variation of $\mu = 0$
($\le 0$, $\ge 0$, or $= 0$) in the null hypothesis to simulate what would happen if the population
mean were 0, even if we aren’t really thinking about 0 in terms of a population mean.

11.1.2.0.1 One-sample t Example

For our examples, let’s pretend that we are makers of large kitchen appliances. Let’s start by
making freezers. First, we will have to learn how freezers work and how to build them.

Looks easy enough! Now we need to test our freezers. Let’s say we have built 10 freezers and
we need to know that our sample of freezers produces temperatures that are significantly less
than 0 °C.3 Here, in degrees Celsius, are our observed data:

Freezer   Temperature (°C)
1         -2.14
2         -0.80
3         -2.75
4         -2.58
5         -2.26
6         -2.46
7         -1.33
8         -2.85
9         -0.93
10        -2.01
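So that the commands below can be run as written, here is a sketch defining the data vector (values transcribed, to two decimals, from the table above; outputs computed from this rounded vector may differ trivially from those shown, which came from unrounded data):

one.sample.data <- c(-2.14, -0.80, -2.75, -2.58, -2.26, -2.46, -1.33, -2.85, -0.93, -2.01)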

Now we will use the six-step hypothesis testing procedure to test the scientific hypothesis that
the mean of the internal temperatures of the freezers we built is significantly less than 0.
Oh, sorry, Nick Miller. First we must do the responsible thing and check the assumptions of the
t-test. Since we have only one set of data (meaning we don’t have to worry about

homoscedasticity), the only check we have to do is about normality:

shapiro.test(one.sample.data)

##
## Shapiro-Wilk normality test
##
## data: one.sample.data
## W = 0.89341, p-value = 0.1852

The Shapiro-Wilk test says we can continue to assume normality.

11.1.2.0.1.1 Six-Step Hypothesis Testing

1. Define null and alternative hypotheses.

For this freezer-temperature-measuring experiment, we are going to start by assuming that the
mean of our observed temperatures is sampled from a normal distribution with a mean of 0. We
don’t hypothesize a variance in this step: for now, the population variance is unknown (it will
be estimated by the sample variance in the process of doing the t-test calculations in step 5 of
the procedure).

Now, it will not do us any good – as freezer-makers – if the mean internal temperature of our
freezers is greater than 0 °C. In this case, a point (or two-tailed) hypothesis will not do,
because that would compel us to reject the null hypothesis if the mean were significantly less
than or greater than 0 °C. Instead we will use a directional (or one-tailed) hypothesis, where
our null hypothesis is that we have sampled our mean freezer temperatures from a normal
distribution of freezer temperatures with a population mean that is greater than or equal to 0,
and our alternative hypothesis is that we have sampled our mean freezer temperatures from a
normal distribution of freezer temperatures with a population mean that is less than 0:

$$H_0: \mu \ge 0$$

$$H_1: \mu < 0$$

2. Define the type-I error (false-alarm) rate α

Let’s say α = 0.05. Whatever.

3. Identify the statistical test to be used.

Since this is the one-sample t-test section of the page, let’s go with “one-sample t-test.”

4. Identify a rule for deciding between the null and alternative hypotheses

If the observed t indicates that the cumulative likelihood of the observed t or more extreme
unobserved t values is less than the type-I error rate (that is, if $p \le \alpha$), we will reject $H_0$ in
favor of $H_1$.

We can determine whether $p \le \alpha$ in two different ways. First, we can directly compare to α the
area under the t curve for $t \le t_{obs}$ – which is p because the null hypothesis is one-tailed and
we will reject $H_0$ if t is significantly less than the hypothesized population parameter μ. We
can accomplish that fairly easily using software, and that is what we will do (sorry for the
spoiler).

The second way we can determine whether p ≤ α is by using critical values of t. A critical
value table such as this one lists the values of t for which p is exactly α given the df and
whether the test is one-tailed or two-tailed. Any value of t with an absolute value greater than
the t listed in the table for the given α, df , and type of test necessarily has an associated p-
value that is less than α. The critical-value method is helpful if you (a) are in a stats class that
doesn’t let you use R on quizzes and tests, (b) are stranded on a desert island with nothing but
old stats books and for some reason need to conduct t-tests, and/or (c) live in 1955.

5. Obtain data and make calculations

The mean of the observed sample data is -2.01, and the standard deviation is 0.74. We
incorporate the assumption that the observed standard deviation is our best guess for the
population standard deviation in the equation to find the standard error:

$$se = \frac{sd_{obs}}{\sqrt{n}}$$

although, honestly, if you don’t care much for the theoretical underpinning of the t formula, it
suffices to say the se is the sd divided by the square root of n.

Our null hypothesis indicated that the μ of the population was 0, so that’s what goes in the
numerator of the $t_{obs}$ formula. Please note that it makes no difference at this point whether $H_0$
is $\mu \le 0$, $\mu = 0$, or $\mu \ge 0$: the number associated with μ goes in the $t_{obs}$ formula; the equals
sign or inequality sign comes into play in the interpretation of $t_{obs}$.

Thus, the observed t is:

$$t_{obs} = \frac{\bar{x} - \mu}{se} = \frac{-2.01 - 0}{0.23} = -8.59$$
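It only takes a couple of lines to check that hand calculation in R (using the one.sample.data vector defined above):

se <- sd(one.sample.data) / sqrt(length(one.sample.data))
(mean(one.sample.data) - 0) / se  # about -8.59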

6. Make a decision

Now that we have an observed t, we must evaluate the cumulative likelihood of observing at
least that t in the direction(s) indicated by the type of test (one-tailed or two-tailed). Because
the alternative hypothesis was that the population from which the sample was drawn had a
mean less than 0, the relevant p-value is the cumulative likelihood of observing $t_{obs}$ or a
lesser (more negative) t.

Because n = 10 (there were 10 freezers for which we measured the internal temperature),
$df = 9$. We thus are looking for the lower-tail cumulative probability of $t \le t_{obs}$ given that
$df = 9$:

pt(-8.59, df=9, lower.tail=TRUE)

## [1] 6.241159e-06

That is a tiny p-value! It is much smaller than the α rate that we stipulated (α = 0.05). We
reject $H_0$.

OR: the critical t for α = 0.05 and $df = 9$ for a one-tailed test is 1.833. The absolute value of
$t_{obs}$ is |−8.59| = 8.59, which is greater than the critical t. We reject $H_0$.

OR: we could have skipped all of this and just used R from the jump:

t.test(one.sample.data, mu=0, alternative = "less")

##
## One Sample t-test
##
## data: one.sample.data
## t = -8.5851, df = 9, p-value = 6.27e-06
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf -1.580833
## sample estimates:
## mean of x
## -2.010017
but then we wouldn’t have learned as much.

11.1.3 Repeated-Measures t-test

The repeated-measures t-test is used when we have two measurements of the same things and
we want to see if the mean of the differences for each thing is statistically significant.
Mathematically, the repeated-measures t-test is the exact same thing as the one-sample t-test!
The only special thing about it is that we get the sample of data by subtracting, for each
observation, one measure from the other measure to get difference scores.

The repeated-measures t-test is just a one-sample t-test of difference scores.

11.1.3.0.1 Repeated-measures t Example

To generate another example, let’s go back to our imaginary careers as makers of large kitchen
appliances. This time, let’s make an oven. To test our oven, we will make 10 cakes, measuring
the temperature of each tin of batter in °C before it goes into the oven and then measuring the
temperature of each of the (hopefully) baked sponges after 45 minutes in the oven.

Here are the (made-up) data:

Cake   Pre-bake (°C)   Post-bake (°C)   Difference (Post − Pre)
1      20.83           100.87           80.04
2      19.72           98.58            78.86
3      19.64           109.09           89.44
4      20.09           121.83           101.74
5      22.25           122.78           100.53
6      20.83           111.41           90.58
7      21.31           103.96           82.65
8      22.50           121.81           99.31
9      21.17           127.85           106.68
10     19.57           115.17           95.60

Once we have calculated the difference scores – which for convenience we will abbreviate
with d – we have no more use for the original paired data4. The calculations proceed using d
precisely as they do for the observed data in the one-sample t-test.
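So that the commands below can be run as written, here is a sketch defining the paired measurements and the difference scores (values transcribed from the table above; outputs computed from these rounded values may differ trivially from those shown):

prebake <- c(20.83, 19.72, 19.64, 20.09, 22.25, 20.83, 21.31, 22.50, 21.17, 19.57)
postbake <- c(100.87, 98.58, 109.09, 121.83, 122.78, 111.41, 103.96, 121.81, 127.85, 115.17)
difference <- postbake - prebake  # the d scores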

11.1.3.1 Repeated Measures and Paired Samples

This is as good a time as any to pause and note that what I have been calling the repeated-
measures t-test is often referred to as the paired-samples t-test. Either name is fine! What is
important to note is that the measures in this test do not have to refer to the same individual. A
paired sample could be identical twins. It could be pairs of animals that have been bred to be
exactly the same with regard to some variable of interest (e.g., murine models). It could also
be individuals that are matched on some demographic characteristic like age or level of
education attained. From a statistical methods point of view, paired samples from different
individuals are treated mathematically the same as are paired samples from the same
individuals. Whether individuals are appropriately matched is a research methods issue.

Back to the math: the symbols and abbreviations we use are going to be specific to the
repeated-measures t-test, but the formulas are going to be exactly the same. We calculate the
observed t for the repeated-measures test the same way as we calculated the observed t for the
one-sample test, but with assorted d’s in the formulas to remind us that we’re dealing with
difference scores.
$$t_{obs} = \frac{\bar{d} - \mu_d}{se_d}$$

The null assumption is that we are drawing the sample of difference scores from a population
of difference scores with a mean equal to $\mu_d$. The alternative hypothesis is that we are
sampling from a distribution with a different $\mu_d$.

So, now we are completely prepared to apply the six-step procedure to …

Oh, thanks, Han Solo! We have to test the normality of the differences:

shapiro.test(difference)

##
## Shapiro-Wilk normality test
##
## data: difference
## W = 0.93616, p-value = 0.5111

We can continue to assume that the differences are sampled from a normal distribution.
11.1.3.1.0.1 Six-Step Hypothesis Testing

1. Identify the null and alternative hypotheses.

Let’s start with the scientific hypothesis that the oven will make the cakes warmer. Thus, we
are going to assume a null that the oven makes the cakes no warmer or possibly less warm,
with the alternative being that $\mu_d$ is any positive non-zero difference in the temperature of
the cakes.

$$H_0: \mu_d \le 0$$

$$H_1: \mu_d > 0$$

2. Identify the type-I error (false alarm) rate.

Again, α = 0.05, a type-I error rate that might be described as…

So, this time, let’s instead use α = 0.01

3. Identify the statistical test to be used.

Repeated-measures t-test.
4. Identify a rule for deciding between the null and alternative hypotheses

If $p < \alpha$ – which, as noted for the one-sample t-test, we can calculate with software or by
consulting a critical-values table, but really we’re just going to use software – reject $H_0$ in favor
of $H_1$.

5. Obtain data and make calculations

The mean of the observed sample data is 92.54, and the standard deviation is 9.77. As for the
one-sample test, we incorporate the assumption that the observed standard deviation of the
differences is our best guess for the population standard deviation of the differences in the
equation to find the standard error:

$$se_d = \frac{sd_{d(obs)}}{\sqrt{n}}$$

The observed t for these data is:

$$t_{obs} = \frac{\bar{d} - \mu_d}{se_d} = \frac{92.54 - 0}{3.09} = 29.97$$

Because the null hypothesis indicates that $\mu_d \le 0$, the p-value is the cumulative likelihood
of $t \ge t_{obs}$: an upper-tail probability. Thus, the observed p-value is:

pt(29.97, df=9, lower.tail=FALSE)

## [1] 1.252983e-10

The observed p-value is less than α = 0.01, so we reject the null hypothesis in favor of the
alternative hypothesis: the population-level mean difference between pre-bake batter and post-
bake sponge warmth is greater than 0.

Of course, we can save ourselves some time by using R. Please note that when using the
t.test() command with two arrays (as we do with the repeated-measures test and will again
with the independent-samples test), we need to note whether the samples are paired or not
using the paired = TRUE/FALSE option.

t.test(postbake, prebake, mu=0, alternative="greater", paired=TRUE)

##
## Paired t-test
##
## data: postbake and prebake
## t = 29.966, df = 9, p-value = 1.255e-10
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 86.88161 Inf
## sample estimates:
## mean of the differences
## 92.54283
Based on this analysis, our ovens are in good working order.

But that result only tells us that the mean increase in temperature was significantly greater
than 0. That is probably not good enough, even for imaginary makers of large kitchen
appliances.

Suppose, then, that instead of using $\mu_d = 0$ as our $H_0$, we used $\mu_d = 90$ as our $H_0$. That
would mean that we would be testing whether our ovens, on average, raise the temperature of
our cakes by more than 90 °C. We’ll skip all of the usual steps and just alter our R commands
by changing the value of mu:

t.test(postbake, prebake, mu=90, alternative="greater", paired=TRUE)

##
## Paired t-test
##
## data: postbake and prebake
## t = 0.82337, df = 9, p-value = 0.2158
## alternative hypothesis: true difference in means is greater than 90
## 95 percent confidence interval:
## 86.88161 Inf
## sample estimates:
## mean of the differences
## 92.54283

Thus, while the mean difference was greater than 90 °C, it was not significantly so.
A couple of things to note for that. The first is that the p-value changes because we changed the null
hypothesis – more evidence that the p-value is not the probability of the data themselves. The
other is that altering $\mu_d$ can be a helpful tool when assuming $\mu_d = 0$ – which is the norm (and
also the default in R if you leave out the mu= part of the t.test() command) – is not
scientifically interesting.

11.1.4 Independent-Groups t-test

The last type of t-test is the independent-groups (or independent-samples) t-test. The
independent-groups test is used when we have completely different samples. They don’t even
have to be the same size.5

When groups are independent, the assumption is that they are drawn from different populations
that have the same variance. The independent-samples t is a measure of the difference of the
means of the two groups. The distribution that the difference of the means is sampled from is
not one of a single population but of the combination of the two populations.

The rules of linear combinations of random variables tell us that the difference of two normal
distributions is a normal with a mean equal to the difference of the means of the two
distributions and a variance equal to the sum of the variances of the two distributions:
Figure 11.2: Linear (Subtractive) Combination of two Normal Distributions

And so the difference of the means is hypothesized to come from the combined distribution of
the differences of the means of the two populations.

11.1.4.1 Pooled Variance

The variances of the population distributions that each sample comes from are assumed to be the same: that’s the homoscedasticity assumption. When performing the one-sample and repeated-measures t-tests, we used the sample variances of the single sample and the difference scores, respectively, to estimate the populations from which those numbers came. But, with two groups, we will most likely have two sample variances (we could have the exact same sample variance in both groups, but that would be quite improbable). We can’t say that one sample comes from a population with a variance approximated by the variance of that sample and that the other sample comes from a population with a variance approximated by the variance of the other sample: that would imply that the populations have different variances, and we can’t have that, now, can we?

Instead, we treat the two sample variances together as estimators of the common variance
value of the two populations by calculating what is known as the pooled variance:
$$s^2_{pooled} = \frac{s^2_1(n_1 - 1) + s^2_2(n_2 - 1)}{n_1 + n_2 - 2}$$

The pooled variance, in practice, acts like a weighted average of the two sample variances: if the samples are of uneven size ($n_1 \neq n_2$), then multiplying each sample variance by $n - 1$ weights the variance of the larger sample more heavily than the variance of the smaller sample. In theory (in practice, too, I guess, but it’s a little harder to see), by multiplying the sample variances by their respective $n - 1$, the pooled variance takes the numerators of the sample variances – the sums of squares (SS) for each sample – adds them together, and creates a new variance:
$$s^2_{pooled} = \frac{\sum(x_1 - \bar{x}_1)^2 + \sum(x_2 - \bar{x}_2)^2}{n_1 + n_2 - 2}$$

The denominator of the pooled variance is the total degrees of freedom of the estimate: because there are two means involved in the calculation of the numerator – $\bar{x}_1$ to calculate the SS of sample 1 and $\bar{x}_2$ to calculate the SS of sample 2 – the total df is the total $n = n_1 + n_2$ minus 2.

Now, our null assumption is that the difference of the means comes from a distribution generated by the combination of the two distributions from which the two means were respectively sampled, with a hypothesized mean for the difference of means. The variance of that distribution is unknown at the time that the null and alternative hypotheses are determined.6 When we do estimate that variance, it will be the pooled variance. And, when we apply the central limit theorem, we will use a standard error that – based on the rules of linear combinations – is the square root of the sum of the squared standard errors of each sampling procedure:
$$\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2,\ \frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}\right)$$

And therefore our formula for the observed t for the independent-samples t-test is:

$$t_{obs} = \frac{\bar{x}_1 - \bar{x}_2 - \Delta}{\sqrt{\frac{s^2_{pooled}}{n_1} + \frac{s^2_{pooled}}{n_2}}}$$

where Δ is the hypothesized mean of the null-distribution of the difference of means.

Please note: in practice, as with the repeated-measures t-test, researchers rarely use a non-zero value for the mean of the null distribution ($\mu_d$ in the case of the repeated-measures test; $\Delta$ in the case of the independent-samples test). Still, it’s there if we need it.

The df that defines the t-distribution that our difference of sample means comes from is the
sum of the degrees of freedom for each sample:

$$df = (n_1 - 1) + (n_2 - 1) = n_1 + n_2 - 2$$

11.1.4.1.1 Independent-groups t Example

Now, let’s think up another example to work through the math. In this case, let’s say we have
expanded our kitchen-appliance-making operation to include small kitchen appliances, and we
have made two models of toaster: the Mark I and the Mark II. Imagine, please, that we want to
test if there is any difference in the time it takes each model to properly toast pieces of bread.
We test 10 toasters of each model (here $n_1 = n_2$; that doesn’t necessarily have to be true) and record the time it takes to finish the job:

Time to Toast (s)


Mark I Mark II
4.72 9.19
7.40 11.44
3.50 9.64
3.85 12.09
4.47 10.80
4.09 12.71
6.34 10.04
3.30 9.06
7.13 6.31
4.99 9.44

We are now ready to test the hypothesis that there is any difference in the mean toasting time
between the two models.

Oh, I almost forgot again! Thanks, Diana Ross! We have to test two assumptions for the
independent-samples t-test: normality and homoscedasticity.
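First, a quick setup so that the following code is self-contained: the toasting times from the table above, entered into the two vectors used below (values as printed in the table, rounded to two decimals).

Mark.I<-c(4.72, 7.40, 3.50, 3.85, 4.47, 4.09, 6.34, 3.30, 7.13, 4.99) # Mark I toasting times (s)
Mark.II<-c(9.19, 11.44, 9.64, 12.09, 10.80, 12.71, 10.04, 9.06, 6.31, 9.44) # Mark II toasting times (s)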

Mark.I.residuals<-Mark.I-mean(Mark.I)
Mark.II.residuals<-Mark.II-mean(Mark.II)

shapiro.test(c(Mark.I.residuals, Mark.II.residuals))

##
## Shapiro-Wilk normality test
##
## data: c(Mark.I.residuals, Mark.II.residuals)
## W = 0.94817, p-value = 0.3402

Good on normality!

library(car) # leveneTest() comes from the car package

model<-c(rep("Mark I", length(Mark.I)), rep("Mark II", length(Mark.II)))
times<-c(Mark.I, Mark.II)
independent.samples.df<-data.frame(model, times)
leveneTest(times~model, data=independent.samples.df)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1   0.186 0.6714
##       18

And good on homoscedasticity, too!

Now, we may proceed with the six-step procedure.

1. Identify the null and alternative hypotheses.

We are looking in this case for evidence that either the Mark I toaster or the Mark II toaster is
faster than the other. Logically, that means that we are interested in whether the difference
between the mean toasting times is significantly different from 0. Thus, we are going to assume
a null that indicates no difference between the population mean of Mark I toaster times and the
population mean of Mark II toaster times.

$$H_0: \mu_1 - \mu_2 = 0$$

$$H_1: \mu_1 - \mu_2 \neq 0$$

2. Identify the type-I error (false alarm) rate.

The type-I error rate we used for the repeated-measures example – α = 0.01 – felt good. Let’s
use that again.

3. Identify the statistical test to be used.

Independent-groups t-test.
4. Identify a rule for deciding between the null and alternative hypotheses

If p < α – which again we can calculate with software or by consulting a critical-values table, but there’s no need to get tables involved here in the year 2020 – reject $H_0$ in favor of $H_1$.

5. Obtain data and make calculations

The mean of the Mark I sample is 4.98, and the variance is 2.19. The mean of the Mark II
sample is 10.07, and the variance is 3.33.

The pooled variance for the two samples is:


$$s^2_{pooled} = \frac{s^2_1(n_1 - 1) + s^2_2(n_2 - 1)}{n_1 + n_2 - 2} = \frac{2.19(10 - 1) + 3.33(10 - 1)}{10 + 10 - 2} = 2.76$$

The observed t for these data is:


$$t_{obs} = \frac{\bar{x}_1 - \bar{x}_2 - \Delta}{\sqrt{\frac{s^2_{pooled}}{n_1} + \frac{s^2_{pooled}}{n_2}}} = \frac{4.98 - 10.07 - 0}{\sqrt{\frac{2.76}{10} + \frac{2.76}{10}}} = -6.86$$
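Before we get to the p-value: if you’d like to check that arithmetic in R, here is a minimal sketch reusing the Mark.I and Mark.II vectors (the object names s2.pooled and t.obs are mine):

n1<-length(Mark.I)
n2<-length(Mark.II)
s2.pooled<-(var(Mark.I)*(n1-1)+var(Mark.II)*(n2-1))/(n1+n2-2) # pooled variance, per the formula above
t.obs<-(mean(Mark.I)-mean(Mark.II)-0)/sqrt(s2.pooled/n1+s2.pooled/n2) # observed t
c(s2.pooled, t.obs) # approximately 2.76 and -6.86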

Because this is a two-tailed test, the p-value is the sum of the cumulative likelihood of $t \leq -|t_{obs}|$ or $t \geq |t_{obs}|$ – that is, the sum of the lower-tail probability that a t could be less than or equal to the negative version of $t_{obs}$ and the upper-tail probability that a t could be greater than or equal to the positive version of $t_{obs}$:

pt(-6.86, df=18, lower.tail=TRUE)+pt(6.86, df=18, lower.tail=FALSE)

## [1] 2.033668e-06

And all that matches what we could have done much more quickly and easily with the
t.test() command. Note in the following code that paired=FALSE – otherwise, R would run
a repeated-measures t-test – and that we have included the option var.equal=TRUE, which
indicates that homoscedasticity is assumed. Assuming homoscedasticity is not the default
option in R: more on that below.

t.test(Mark.I, Mark.II, mu=0, paired=FALSE, alternative="two.sided", var.equal=TRUE)

##
## Two Sample t-test
##
## data: Mark.I and Mark.II
## t = -6.8552, df = 18, p-value = 2.053e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.654528 -3.532506
## sample estimates:
## mean of x mean of y
## 4.979966 10.073482
11.1.4.2 Welch’s t-test

The default option for the t.test() command when paired=FALSE is indicated is
var.equal=FALSE. That means that without instructing the software to assume
homoscedasticity, the default is to assume different variances. This default test is known as
Welch’s t-test. Welch’s test differs from the traditional, homoscedasticity-assuming t-test in
two ways:

1. The pooled variance is replaced by separate population variance estimates based on the
sample variances. The denominator for Welch’s t is therefore:

$$\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}$$

2. The degrees of freedom of the t-distribution are adjusted to compensate for the differences in variance. The degrees of freedom for the Welch’s test are not $n_1 + n_2 - 2$, but rather:

$$df \approx \frac{\left(\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}\right)^2}{\frac{s^4_1}{n^2_1(n_1 - 1)} + \frac{s^4_2}{n^2_2(n_2 - 1)}}$$

Really, what you need to know there is that Welch’s test uses a different df than the traditional independent-samples test.
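To see that df in action, here is a minimal sketch of the Welch formula in R, reusing the toaster vectors (the object names are mine):

se2.1<-var(Mark.I)/length(Mark.I) # squared standard error of the Mark I mean
se2.2<-var(Mark.II)/length(Mark.II) # squared standard error of the Mark II mean
(se2.1+se2.2)^2/(se2.1^2/(length(Mark.I)-1)+se2.2^2/(length(Mark.II)-1)) # approximately 17.27, matching the output below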

Repeating, then, the analysis of the toasters with var.equal=TRUE removed, the result of the
Welch test is:

t.test(Mark.I, Mark.II, mu=0, paired=FALSE, alternative="two.sided")

##
## Welch Two Sample t-test
##
## data: Mark.I and Mark.II
## t = -6.8552, df = 17.27, p-value = 2.566e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.659274 -3.527760
## sample estimates:
## mean of x mean of y
## 4.979966 10.073482

Note that for these data where homoscedasticity was observed, the $t_{obs}$ is the same. The df is a non-integer value, but close to the value of $df = 18$ for the traditional independent-samples test, and the p-value is essentially the same.

The advantage of Welch’s test is that it accounts for possible violations of homoscedasticity. I
don’t really see a downside except that you may have to explain it a tiny bit in a write-up of
your results.
11.1.4.2.1 Notes on the t-test

1. Although I chose one or the other for each of the above examples, any type of t-test can be
either a one-tailed test or a two-tailed test. The direction of the hypothesis depends only
on the nature of the question one is trying to answer, not on the structure of the data.

2. Some advice for one-tailed tests: always be sure to keep track of your signs. For the one-sample test, that is relatively easy: keep in mind whether you are testing whether the sample mean is supposed to be greater than the null value or less than the null value. For the repeated-measures and independent-groups tests, be careful which values you are subtracting from which. Which measurement you subtract from which in the repeated-measures test and which mean you subtract from which in the independent-groups test is arbitrary. However, it can be shocking if, for example, you expect scores to increase from one measurement to another and they appear to decrease, but only because you subtracted what was supposed to be bigger from what was supposed to be smaller (or vice versa).

3. Related to note 2: assuming that the proper subtractions have been made, if you have a
directional hypothesis and the result is in the wrong direction, it cannot be statistically
significant. If, for example, the null hypothesis is μ ≤ 0 and the t value is negative, then
one cannot reject the null hypothesis no matter how big the magnitude of t. An
experiment testing a new drug with a directional hypothesis that the drug will make things
better is not successful if the drug makes things waaaaaaaay worse.

11.1.5 t-tests and Regression

Check this out:

Let’s run a linear regression on our toaster data where the predicted variable (y) is toasting
time and the predictor variable (x) is toaster model. Note the t-value on the “modelMark.II”
line:

toaster.long<-data.frame(model=c(rep("Mark.I", 10), rep("Mark.II", 10)), times=c(Mark.I, Mark.II))

summary(lm(times~model, data=toaster.long))

##
## Call:
## lm(formula = times ~ model, data = toaster.long)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7588 -0.9213 -0.3457 1.3621 2.6393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9800 0.5254 9.479 2.02e-08 ***
## modelMark.II 5.0935 0.7430 6.855 2.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.661 on 18 degrees of freedom
## Multiple R-squared: 0.7231, Adjusted R-squared: 0.7077
## F-statistic: 46.99 on 1 and 18 DF, p-value: 2.053e-06

That t-value is the same as the t that we got from the t-test (but with the sign reversed), and the p-value is the same as well. The t-test is a special case of regression: we’ll come back to that later.

11.1.6 Nonparametric Tests of the Differences Between Two Things

Nonparametric Tests evaluate the pattern of observed results rather than the numeric
descriptors (e.g., summary statistics like the mean and the standard deviation) of the results.
Where a parametric test might evaluate the cumulative likelihood (given the null hypothesis, of
course) that people in a study improved on average on a measure following treatment, a
nonparametric test might take the same data and evaluate the cumulative likelihood that x out of
n people improved on the same measure following treatment.

Three examples of nonparametric tests were covered in correlation and regression: the
correlation of ranks ρ, the concordance-based correlation τ , and the categorical correlation γ.
In each of those tests, what matters is not the relative values of paired observations with regard to the mean and standard deviation of the variables but the relative patterns of the paired observations.

11.1.7 Nonparametric Tests for 2 Independent Groups

11.1.7.1 The $\chi^2$ Test of Statistical Independence

Statistical Independence refers to a state where the patterns of the observed data do not
depend on the number of possibilities for the data to be arranged. That means that there is no
relationship between the number of categories that a set of datapoints can be classified into and
the probability that the datapoints will be classified in a particular way. By contrast, statistical
dependence is a state where there is a relationship between the possible data structure and the
observed data structure.

The $\chi^2$ test of statistical independence takes statistical dependence as its null hypothesis and statistical independence as its alternative hypothesis. It is essentially the same test as the $\chi^2$ goodness-of-fit test, but used for a wider variety of applied statistical inference (that is, beyond evaluating goodness-of-fit). $\chi^2$ tests of statistical independence can be categorized by the number of factors being analyzed in a given test: in this section, we will talk about the one-factor case – one-way $\chi^2$ tests – and the two-factor case – two-way $\chi^2$ tests.

11.1.7.1.1 One-way $\chi^2$ Test of Statistical Independence

In the one-way $\chi^2$ test, statistical independence is determined on the basis of the number of possible categories for the data.7


To help illustrate, please imagine that 100 people who are interested in ordering food open up
a major third-party delivery service website and see the following:

There are two choices, neither of which is accompanied by any type of description. We would
expect an approximately equal number of the 100 hungry people to pick each option. Given that
there is no compelling reason to choose one over the other, the choice responses of the people
are likely to be statistically dependent on the number of choices. We would expect
approximately $\frac{n}{2} = 50$ people to order from Restaurant A and approximately $\frac{n}{2} = 50$ people to order from Restaurant B: statistical dependence means we can guess based solely on the possible options.

Now, let’s say our 100 hypothetical hungry people open the same third-party food-delivery
website and instead see these options:

Given that the titles of each restaurant are properly descriptive and not ironic, this set of
options would suggest that people’s choices will not depend solely on the number of options.
The choice responses of the 100 Grubhub customers would likely follow a non-random
pattern. Their choices would likely be statistically independent of the number of possible
options.
As in the goodness-of-fit test, the $\chi^2$ test of statistical independence uses an observed $\chi^2$ test statistic that is calculated based on observed frequencies ($f_o$, although it’s sometimes abbreviated $O$) and on expected frequencies ($f_e$, sometimes $E$):

$$\chi^2_{obs} = \sum_{i=1}^{k} \frac{(f_o - f_e)^2}{f_e}$$

where k is the number of categories or cells that the data can fall into. For the one-way test, the
expected frequencies for each cell are the total number of observations divided by the number
of cells:
$$f_e = \frac{n}{k}$$

The expected frequencies do not have to be integers! In fact, unless the number of observations
is a perfect multiple of the number of cells, they will not be.

11.1.7.1.2 Degrees of Freedom

The degrees of freedom ($df$, sometimes abbreviated with the Greek letter ν [“nu”]) for the one-way $\chi^2$ test is the number of cells $k$ minus 1:

$$df = k - 1$$

Generally speaking, the degrees of freedom for a set of frequencies (as we have in the data that can be analyzed with the $\chi^2$ test) are the number of cells whose count can change while maintaining the same marginal frequency. For a one-way, two-cell table of data with A observations in the first cell and B observations in the second cell, the marginal frequency is A + B:

cell   cell   margin
A      B      A + B

thus, we can think of the marginal frequencies as totals on the margins of tables comprised of
cells. If we know A, and the marginal frequency A + B is fixed, then we know B by
subtraction. The observed frequency of A can change, and then (given fixed A + B) we would
know B; B could change, and then we would know A. We cannot freely change both A and B
while keeping A + B constant; thus, there is 1 degree of freedom for the two-cell case.

As covered in probability distributions, df is the sufficient statistic for the $\chi^2$ distribution: it determines both the mean ($df$) and the variance ($2df$). Thus, the df are all we need to know to calculate the area under the $\chi^2$ curve at or above the observed $\chi^2$ statistic: the cumulative likelihood of the observed or more extreme unobserved $\chi^2$ values. The $\chi^2$ distribution is a one-tailed distribution and the $\chi^2$ test is a one-tailed test (the alternative hypothesis of statistical independence is a binary thing – there is no such thing as either negative or positive statistical independence), so that upper-tail probability is all we need to know.
As the $\chi^2$ test is a classical inferential procedure (albeit one that makes an inference on the pattern of observed data and no inference on any given population-level parameters), it observes the traditional six-step procedure. The null hypothesis of the $\chi^2$ test – as noted above – is always that there is statistical dependence and the alternative hypothesis is always that there is statistical independence. Whatever specific form dependence and independence take, respectively, depends (no pun intended) on the situation being analyzed. The α-rate is set a priori, the test statistic is a $\chi^2$ value with $k - 1$ degrees of freedom, and the null hypothesis will be rejected if p ≤ α.

To make the calculations, it is often convenient to keep track of the expected and observed frequencies in a table resembling this one:

$f_e$   $f_e$
$f_0$   $f_0$

We then determine whether the cumulative likelihood of the observed $\chi^2$ value or more extreme $\chi^2$ values given the null hypothesis of statistical dependence – mathematically, statistical dependence looks like $\chi^2 \approx 0$ – is less than or equal to the predetermined α rate, either by using a table of $\chi^2$ quantiles or, better, software like R’s pchisq() function.

Let’s work through an example using the fictional choices of 100 hypothetical people between
the made-up restaurants Dan’s Delicious Dishes and Homicidal Harry’s House of Literal
Poison. Suppose 77 people choose to order from Dan’s and 23 people choose to order from
Homicidal Harry’s. Under the null hypothesis of statistical dependence, we would expect an
equal frequency in each cell: 50 for Dan’s and 50 for Homicidal Harry’s:

fe1<-c("$f_e=50$", "")
fo1<-c("", "$f_0=77$")
fe2<-c("$f_e=50$", "")
fo2<-c("", "$f_0=23$")

kable(data.frame(fe1, fo1, fe2, fo2), "html", booktabs=TRUE, escape=FALSE,
      col.names=rep("", 4)) %>% # the col.names argument was cut off in the original; blank names are an assumption
  kable_styling() %>%
  row_spec(1, italic=TRUE, font_size="small") %>%
  row_spec(2, font_size = "large") %>%
  add_header_above(c("Dan's"=2, "Harry's"=2))

Dan’s        Harry’s
$f_e = 50$   $f_e = 50$
$f_0 = 77$   $f_0 = 23$

The observed $\chi^2$ statistic is:

$$\chi^2_{obs} = \sum_{i=1}^{k} \frac{(f_o - f_e)^2}{f_e} = \frac{(77 - 50)^2}{50} + \frac{(23 - 50)^2}{50} = 14.58 + 14.58 = 29.16$$

The cumulative probability of $\chi^2 \geq 29.16$ for a $\chi^2$ distribution with $df = 1$ is:

pchisq(29.16, df=1, lower.tail=FALSE)

## [1] 6.66409e-08

which is smaller than an α of 0.05, or an α of 0.01, or really any α that we would choose.
Thus, we reject the null hypothesis of statistical dependence in favor of the alternative of
statistical independence. In terms of the example: we reject the null hypothesis that people are
equally likely to order from Dan’s Delicious Dishes or from Homicidal Harry’s House of
Literal Poison in favor of the alternative that there is a statistically significant difference in
people’s choices.

Performing the $\chi^2$ test in R is blessedly simpler:
chisq.test(c(77, 23))

##
## Chi-squared test for given probabilities
##
## data: c(77, 23)
## X-squared = 29.16, df = 1, p-value = 6.664e-08

11.1.7.1.3 Relationship with the Binomial

Now, you might be thinking:

Wait a minute … the two-cell one-way $\chi^2$ test seems an awful lot like a binomial probability problem. What if we treated these data as binomial with s as the frequency of one cell and f as the frequency of the other with a null hypothesis of π = 0.5?

And it is very impressive that you might be thinking that, because the answer is:

A one-way two-cell $\chi^2$ test will result in approximately the same p-value as $p(s \geq s_{obs}|\pi = 0.5, N)$, where s is the greater of the two observed frequencies (or $p(s \leq s_{obs}|\pi = 0.5, N)$, where s is the lesser of the two observed frequencies). In fact, the binomial test is the more accurate of the two methods – the $\chi^2$ is more of an approximation – but the difference is largely negligible.

To demonstrate, we can use the data from the above example regarding fine dining choices:

pbinom(76, 100, 0.5, lower.tail=FALSE)

## [1] 2.75679e-08

The difference between the p-value from the binomial and from the $\chi^2$ is about $4 \times 10^{-8}$. That’s real small!

11.1.7.1.4 Using Continuous Data with the $\chi^2$ Test

The $\chi^2$ test assesses categories of data, but that doesn’t necessarily mean that the data themselves have to be categorical. In the goodness-of-fit version, for example, numbers in a dataset were categorized by their values: we can do the same for the test of statistical independence.

Suppose a statistics class of 12 students received the following grades – {85, 85, 92, 92, 93, 95, 95, 96, 97, 98, 99, 100} – and we wanted to know if their grade breakdown was significantly different from a 50-50 split between A’s and B’s. We could do a one-sample t-test based on the null hypothesis that μ = 90. Or, we could use a $\chi^2$ test where the expected frequencies reflect an equal number of students scoring above and below 90:

fe1<-c("$f_e=6$", "")
fo1<-c("", "$f_0=2$")
fe2<-c("$f_e=6$", "")
fo2<-c("", "$f_0=10$")

kable(data.frame(fe1, fo1, fe2, fo2), "html", booktabs=TRUE, escape=FALSE,
      col.names=rep("", 4)) %>% # the col.names argument was cut off in the original; blank names are an assumption
  kable_styling() %>%
  row_spec(1, italic=TRUE, font_size="small") %>%
  row_spec(2, font_size = "large") %>%
  add_header_above(c("B's"=2, "A's"=2))

B’s         A’s
$f_e = 6$   $f_e = 6$
$f_0 = 2$   $f_0 = 10$

The observed $\chi^2$ statistic would be:

$$\chi^2_{obs} = \frac{(2 - 6)^2}{6} + \frac{(10 - 6)^2}{6} = 5.33$$
The p-value – given that df = 1 – would be 0.021, which would be considered significant at
the α = 0.05 level.
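As a quick check on the hand calculation, the same one-way test via chisq.test():

chisq.test(c(2, 10)) # X-squared = 5.3333, df = 1, p-value of approximately 0.021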

11.1.7.1.5 The Two-Way $\chi^2$ Test

Often in statistical analysis we are interested in examining multiple factors to see if there is a relationship between things, like exposure and mutation, attitudes and actions, study and recall. The two-way version of the $\chi^2$ test of statistical independence analyzes patterns in category membership for two different factors. To illustrate: imagine we have a survey comprising two binary-choice responses: people can answer 0 or 1 to Question 1 and they can answer 2 or 3 to Question 2. We can organize their responses into a 2 × 2 (rows × columns) table known as a contingency table, where the responses to each question – the marginal frequencies of responses – are broken down contingent upon their answers to the other question.8

                     Question 2
                     2        3        Margins
Question 1  0        A        B        A + B
Question 1  1        C        D        C + D
Margins              A + C    B + D    n

Statistical independence in the one-way $\chi^2$ test was determined as a function of the number of possible options. Statistical independence in the two-way $\chi^2$ test suggests that the two factors are independent of each other, that is, that the number of possibilities for one factor is unrelated to the categorization of the other factor.

11.1.7.1.6 Degrees of Freedom for the 2-way $\chi^2$ Test

Just as the degrees of freedom for the one-way $\chi^2$ test were the number of cells in which the frequency was allowed to vary while keeping the margin total the same, the degrees of freedom for the two-way $\chi^2$ test are the number of cells that are free to vary in frequency while keeping both sets of margins – the margins for each factor – the same. Thus, we have a set of df for the rows of a contingency table and a set of df for the columns of a contingency table, and the total df is the product of the two:

$$df = (k_{rows} - 1) \times (k_{columns} - 1)$$

11.1.7.1.7 Expected Frequencies for the 2-way $\chi^2$ Test

Unlike in the one-way test, the expected frequencies across cells in the two-way $\chi^2$ test do not need to be equal: in most cases, they aren’t. It is possible – and not at all uncommon – for the different response levels of each factor to have different frequencies, which the two-way test accounts for. The expected frequencies instead are proportionally equal given the marginal frequencies. In the arrangement in the table above, for example, we do not expect A, B, C, and D to be equal to each other, but we do expect A and B to be proportional to A + C and B + D, respectively; A and C to be proportional to A + B and C + D, respectively; etc.

Thus, the expected frequency for each cell in a 2-way contingency table is:

$$f_e = n \times \text{row proportion} \times \text{column proportion}$$

For example, let’s say that we asked 140 people a 2-question survey:

1. Is a hot dog a sandwich?


2. Does a straw have one hole or two?

and that these are the observed frequencies:

How many holes does a straw have?


1 2 Margins
Is a hot dog a sandwich? Yes 40 10 50
Is a hot dog a sandwich? No 20 70 90
Margins 60 80 140

The expected frequencies, generated by the formula


f e = n × row proportion × column proportion for each cell, are presented in the
following table above and to the left of their respective observed frequencies:

How many holes does a straw have?


1 2 Margins
Is a hot dog a sandwich? Yes 21.43 28.57
Is a hot dog a sandwich? Yes 40 10 50
Is a hot dog a sandwich? No 38.57 51.43
Is a hot dog a sandwich? No 20 70 90
Margins 60 80 140

Note, for example, that the expected number of people to say “no” to the hot dog question and
“1” to the straw question is greater than the expected number of people to say “yes” to the hot
dog question and “1” to the straw question, even though the opposite was observed. That is
because more people overall said “no” to the hot dog question, so proportionally we expect
greater number of responses associated with either answer to the other question among those
people who (correctly) said that a hot dog is not a sandwich.
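In R, the whole table of expected frequencies can be generated at once; a minimal sketch (the object name observed is mine):

observed<-matrix(c(40, 20, 10, 70), nrow=2) # hot dog question (rows) by straw question (columns)
outer(rowSums(observed), colSums(observed))/sum(observed) # 21.43, 28.57 / 38.57, 51.43, as in the table above

The same values are also available after the fact as chisq.test(observed, correct=FALSE)$expected.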

The observed $\chi^2$ value for the above data is:

$$\chi^2_{obs}(1) = \frac{(40 - 21.43)^2}{21.43} + \frac{(10 - 28.57)^2}{28.57} + \frac{(20 - 38.57)^2}{38.57} + \frac{(70 - 51.43)^2}{51.43} = 43.81$$
The associated p-value – which we can get with pchisq(43.815, df=1, lower.tail=FALSE), using the more precise statistic value that R reports below – is approximately $3.61 \times 10^{-11}$: smaller than any reasonable α-rate, so we reject the null hypothesis that there is no relationship between people’s responses to the hot dog question and the straw question.

To perform the one-way $\chi^2$ test in R, we used the command chisq.test() with a vector of values inside the parentheses. The two-way test has an added dimension, so instead of a vector, we enter a matrix:

matrix(c(40, 20, 10, 70), nrow=2)

## [,1] [,2]
## [1,] 40 10
## [2,] 20 70

When calculating the $\chi^2$ test on a 2 × 2 matrix, R defaults to applying Yates’s Continuity Correction. Personally, I think you can skip it, although it doesn’t seem to do too much harm. We can turn off that default with the option correct=FALSE.

chisq.test(matrix(c(40, 20, 10, 70), nrow=2), correct=FALSE)

##
## Pearson's Chi-squared test
##
## data: matrix(c(40, 20, 10, 70), nrow = 2)
## X-squared = 43.815, df = 1, p-value = 3.61e-11

11.1.7.1.8 Beyond the 2-way $\chi^2$ Test

Having had a blast with the one-way $\chi^2$ test and the time of one’s life with the two-way $\chi^2$ test, one might be tempted to add increasing dimensions to the $\chi^2$ test…
Such things are mathematically doable, but not terribly advisable for two main reasons:

1. Adding dimensions increases the number of cells exponentially: a three-way test involves $x \times y \times z$ cells, a four-way test involves $x \times y \times z \times q$ cells, etc. In turn, that means that an exponentially rising number of observations is needed for reasonable analysis of the data. In a world of limited resources, that can be a dealbreaker.

2. The larger problem is one of scientific interpretation. It is relatively straightforward to explain a scientific hypothesis involving the statistical independence of two factors. It is far more difficult to generate a meaningful scientific hypothesis involving the statistical independence of three or more factors, especially if it turns out – as it can – that the $\chi^2$ test indicates statistical independence of all of the factors but not some subsets of the factors or vice versa.

So, the official recommendation here is to use factorial analyses other than the $\chi^2$ test should more than two factors apply to a scientific hypothesis.

11.1.7.1.8.1 Dealing with small $f_e$

As noted in our previous encounter with the $\chi^2$ statistic, the only requirement for the $\chi^2$ test is that $f_e \geq 5$. If the structure of the observed data is such that you would use a 1-way $\chi^2$ test except for the problem of $f_e < 5$, we can instead treat the data as binomial, testing the cumulative likelihood of $s \geq s_{obs}$ given that π = 0.5. If the structure indicates a 2-way $\chi^2$ test but $f_e$ is too small, then we can use the Exact Test.

11.1.7.2 Exact Test

The Exact Test – technically known as Fisher’s Exact Test, but since he was a historically shitty person I see no problem dropping his name from the title – is an alternative to the $\chi^2$ test of statistical independence to use when there is:

1. a 2 × 2 data structure and
2. inadequate $f_e$ per cell to use the $\chi^2$ test.

The exact test returns the cumulative likelihood of a given pattern of data given that the
marginal totals remain constant. To illustrate how the exact test returns probabilities, please
consider the following labels for a 2 × 2 contingency table:

           Factor 2           Margins
Factor 1   A        B         A + B
Factor 1   C        D         C + D
Margins    A + C    B + D     n

First, let’s consider the number of possible combinations for the row margins. Given that there
are n observations, the number of combinations of observations that could put A + B of those
observations in the top row – and therefore C + D observations in the second row – is given
by:

$${}_nC_r = \frac{n!}{(A + B)!\,(n - (A + B))!} = \frac{n!}{(A + B)!\,(C + D)!}$$

Next, let’s consider the arrangement of the row observations into column observations. The number of combinations of observations that lead to A observations in the first column is $A + C$ things combined $A$ at a time – ${}_{A+C}C_A$ – and the number of combinations of observations that lead to B observations in the second column is $B + D$ things combined $B$ at a time – ${}_{B+D}C_B$. Therefore, there are a total of ${}_{A+C}C_A \times {}_{B+D}C_B$ possible combinations that lead to the observed A (and thus C, because $A + C$ is constant) and the observed B (and thus D, because $B + D$ is constant), and the formula:

$$\frac{(A + C)!}{A!\,C!} \times \frac{(B + D)!}{B!\,D!} = \frac{(A + C)!\,(B + D)!}{A!\,B!\,C!\,D!}$$

describes the number of possible ways to get the observed arrangement of A, B, C, and D. Because, as noted above, there are $\frac{n!}{(A+B)!\,(C+D)!}$ total possible arrangements of the data, the probability of the observed arrangement is the number of ways to get the observed arrangement divided by the total possible number of arrangements:

$$p(\text{configuration}) = \frac{\frac{(A+C)!\,(B+D)!}{A!\,B!\,C!\,D!}}{\frac{n!}{(A+B)!\,(C+D)!}} = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{n!\,A!\,B!\,C!\,D!}$$

The p-value for the Exact Test is the cumulative likelihood of all configurations that are as
extreme or more extreme than the observed configuration.9

The extremity of configurations is determined by the magnitude of the difference between each
pair of observations that constitute a marginal frequency: that is: the difference between A and
B , the difference between C and D, the difference between A and C , and the difference
between B and D. The most extreme cases occur when the members of one group are split
entirely into one of the cross-tabulated groups, for example: if A + B = A because all
members of the A + B group are in A and 0 are in B. Less extreme cases occur when the
members of one group are indifferent to the cross-tabulated groups, for example: if
A + B ≈ 2A ≈ 2B because A ≈ B.

Consider the following three examples: two with patterns of data suggesting no relationship
between the two factors and another with a pattern of data suggesting a significant relationship
between the two factors.

Example 1: An ordinary pattern of data

How many holes does a straw have?


1 2 Margins
Is a hot dog a sandwich? Yes 4 3 7
Is a hot dog a sandwich? No 4 4 8
Margins 8 7 15

In this example, an approximately equal number of responses are given to both questions: 7
people say a hot dog is a sandwich, 8 say it is not; 8 people say a straw has 1 hole, 7 say it has
2. Further, the cross-tabulation of the two answers shows that people who give either answer
to one question have no apparent tendencies to give a particular answer to another question: the
people who say a hot dog is a sandwich are about even-odds to say a straw has 1 hole or 2;
people who say a hot dog is not a sandwich are exactly even-odds to say a straw has 1 or 2
holes (and vice versa).

The probability of the observed pattern of responses is:

$$p = \frac{(A + B)!\,(C + D)!\,(A + C)!\,(B + D)!}{n!\,A!\,B!\,C!\,D!} = \frac{7!\,8!\,8!\,7!}{15!\,4!\,3!\,4!\,4!} = 0.381$$
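If you’d rather not grind through those factorials by hand, here is a minimal R sketch of the configuration-probability formula (the helper name config.prob is mine):

config.prob<-function(A, B, C, D){
  n<-A+B+C+D
  # log-factorials keep the intermediate values from overflowing for larger tables
  exp(lfactorial(A+B)+lfactorial(C+D)+lfactorial(A+C)+lfactorial(B+D)-lfactorial(n)-lfactorial(A)-lfactorial(B)-lfactorial(C)-lfactorial(D))
}
config.prob(4, 3, 4, 4) # approximately 0.381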

Given that p = 0.381 is greater than any reasonable α-rate, calculating just the probability of the observed pattern is enough to know that the null hypothesis will not be rejected and we will continue to assume no relationship between the responses to the two questions. However, since this is more-or-less a textbook, and because there is no better place than a textbook to do things by the book, we will examine the patterns that are more extreme than the observed data given that the margins remain constant.

Here is one such more extreme pattern:

How many holes does a straw have?

                              1    2    Margins
Is a hot dog a sandwich? Yes  5    2    7
Is a hot dog a sandwich? No   3    5    8
Margins                       8    7    15

Note that the margins in the above table are the same as in the first table: A + B = 7, C + D = 8, A + C = 8, and B + D = 7. However, the cell counts that make up those margins are a little more lopsided: A and D are a bit larger than B and C. The probability of this pattern is:

$$p = \frac{(A + B)!\,(C + D)!\,(A + C)!\,(B + D)!}{n!\,A!\,B!\,C!\,D!} = \frac{7!\,8!\,8!\,7!}{15!\,5!\,2!\,3!\,5!} = 0.183$$

The next most extreme pattern is:

How many holes does a straw have?


1 2 Margins
Is a hot dog a sandwich? Yes 6 1 7
Is a hot dog a sandwich? No 2 6 8
Margins 8 7 15

The probability of this pattern is:

$$p = \frac{(A + B)!\,(C + D)!\,(A + C)!\,(B + D)!}{n!\,A!\,B!\,C!\,D!} = \frac{7!\,8!\,8!\,7!}{15!\,6!\,1!\,2!\,6!} = 0.03$$

Finally, the most extreme pattern possible given that the margins are constant is:

How many holes does a straw have?

                              1    2    Margins
Is a hot dog a sandwich? Yes  7    0    7
Is a hot dog a sandwich? No   1    7    8
Margins                       8    7    15

(A good sign that you have the most extreme pattern is that there is a 0 in at least one of the
cells.)

The probability of this pattern is:

$$p = \frac{(A + B)!\,(C + D)!\,(A + C)!\,(B + D)!}{n!\,A!\,B!\,C!\,D!} = \frac{7!\,8!\,8!\,7!}{15!\,7!\,0!\,1!\,7!} = 0.0012$$
The sum of the probabilities for the observed pattern and all more extreme unobserved patterns:

$$p = 0.381 + 0.183 + 0.030 + 0.001 = 0.595$$

is the p-value for a directional (one-tailed) hypothesis that there will be A or more
observations in the A cell. That sort of test is helpful if we have a hypothesis about the odds
ratio associated with observations being in the A cell, but isn’t super-relevant to the kinds of
problems we are investigating here. More pertinent is the two-tailed test (of a point
hypothesis) that there is any relationship between the two factors. To get that, we would repeat
the above procedure the other way (switching values such that the B and C cells get bigger)
and add all of the probabilities.

But, that’s a little too much work, even going by the book. Modern technology provides us with an easier solution. Using R, we can arrange the observed data into a matrix:

exact.example.1<-matrix(c(4, 4, 3, 4), nrow=2)
exact.example.1

## [,1] [,2]
## [1,] 4 3
## [2,] 4 4

and then run an exact test with the base command fisher.test(). To check our math from
above, we can use the option alternative = "greater":

fisher.test(exact.example.1, alternative="greater")

##
## Fisher's Exact Test for Count Data
##
## data: exact.example.1
## p-value = 0.5952
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 0.1602859 Inf
## sample estimates:
## odds ratio
## 1.307924

To test the two-tailed hypothesis, we can either use the option alternative = "two.sided" or – since two.sided is the default – just leave it out:

fisher.test(exact.example.1)

##
## Fisher's Exact Test for Count Data
##
## data: exact.example.1
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.1164956 15.9072636
## sample estimates:
## odds ratio
## 1.307924

Either way, the p-value is greater than any α-rate we might want to use, so we continue to
assume the null hypothesis that there is no relationship between the factors, in this case: that
responses on the two questions are independent.

Example 2: A Pattern Indicating a Significant Relationship

How many holes does a straw have?


1 2 Margins
Is a hot dog a sandwich? Yes 7 0 7
Is a hot dog a sandwich? No 0 8 8
Margins 7 8 15

This table represents the most extreme possible pattern of responses: everybody who said a hot dog was a sandwich also said that a straw has one hole, and nobody who said a hot dog was a sandwich said that a straw has two holes. There is no comparably extreme pattern on the other side of the responses: if A equalled 0 and B equalled 7, then C would have to be 7 and D would have to be 1 to keep the margins the same.

Thus, for this pattern, the one-tailed p-value for the observed data and the two-tailed p-value
are both the probability of the observed pattern:

exact.example.2<-matrix(c(7, 0, 0, 8), nrow=2)
exact.example.2

## [,1] [,2]
## [1,] 7 0
## [2,] 0 8

fisher.test(exact.example.2)

##
## Fisher's Exact Test for Count Data
##
## data: exact.example.2
## p-value = 0.0001554
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 5.83681 Inf
## sample estimates:
## odds ratio
## Inf

Example 3: A Pattern That Looks Extreme But Does Not Represent a Significant
Relationship
Finally, let’s look at a pattern that seems to be as extreme as it possibly can be:

How many holes does a straw have?

                              1    2    Margins
Is a hot dog a sandwich? Yes  0    0    0
Is a hot dog a sandwich? No   7    8    15
Margins                       7    8    15

In this observed pattern of responses, there is a big difference between A and C and between
B and D. Crucially, though, there is hardly any difference between A and B or between C and

D. Even though there are no moves we can make that maintain the same marginal numbers, the

pattern here is one where the responses for one question don’t depend at all on the responses
for the other question: in this set of data, people are split on the straw question but nobody
regardless of their answer to the straw question thinks that a hot dog is a sandwich10. The p-
value for the Exact Test accounts for the lack of dependency between the responses:

exact.example.3<-matrix(c(7, 0, 8, 0), nrow=2)
fisher.test(exact.example.3)

##
## Fisher's Exact Test for Count Data
##
## data: exact.example.3
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0 Inf
## sample estimates:
## odds ratio
## 0

11.1.7.2.1 Median Test

The core concept of the median test is: are there more values greater than the overall median of the data in one sample than in another sample? The idea is that if one sample of data has values that tend to be bigger than values in another sample – with the overall median observation as the guide to what is bigger and what is smaller – then there is a difference between the two samples.

The median test arranges the observed data into two categories: less than the median and
greater than the median. Values that are equal to the median can be part of either category –
it’s largely a matter of preference – so those category designations, more precisely, can either
be less than or equal to the median/greater than the median or less than the median/greater
than or equal to the median (the difference will only be noticeable in rare, borderline
situations). When those binary categories are crossed with the membership of values in one of
two samples, the result is a 2 × 2 contingency table: precisely the kind of thing we analyze
with a $\chi^2$ test of statistical independence or – if $f_e$ is too small for the $\chi^2$ test – the Exact Test.
To illustrate the median test, we will use two examples: one an example of data that do not
differ with respect to the median value between groups and another where the values do differ
between groups with respect to the overall median.

Example 1: an Ordinary Configuration

Imagine, if you will, the following data are observed for the dependent variable in an
experiment with a control condition and an experimental condition:

Median = 57

                           Control Condition             Experimental Condition
Below Median (< 57)        2, 12, 18, 23, 31, 35         4, 15, 16, 28, 44, 49
At or Above Median (≥ 57)  57, 63, 63, 66, 66, 69, 75    64, 67, 77, 84, 84, 85, 98, 100, 102

A plot of the above data with annotations for the overall median value (57, in this case),
illustrates the similarity of the two groups:

Figure 11.3: Control Group and Experimental Group Data Similar to Each Other With Respect
to the Overall Median of 57

Tallying up how many values are below the median and how many values are greater than or
equal to the median in each group results in the following contingency table:

       Control   Experimental
< 57      6           6
≥ 57      7           9
This arrangement lends itself nicely to a $\chi^2$ test of statistical independence with $df = 1$:

$$\chi^2_{obs}(1) = \frac{(6 - 5.57)^2}{5.57} + \frac{(6 - 6.43)^2}{6.43} + \frac{(7 - 7.43)^2}{7.43} + \frac{(9 - 8.57)^2}{8.57} = 0.1077,\ p = 0.743$$

The results of the $\chi^2$ test lead to the continued assumption of independence between category (less than/greater than or equal to the median) and condition (control/experimental), which we would interpret as there being no effect of being in the control vs. the experimental group with regard to the dependent variable.

Example 2: A Special Configuration

In the second example, there are more observed values in the control condition that are less
than the median than there are values greater than or equal to the median and more observed
values in the experimental condition that are greater than or equal to the median than there are
values less than the median:

Median = 57

                           Control Condition                       Experimental Condition
Below Median (< 57)        2, 4, 12, 15, 16, 18, 23, 28, 31, 35    44, 49
At or Above Median (≥ 57)  66, 69, 75                              57, 63, 63, 64, 66, 67, 77, 84, 84, 85, 98, 100, 102

Counting up how many values in each group are either less than or greater than or equal to the
overall median gives the following contingency table:

       Control   Experimental
< 57     10           2
≥ 57      3          13

which in turn leads to the following $\chi^2$ test result:

$$\chi^2_{obs}(1) = \frac{(10 - 5.57)^2}{5.57} + \frac{(2 - 6.43)^2}{6.43} + \frac{(3 - 7.43)^2}{7.43} + \frac{(13 - 8.57)^2}{8.57} = 11.50,\ p < .001$$
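That result can be confirmed with chisq.test(), with the continuity correction turned off as before (the object name median.table is mine):

median.table<-matrix(c(10, 3, 2, 13), nrow=2) # rows: below median, at or above median; columns: control, experimental
chisq.test(median.table, correct=FALSE) # X-squared of approximately 11.5, p of approximately 0.0007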

11.1.8 The Wilcoxon Mann-Whitney U Test

The Wilcoxon Mann-Whitney U Test11 is a more powerful nonparametric test of differences between independent groups than the Median Test. It is a direct comparison of the overlap of the values in two different datasets: in the most extreme case, all of the values in one group of data are greater than all of the values in the other group; if there is no difference between groups, there will be substantial overlap between the two groups.

The null hypothesis for the Wilcoxon-Mann-Whitney U test is that if we make an observation X
from population A and an observation Y from population B, it is equally probable that X is
greater than Y and that Y is greater than X:

$$H_0: p(X > Y) = p(Y > X)$$

and thus, the alternative hypothesis is that the probability that X is greater than Y is not equal to
the probability that Y is greater than X (or vice versa: the designation of X and Y is arbitrary):

$$H_1: p(X > Y) \neq p(Y > X)$$

The Wilcoxon-Mann-Whitney test statistic U is calculated thus:

Given two conditions A and B (in real life they will have real-life names like “control
condition” or “experimental condition”):

1. For each value in A, count how many values in B are smaller. Call this count $U_A$.

2. For each value in B, count how many values in A are smaller. Call this count $U_B$.

3. The lesser of $U_A$ and $U_B$ is $U$.

4. Test the significance of U using tables for small n, normal approximation for large n.

Note: the Wilcoxon-Mann-Whitney test ignores ties. If there are tied data, the observed p-value
will be imprecise (still usable, just imprecise).

For example, imagine the following data – which suggest negligible differences between the two groups – are observed:

11.1.8.0.1 Example of an Ordinary Configuration:


Control Condition Experimental Condition
12, 22, 38, 42, 50 14, 25, 40, 44, 48

We can arrange the data in overall rank order and mark whether each datapoint comes from the
Control (C) condition or the Experimental (E) condition:

Configuration of Observed
Data
C E C E C E C E E C
12 14 22 25 38 40 42 44 48 50

To calculate $U_C$ (for the control condition), we take the total number of values from the experimental condition that are smaller than each value of the control condition. There are 0 experimental-condition values smaller than the smallest value in the control condition (12), 1 that is smaller than the second-smallest value (22), 2 that are smaller than the third-smallest value (38), 3 that are smaller than 42, and 5 that are smaller than 50 for a total of $U_C = 11$.

To calculate $U_E$ (for the experimental condition), we take the total number of values from the control condition that are smaller than each value of the experimental condition. There is 1 control-condition value smaller than the smallest experimental-condition value (14), 2 that are smaller than 25, 3 that are smaller than 40, 4 that are smaller than 44, and also 4 that are smaller than 48 for a total of $U_E = 14$.

The smaller of $U_C$ and $U_E$ is $U_C = 11$. Given that n for both groups is 5, we can use a critical-value table to see that 11 is greater than the critical value for a two-tailed test, so we cannot reject the null hypothesis.12
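If you want to check the counting logic from steps 1–3 yourself, here is a minimal R sketch using the values from the table above (the object names are mine):

control<-c(12, 22, 38, 42, 50)
experimental<-c(14, 25, 40, 44, 48)
U.C<-sum(sapply(control, function(x) sum(experimental<x))) # 11
U.E<-sum(sapply(experimental, function(x) sum(control<x))) # 14
min(U.C, U.E) # U = 11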

We can also use software to calculate both the U statistic and the p-value (there is also a paired-samples version of the Wilcoxon test called the Wilcoxon Signed Rank Test; as with the t.test() command, we get the independent-samples version with the option paired=FALSE):

control<-c(12, 22, 38, 42, 50)
experimental<-c(14, 25, 40, 44, 48)
wilcox.test(control, experimental, paired=FALSE)

##
## Wilcoxon rank sum exact test
##
## data: control and experimental
## W = 11, p-value = 0.8413
## alternative hypothesis: true location shift is not equal to 0

Oh, and R calls U “W.” Pretty sure that’s for consistency with the output for the Wilcoxon
Signed Rank Test.

The following data provide an example of a pattern that does suggest a difference between
groups:
Control Condition Experimental Condition
12, 14, 22, 25, 40 38, 42, 44, 48, 50
Configuration of Observed
Data
C C C C E C E E E E
12 14 22 25 38 40 42 44 48 50

For these data, there is only one value of the experimental condition that is smaller than any of the control-condition values (38 is smaller than 40), and 24 control-condition values that are smaller than experimental-condition values, so the U statistic is equal to $U_C = 1$:

control<-c(12, 14, 22, 25, 40)
experimental<-c(38, 42, 44, 48, 50)
wilcox.test(control, experimental, paired=FALSE)

##
## Wilcoxon rank sum exact test
##
## data: control and experimental
## W = 1, p-value = 0.01587
## alternative hypothesis: true location shift is not equal to 0

11.1.8.0.1.1 Normal Approximation:

If either sample has n > 20, then the following formulas can be used to calculate a z-score that then can be used to get a p-value based on areas under the normal curve:

$$\mu = \frac{n_1 n_2}{2}$$

$$\sigma = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$$

$$z = \frac{U - \mu}{\sigma}$$

But: that is a pre-software problem that we don’t face anymore.

11.1.8.1 Randomization (Permutation) Test

The randomization test is the most powerful nonparametric test for analyzing two independent
samples of scale data.13 The central idea of the randomization test is: of all the possible
patterns of the data, how unusual is the observed pattern?

11.1.8.1.1 Example of an Ordinary Configuration:

Control Condition       Experimental Condition
12, 22, 38, 42, 50      14, 25, 40, 44, 48

There are 126 patterns that are as likely or less likely than this one.

$$\text{Number of possible permutations} = \frac{n!}{r!\,(n - r)!} = \frac{10!}{5!\,(10 - 5)!} = 252$$

$$p = \frac{126}{252} = .5,\ n.s.$$

11.1.8.1.2 Example of a Special Configuration:

Control Condition Experimental Condition


12, 14, 22, 25, 40 38, 42, 44, 48, 50

This is the second-least likely possible pattern of the observed data (the least likely would be
if 38 and 40 switched places).

$$p = \frac{2}{252} = .0079$$

To find the p-value for the randomization test, follow this algorithm (a worked example and an R sketch follow below):

1. Calculate the possible number of patterns in the observed data using the combinatorial formula14
2. Assign positive signs to one group of the data and negative signs to the other
3. Find the sum of the signed data. Call this sum D.
4. Switch the signs on individual data points and sum again to find all possible patterns that lead to greater absolute values of D
5. Count the unobserved patterns that lead to equal or greater absolute values of D
6. Add 1 to the count from step 5 for the observed pattern, then divide by the possible number of patterns calculated in step 1 to find the p-value

Control: negative          Experimental: positive
-12, -14, -22, -25, -40    38, 42, 44, 48, 50

D = −12 − 14 − 22 − 25 − 40 + 38 + 42 + 44 + 48 + 50 = 109

Are there more extreme patterns? Only one: switch the signs of 38 (from the experimental
condition) and 40 (from the control condition)

D = −12 − 14 − 22 − 25 − 38 + 40 + 42 + 44 + 48 + 50 = 113

In this example, there are two possible combinations that are equal to or more extreme than the
observed pattern.

There are 252 possible patterns of the data


The p-value of the observed pattern is 2/252 = .0079

∴ we reject the null hypothesis.
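As promised above, here is a minimal R sketch that enumerates every possible assignment and reproduces the count of 2 out of 252 (the object names are mine):

all.values<-c(12, 14, 22, 25, 40, 38, 42, 44, 48, 50) # control values first, then experimental
observed.D<-sum(c(38, 42, 44, 48, 50))-sum(c(12, 14, 22, 25, 40)) # 109

assignments<-combn(10, 5) # every way to put 5 of the 10 values in the positive-signed group
D<-apply(assignments, 2, function(idx) sum(all.values[idx])-sum(all.values[-idx]))
sum(D>=observed.D) # 2 patterns are as extreme or more extreme
mean(D>=observed.D) # 2/252 = .0079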

11.1.9 Nonparametric Tests for 2 Paired Groups

11.1.9.1 McNemar’s Test

McNemar’s test is a repeated-measures test for categorical (or categorized continuous) data. Just as the repeated-measures t-test measures the difference between two measures of the same entities (participants, animal subjects, etc.), McNemar’s test measures categorical change in the same entities between two measurements. Given two possible categorical states – for example: healthy and unwell, in favor and opposed, passing and failing – in either of which each entity can exist at the times of the two different measurements, McNemar’s test is a way to analyze the differences in the two measures as a function of how much change happened between the two measurements.

Measure 2
State 2 State 1
Measure 1 State 1 A B
Measure 1 State 2 C D

In this arrangement, A and D represent changes in the state: A represents the count of entities
that went from State 1 in the first measurement to State 2 in the second, and D represents the
count that went from State 2 in the first measurement to State 1 in the second.

If A + D > 20, the differences are measured using McNemar’s $\chi^2$:

$$\chi^2_{obs}(1) = \frac{(A - D)^2}{A + D}$$

The larger the difference between A and D – normalized by the sum A + D – the bigger the
difference between the two measurements. And, by extension, larger differences between A
and D indicate how important whatever came between the two measurements (a treatment, an
intervention, an exposure, etc.) was.

If A + D ≤ 20, use the binomial likelihood formula with π = 0.5:

$$p(s \geq \max(A, D)\,|\,\pi = 0.5,\ N = A + D) = \sum_{s=\max(A,D)}^{N} \frac{N!}{s!\,(N - s)!}\pi^s(1 - \pi)^{N - s}$$

To be clear: given that π = 0.5 for the McNemar test and the binomial distribution is therefore symmetrical, the equation:

$$p(s \leq \min(A, D)\,|\,\pi = 0.5,\ N = A + D) = \sum_{s=0}^{\min(A,D)} \frac{N!}{s!\,(N - s)!}\pi^s(1 - \pi)^{N - s}$$

will produce the same result.
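In R, that small-sample version is just a one-tailed binomial test; a minimal sketch, assuming hypothetical change counts of A = 8 and D = 3 (so A + D = 11 ≤ 20):

A<-8 # hypothetical count of State 1-to-State 2 changers
D<-3 # hypothetical count of State 2-to-State 1 changers
binom.test(max(A, D), A+D, p=0.5, alternative="greater") # p = 0.1133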

For example, please imagine students in a driver’s education class are given a pre-test on the
rules of right-of-way before taking the class and a post-test on the same content after taking the
class. In the first set of example data, there is little evidence of difference indicated between
the two conditions by the observed data (n = 54):

Post-test
Fail Pass
Pre-test Pass 15 8
Pre-test Fail 16 15

In this example, 15 students pass the pre-test and fail the post-test, and another 15 students fail the pre-test and pass the post-test. There are 24 students whose performance remains in the same category for both the pre- and post-test (8 pass both; 16 fail both) – those data provide no evidence either way as to the efficacy of the class.

The McNemar $\chi^2$ statistic – which we can use instead of the binomial because 15 + 15 > 20 – is:

$$\text{McNemar } \chi^2(1) = \frac{(15 - 15)^2}{15 + 15} = 0$$

Which we don’t need to bother finding the p-value for: it’s not in any way possibly significant.

In the second example data set, there is evidence that the class does something:

Post-test
Fail Pass
Pre-test Pass 25 8
Pre-test Fail 16 5

In these data, 25 students pass the pre-test but fail the post-test, and 5 fail the pre-test but pass the post-test. The McNemar $\chi^2$ statistic for this set of data is:

$$\text{McNemar } \chi^2(1) = \frac{(25 - 5)^2}{25 + 5} = \frac{400}{30} = 13.3$$
Measured in terms of a $\chi^2$ distribution with 1 degree of freedom – this might have been intuitable from the 2 × 2 structure of the data, but the McNemar is always a $df = 1$ situation – the observed p-value is:

pchisq(13.3, df=1, lower.tail=FALSE)

## [1] 0.0002654061

and so we would reject the null hypothesis had we set α = 0.05, or α = 0.01, or even α = 0.001. Based on these data, there is a significant effect of the class.

Reviewing these particular data, we could further interpret the results as indicating that the
class makes students significantly worse at knowing the rules of the road. Which in turn
suggests that this driver’s ed class was taught in the Commonwealth of Massachusetts.

Conducting the McNemar test in R is nearly identical to conducting the $\chi^2$ test in R. The one thing to keep in mind is that the mcnemar.test() command expects the cells to be compared to be on the reverse diagonal:

$$\begin{bmatrix} B & A \\ D & C \end{bmatrix}$$

but, it’s not much trouble to just switch the columns. Also, you may want to – as with the $\chi^2$ test – turn off the continuity correction with the option correct=FALSE:

mcnemar.test(matrix(c(8, 5, 25, 16), nrow=2), correct=FALSE)

##
## McNemar's Chi-squared test
##
## data: matrix(c(8, 5, 25, 16), nrow = 2)
## McNemar's chi-squared = 13.333, df = 1, p-value = 0.0002607
11.1.9.2 Sign (Binomial) Test

The Sign test, like McNemar’s test, is a measure of categorical change over 2 repeated
measures. In the case of the sign test, the categories are, specifically: positive changes and
negative changes. It takes the sign of the observed changes in the dependent variable – positive
or negative; zeroes are ignored – and treats them as binomial events with π = 0.5. In other
words, the sign test treats whether the difference between conditions is positive or negative as
a coin flip. For that reason, the sign test is often referred to as the binomial test, which is fine,
but I think there are so many things (including several on this page alone) that can be
statistically analyzed with the binomial likelihood function that that name can be a little
confusing.

The p-value for a binomial test is the cumulative binomial probability of the number of
positive values or negative values observed given the total number of trials that didn’t end up
as ties and π = 0.5. If there is a one-tailed hypothesis with an alternative that indicates that
more of the changes will be positive, then that cumulative probability is the probability of
getting at least as many positive changes out of the non-zero differences; if there is a one-tailed
hypothesis with an alternative that indicates that more of the changes will be negative, then the
cumulative probability is that of getting at least as many negative changes out of the non-zero
differences. If the hypothesis is two-tailed, then the p-value is the sum of the probability of
getting the smaller of the number of the positive changes and the negative changes or fewer plus
the probability of getting n – that number or greater (since the sign test is a binomial with π
=0.5, that’s more easily computed by taking the cumulative probability of the smaller of the
positive count and the negative count or fewer and multiplying it by two).
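To make that verbal description concrete, here is a small R sketch (mine, not the author's; the helper name sign.test.p is made up) of the two-tailed calculation:

sign.test.p <- function(pos, neg) {
  #take the smaller of the positive and negative counts, find the
  #cumulative binomial probability of that count or fewer, and double it
  2 * pbinom(min(pos, neg), pos + neg, 0.5)
}
sign.test.p(1, 9)   #0.02148438, matching the second worked example below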

The following is an example data set in which positive changes are roughly as frequent as
negative changes: a data set for which we would expect our sign test analysis to come up with
a non-significant result.

Observation    Before    After    Difference    Sign of Difference
     1             1         3          2                +
     2             8         5         -3                −
     3             0         2          2                +
     4             0        -1         -1                −
     5             1         1          0                =
     6             3        -5         -8                −
     7             3         7          4                +
     8             3       362        359                +
     9             5         1         -4                −
    10             6         0         -6                −
    11             5        10          5                +

Let’s use a one-tailed test with a null hypothesis that the number of positive changes will be
greater than or equal to the number of negative changes (π ≥ 0.5) and an alternative hypothesis
that the number of positive changes will be less than the number of negative changes (π < 0.5).
Of the n = 11 observations, one had a difference score of 0. We throw that out. Of the rest,
there are 5 positive differences and 5 negative differences. Therefore, we calculate the
cumulative binomial probability of 5 or fewer successes in 10 trials with π = 0.5:

pbinom(5, 10, 0.5)

## [1] 0.6230469

and find that the changes are not significant. The software solution in R is fairly
straightforward:

binom.test(5, 10, 0.5, alternative="less")

##
## Exact binomial test
##
## data: 5 and 10
## number of successes = 5, number of trials = 10, p-value = 0.623
## alternative hypothesis: true probability of success is less than 0.5
## 95 percent confidence interval:
## 0.0000000 0.7775589
## sample estimates:
## probability of success
## 0.5

For a counterexample, let’s use the following example data where there are many more
negative changes than positive changes:

Observation    Before    After    Difference    Sign of Difference
     1             1         3          2                +
     2             8         5         -3                −
     3             0        -2         -2                −
     4             0        -1         -1                −
     5             1         1          0                =
     6             3        -5         -8                −
     7             3        -7        -10                −
     8             3      -362       -365                −
     9             5        -1         -6                −
    10             6         0         -6                −
    11             5       -10        -15                −

In this case, there is still 1 observation with no change between measurements: we’ll toss that
and we are left with 10 observations with changes: 9 negative and 1 positive. This time, we
will use a 2-tailed test. The sign test p-value is therefore (keeping in mind that for an upper-tail
pbinom, we enter s − 1 for s):

pbinom(1, 10, 0.5)+pbinom(8, 10, 0.5, lower.tail=FALSE)


## [1] 0.02148438

Using binom.test() with the option alternative="two.sided" gives us the same result:

binom.test(1, 10, 0.5, alternative="two.sided")

##
## Exact binomial test
##
## data: 1 and 10
## number of successes = 1, number of trials = 10, p-value = 0.02148
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.002528579 0.445016117
## sample estimates:
## probability of success
## 0.1

11.1.9.3 Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is the paired-samples version of the Wilcoxon-Mann-Whitney
U test. It operates in much the same way as the sign test, but it is more powerful because it
takes into account the magnitude of the ranks of the differences. That is, a small negative
difference means less in the calculation of the test statistic than a large negative difference, and
likewise a small positive difference means less than a large positive difference. Thus,
significant values of the Wilcoxon test statistic W come from combinations of large and
frequent positive shifts or combinations of large and frequent negative shifts.

To calculate the Wilcoxon W statistic, we begin by taking differences (as we do with all of the
tests with two repeated measures). We note the sign of the difference: positive or negative, and
as in the sign test throwing out the ties. We also rank the magnitude of the differences from
smallest (1) to largest (n). If there are ties in the differences, each tied score gets the average
rank of the untied rank above and the untied rank below the tie cluster. For example, if a set of
differences is {10, 20, 20, 30}, 10 is the smallest value and gets rank 1; 30 is the largest value
and gets rank 4; and the two 20′s get the average rank (1 + 4)/2 = 2.5. Thus, the ranks would be
{1, 2.5, 2.5, 4}.

We then combine the ranks and the signs to get the signed-ranks. For example, if the observed
differences were {10, 20, 20, 30}, then the signed ranks would be {1, 2.5, 2.5, 4}. If the
observed differences were {−10, −20, −20, −30}, then the signed ranks would be
{−1, −2.5, −2.5, −4}: the ranks are based on magnitude (or, absolute value), so even though

−30 is the least of the four numbers, −10 is the smallest in terms of magnitude.
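As a quick aside, R's rank() function uses exactly that midrank convention for ties, so signed ranks can be computed in one line (a sketch of mine, not code from the text):

d <- c(-10, -20, -20, -30)
sign(d) * rank(abs(d))   #-1.0 -2.5 -2.5 -4.0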

Figure 11.4: Magnitude

For the one-tailed test, W is equal to the sum of the positive ranks. For the two-tailed test, the
W statistic is the greater of the sum of the magnitudes of the positive ranks and the sum of the
magnitudes of the negative ranks. If we denote the sum of the positive ranks as T⁺ and the sum
of the negative ranks as T⁻, then we can write the W formulae like this:

$$\text{One-tailed } W = T^+$$

$$\text{Two-tailed } W = \max(T^+, T^-)$$

To demonstrate the Wilcoxon signed-rank test in action, we can re-use the examples from the
sign test (note that in these tables the differences are computed as Before − After rather than
After − Before). First, the dataset that indicates no significant effect:

Observation    Before    After    Difference    Sign of Difference    Rank    Signed Rank
     1             1         3         -2                −              2.5        −2.5
     2             8         5          3                +              4.0        +4
     3             0         2         -2                −              2.5        −2.5
     4             0        -1          1                +              1.0        +1
     5             1         1          0                =              0.0         0
     6             3        -5          8                +              9.0        +9
     7             3       362       -359                −             10.0       −10
     8             3         7         -4                −              5.5        −5.5
     9             5         1          4                +              5.5        +5.5
    10             6         0          6                +              8.0        +8
    11             5        10         -5                −              7.0        −7

The sum of the positive signed-rank magnitudes is:

$$T^+ = 4 + 1 + 9 + 5.5 + 8 = 27.5$$

which is equal to W if we want the one-tailed test.

The sum of the negative signed-rank magnitudes is (note: we drop the negative signs to get the
sum of the magnitudes):

$$T^- = 2.5 + 2.5 + 5.5 + 10 + 7 = 27.5$$

and thus for the two-tailed test:

$$W = \max(27.5, 27.5) = 27.5$$

To test the significance of W statistics where n ≤ 30, we can look up a critical value in a table.

If n > 30, there is a normal approximation that will produce a z-score; the area under the
normal curve beyond that z-score gives the approximate p-value for the W statistic:

$$z = \frac{W - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24} - \frac{1}{2}\sum_{j=1}^{g} t_j(t_j - 1)(t_j + 1)}}$$

where g is the number of tie clusters in the ranks and t_j is the number of ties in each cluster j. I
honestly hope you never use that formula.

We can just use the wilcox.test() command with the paired=TRUE option.
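The Before and After vectors need to hold the values from the table above for that command to run; those assignments don't appear in the original code, so here they are for reproducibility:

Before <- c(1, 8, 0, 0, 1, 3, 3, 3, 5, 6, 5)
After <- c(3, 5, 2, -1, 1, -5, 7, 362, 1, 0, 10)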

wilcox.test(Before, After, paired=TRUE, correct = FALSE)

## Warning in wilcox.test.default(Before, After, paired = TRUE, correct = FALSE):
## cannot compute exact p-value with ties

## Warning in wilcox.test.default(Before, After, paired = TRUE, correct = FALSE):
## cannot compute exact p-value with zeroes

##
## Wilcoxon signed rank test
##
## data: Before and After
## V = 27.5, p-value = 1
## alternative hypothesis: true location shift is not equal to 0

I have no idea why R decides to call W “V”.

Now let’s apply the same procedure to a dataset that indicates evidence of a difference
between the Before and After measures:

Observation    Before    After    Difference    Sign of Difference    Rank    Signed Rank
     1             1         3         -2                −              2.5        −2.5
     2             8         5          3                +              4.0        +4
     3             0        -2          2                +              2.5        +2.5
     4             0        -1          1                +              1.0        +1
     5             1         1          0                =              0.0         0
     6             3        -5          8                +              7.0        +7
     7             3        -7         10                +              8.0        +8
     8             3      -362        365                +             10.0       +10
     9             5        -1          6                +              5.5        +5.5
    10             6         0          6                +              5.5        +5.5
    11             5       -10         15                +              9.0        +9

The sum of the negative signed-rank magnitudes (there is only one; as before, we drop the
negative sign to get the magnitude) is:

$$T^- = 2.5$$

The sum of the positive signed-rank magnitudes is:

$$T^+ = 4 + 2.5 + 1 + 7 + 8 + 10 + 5.5 + 5.5 + 9 = 52.5$$

and thus:

$$\text{One-tailed } W = T^+ = 52.5$$

$$\text{Two-tailed } W = \max(2.5, 52.5) = 52.5$$

To test the significance of W , we can just take the easier and more sensible route straight to the
software solution:
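Before is unchanged from the first example; After needs to be updated to the second table's values (again, an assignment not shown in the original code):

After <- c(3, 5, -2, -1, 1, -5, -7, -362, -1, 0, -10)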

wilcox.test(Before, After, paired=TRUE)

## Warning in wilcox.test.default(Before, After, paired = TRUE): cannot compute
## exact p-value with ties

## Warning in wilcox.test.default(Before, After, paired = TRUE): cannot compute
## exact p-value with zeroes

##
## Wilcoxon signed rank test with continuity correction
##
## data: Before and After
## V = 52.5, p-value = 0.0124
## alternative hypothesis: true location shift is not equal to 0

Note: R will always return the value of T⁺ as the test statistic, but the p-value is unaffected by
the difference.

11.1.9.4 Repeated-Measures Randomization Test


The repeated-measures randomization test works on the same basic principle as the
independent-groups randomization test: the observed scores – in the case of the repeated-
measures test, the observed differences – are taken as a given. The question is how unusual the
arrangement of those scores is. If the magnitudes of the positive changes and the negative
changes are about the same, then the pattern is nothing special. If the magnitudes of the positive
changes are in total much larger than the magnitudes of the negative changes or vice versa, then
the pattern is unusual and there is likely a significant effect.

The repeated-measures randomization test is also similar in principle to the Wilcoxon Signed-
Rank Test, in that it examines the relative sizes of the negative and the positive changes, but
different in that it deals with the observed magnitudes of the data themselves rather than their
ranks.

Again, we assume that the observed magnitudes of the differences are given, but that the sign on
each difference can vary. That means that the number of possible permutations of the data is 2ⁿ
(or, more accurately, $2^{n - \text{the number of ties}}$, because we ignore any ties). The paired-samples
randomization test is an analysis of how many patterns are as extreme as or more extreme than
the observed pattern; the p-value is the number of such patterns divided by the 2ⁿ possible
patterns.

To illustrate, please imagine that we observed the following data in a two-condition
psychological experiment:

library(knitr)
library(kableExtra)

Participant<-1:7
Condition1<-c(13,42,9,5,6,8,18)
Condition2<-c(5,36,2,0,9,7,9)
Difference<-Condition1-Condition2

kable(data.frame(Participant, Condition1, Condition2, Difference), "html", booktabs=TRUE) %>%
  kable_styling()

Participant Condition 1 Condition 2 Difference


1 13 5 8
2 42 36 6
3 9 2 7
4 5 0 5
5 6 9 -3
6 8 7 1
7 18 9 9

A relatively easy way to judge the extremity of the pattern of the data is to take the sum of the
differences and to compare it to the sums of other possible patterns. For this example, the sum
of the differences d_obs is:

$$\sum d_{obs} = 8 + 6 + 7 + 5 + (-3) + 1 + 9 = 33$$

Any more extreme pattern will have a greater sum of differences. If, in this example,
Participant 5’s difference was 3 instead of −3 and Participant 6’s difference was −1 instead
of 1, the sum would be:

$$\sum d = 8 + 6 + 7 + 5 + 3 + (-1) + 9 = 37$$

which indicates that that hypothetical pattern is more extreme than the observed pattern.

There is one more extreme possible pattern: the one where all of the observed differences are
positive:

$$\sum d = 8 + 6 + 7 + 5 + 3 + 1 + 9 = 39$$

Thus, there are three patterns that are either the observed pattern or more extreme patterns.
Given that there are 7 non-tied observations in the set, the number of possible patterns is
2⁷ = 128. Therefore our observed p-value is:

$$p_{obs} = \frac{3}{128} = 0.023$$

More specifically, that is the p-value for a one-tailed test wherein we expect most of the
differences to be positive in the alternative hypothesis. To get the two-tailed p-value, we
simply multiply by two (the idea being that the three most extreme patterns in the other
direction are equally extreme). In this case, the two-tailed p-value would be 6/128 = 0.047.
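Because n is small here, we can also verify the count by brute force, enumerating all 2⁷ sign patterns in R. This is a sketch of mine (not code from the text) that reuses the Difference vector defined above:

observed.sum <- sum(Difference)   #33
magnitudes <- abs(Difference)
#every possible assignment of signs to the magnitudes: a 128 x 7 grid
signs <- expand.grid(rep(list(c(-1, 1)), length(magnitudes)))
pattern.sums <- as.matrix(signs) %*% magnitudes
mean(pattern.sums >= observed.sum)       #3/128 = 0.0234: the one-tailed p-value
2 * mean(pattern.sums >= observed.sum)   #6/128 = 0.0469: the two-tailed p-value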

11.1.10 Confidence Intervals on Means Using t-statistics

Confidence intervals were developed as an alternative to null/alternative hypothesis testing
but are now used as a complement. It is now common to report both the results of statistical
hypothesis testing and a confidence interval, e.g., t(29) = 4.5, p < 0.001, 95% CI = [3.9, 5.1].

As noted in probability distributions, interval estimates can be calculated on all kinds of
distributions; confidence intervals can therefore be estimated for many different statistics,
including π (using the β distribution or the exact method), σ² (using the χ² distribution), and r
(using the t-distribution). Most commonly, confidence intervals are estimated for means.
Because the t-distribution represents the distribution of standardized sample means (i.e., over
repeated samples, standardized sample means are distributed in the shape of a t-distribution),
it is naturally used to calculate the confidence intervals associated with t-tests: intervals about
sample means, mean differences, and the differences between sample means.

A 1 − α confidence interval about a mean is the range that, over theoretical repeated sampling,
would capture the population-level parameter (1 − α)% of the time. It is famously not the range
in which we are (1 − α)% confident that the population parameter lies – that would be a purely
Bayesian way of looking at things. The term 1 − α connects the confidence interval to the α
rate familiar from null-hypothesis testing: framing the width of the confidence interval in terms
of α results in using t-values that place α% of the area under the t-curve outside of the
confidence interval and (1 − α)% of the area under the t-curve inside it. And just as α = 0.05 is
the most popular α-rate, 1 − 0.05 = 95% is the most popular confidence-interval width.

The generic formula for a 1 − α confidence interval is:

$$(1 - \alpha)\%\ CI = \bar{x} \pm t_{\alpha/2}\, se$$

where t_(α/2) is the value of the t-distribution for the relevant df that puts α/2 of the area under
the t-curve in the upper tail (and thus −t_(α/2) puts α/2 in the lower tail of the curve).

In the above equation, x̄ stands for whatever mean value is relevant: for a confidence interval
on a sample mean, x̄ itself is appropriate; for a confidence interval on a mean difference, we
substitute d̄ for x̄; and for a difference between means, we substitute x̄₁ − x̄₂ for x̄. Likewise,
the se can be se_x for a single group, se_d for a set of difference scores, and se_p to represent
the pooled standard error for the difference between sample means.

For example, let’s construct a 95% confidence interval on the single sample:

$$x = \{55, 60, 63, 67, 68\}$$

The mean of x is x̄ = 62.6 and the standard error of x is sd_x/√n = 5.32/2.24 = 2.38.
Given that n = 5, df = 4. The t value that puts α/2 = 0.025 in the upper tail of the t-
distribution with df = 4 is t_(α/2) = 2.78. Thus, the 95% confidence interval on the sample mean
is:

$$95\%\ CI = 62.6 \pm 2.78(2.38) = [55.99, 69.21]$$
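As a check of that hand calculation (my code, not the book's), the same interval comes out of qt() directly:

x <- c(55, 60, 63, 67, 68)
se <- sd(x)/sqrt(length(x))             #2.38
mean(x) + c(-1, 1)*qt(0.975, df=4)*se   #approximately [55.99, 69.21]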

As is usually the case, finding the confidence interval is easier with software. By default, the
t.test() command in R will return a 95% confidence interval on whatever mean (sample
mean, mean difference, or difference between means) is associated with the t-test. In this case,
the one-sample t-test produces a 95% confidence interval on the sample mean:

t.test(x)

##
## One Sample t-test
##
## data: x
## t = 26.313, df = 4, p-value = 1.24e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 55.99463 69.20537
## sample estimates:
## mean of x
## 62.6
If one is interested in a confidence interval of a different width – say, a 99% confidence
interval – one needs only to specify that with the conf.level option within the t.test()
command:

t.test(x, conf.level=0.99)

##
## One Sample t-test
##
## data: x
## t = 26.313, df = 4, p-value = 1.24e-05
## alternative hypothesis: true mean is not equal to 0
## 99 percent confidence interval:
## 51.64651 73.55349
## sample estimates:
## mean of x
## 62.6

11.1.10.1 Interpreting Confidence Intervals

Confidence intervals are good for evaluating the relative precision of two estimates. From the
relative width of two 1 − α confidence intervals, we can infer that the narrower interval is
based on data with some combination of lesser variance in the data and greater n.

Confidence intervals can also be used to make inferences. For example, if a 1 − α confidence
interval does not include 0, we can say that the mean described by that confidence interval is
significantly different from 0 at the α level. If two confidence intervals do not overlap, we may
say that the means described by those confidence intervals are significantly different from each
other. This feature results from the similarity between the construction of a 1 − α confidence
interval and a two-tailed t-test with a false alarm rate of α: because the very same t-
distributions are used in both, the confidence interval is functionally equivalent to a two-tailed
t-test.

The use of confidence intervals to make inferences is especially helpful when confidence
intervals are generated for statistics for which the population probability distribution is
unknown. If we have sufficient data and meet all of the required assumptions to construct
confidence intervals in the manner described above using t-statistics, there is probably nothing
stopping us from just performing t-tests on the data without having to rely on confidence
intervals and what they do or do not overlap with. But, if we are dealing with statistics that are
new (and thus untested with regard to their parent distributions) or are generated by repeated-
sampling procedures such as bootstrapping, then we can use those confidence intervals to make
inferences where t-tests may be unavailable to us and/or inappropriate for the data.
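As a sketch of the bootstrapping idea just mentioned (mine, not the author's), a percentile bootstrap interval for a mean resamples the data with replacement many times and takes quantiles of the resampled means:

set.seed(1)   #for reproducibility
x <- c(55, 60, 63, 67, 68)   #the sample from the CI example above
boot.means <- replicate(10000, mean(sample(x, replace=TRUE)))
quantile(boot.means, c(0.025, 0.975))   #a 95% percentile bootstrap CI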

11.1.11 Effect Size with t-tests


A finding of a statistically significant effect means only that a specific null hypothesis has been
rejected for a specific inquiry: for that one statistical test, the conditional likelihood of those
data would have been less than the specified α rate. In the case of a t-test – where the standard
error depends largely on the sample size – the observed difference between a mean and a value
of interest (for a one-sample t-test) or the observed mean difference (for a repeated-measures t
-test) or the difference between sample means (for an independent-groups t-test) might not be
all that impressive, especially for studies involving large n.

The effect size associated with a statistical test is a measure of how big the observed effect – a
sample-level association (e.g., for a correlation) or a sample-level difference (e.g., for a t-
test) – is. We have already encountered one pair of effect-size statistics in the context of
correlation and regression: r (more specifically, the magnitude of r) and R². The product-
moment correlation r is the slope of the standardized regression equation: the closer the
changes in z_y are to corresponding changes in z_x, the stronger the association. The
proportional reduction in error R² is a measure of how much of the variance is explained by
the model rather than associated with the error.15 The magnitude of the r statistic was
associated with guidelines for whether the correlation was weak (0.1 ≤ r < 0.3), moderate (
0.3 ≤ r < 0.5), or strong (r ≥ 0.5): in Figure 11.5, scatterplots of data with weak, moderate,
and strong correlations are shown. As described by Cohen (2013)16, weak effects are present
but difficult to see, moderate effects are noticeable, and strong effects are obvious.

Figure 11.5: Weak, Moderate, and Strong Correlations

In the context of t-tests, effect sizes refer to differences. The effect size associated with a t-test
is the difference between the observed mean statistic (x̄ for a one-sample test, d̄ for a
repeated-measures test, or x̄₁ − x̄₂ for an independent-groups test) and the population mean
parameter (μ for a one-sample test, μ_d for a repeated-measures test, or μ₁ − μ₂ for an
independent-groups test) specified in the null hypothesis. However, because the size of an
observed difference can vary widely depending on what is being measured and in what units,
the raw effect size can be misleading to the point of uselessness. The much more commonly
used measure of effect size for the difference between two things is the standardized effect
size: the difference divided by the standard deviation of the data. The standardized effect size
is actually so much more commonly used that when scientists speak of effect size for
differences between two things, they mean the standardized effect size (the standardized is
understood).

In frequentist statistics, the standardized effect size goes by the name of Cohen’s d. Cohen’s d
addresses the issue of the importance of sample size to statistical significance by removing n
from the t equation, using standard deviations in the denominator instead of standard errors.
Otherwise, the calculation of d is closely aligned with the formulae for the t-statistics for each
flavor of t-test. Just as each type of t-test is calculated differently but produces a t that can be
evaluated using the same t-distribution, the standardized effect size calculations for each t-test
produce a d that can be evaluated independently of the type of test.

For the one-sample t-test:

$$d = \frac{\bar{x} - \mu_{H_0}}{sd_x}$$

for the repeated-measures t-test:

$$d = \frac{\bar{d} - \mu_d}{sd_d}$$

And for the independent-groups t-test:

$$d = \frac{\bar{x}_1 - \bar{x}_2 - \Delta}{sd_p}$$

where the pooled standard deviation sd_p is the square root of the pooled variance.

Note: although d can take on an unbounded range of values, d is usually reported as positive,
so the d we report is more accurately |d|.

The Cohen (2013)17 guidelines for interpreting d are:

Range of d Effect-size Interpretation


0.2 ≤ d < 0.5 weak
0.5 ≤ d < 0.8 moderate
d ≥ 0.8 strong

but, as Cohen himself wrote in the literal book on effect size:

“there is a certain risk inherent in offering conventional operational definitions for those
terms for use in power analysis in as diverse a field of inquiry as behavioral science”
so please interpret effect-size statistics with caution.

11.1.11.1 Effect-size Examples

On this page, we have already calculated the values we need to produce d statistics for
examples of a one-sample t-test, a repeated-measures t-test, and an independent-groups t-test:
all we need are the differences and the standard deviations from each set of calculations.

11.1.11.1.1 One-sample d example

For our one-sample t-example, we had a set of temperatures produced by 10 freezers:

Freezer    Temperature (°C)
   1            -2.14
   2            -0.80
   3            -2.75
   4            -2.58
   5            -2.26
   6            -2.46
   7            -1.33
   8            -2.85
   9            -0.93
  10            -2.01

The mean temperature is -2.01, and the standard deviation of the temperatures is 0.74. In this
example, the null hypothesis was μ ≥ 0°C, so the numerator of the d statistic is −2.01 − 0,
with the denominator equal to the observed standard deviation:

$$d = \frac{\bar{x} - \mu}{sd_x} = \frac{-2.01 - 0}{0.74} = -2.71$$

For these data, d = −2.71, which we would report as d = 2.71. Because d ≥ 0.8, these data
represent a large effect.
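A quick R check of that arithmetic (a sketch of mine, with freezer.data typed in from the table above):

freezer.data <- c(-2.14, -0.80, -2.75, -2.58, -2.26, -2.46, -1.33, -2.85, -0.93, -2.01)
mean(freezer.data)/sd(freezer.data)   #-2.71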

11.1.11.1.2 Repeated-measures d example

For our repeated-measures t-test example, we used the following data about ovens and cakes:

Cake    Pre-bake (°C)    Post-bake (°C)    Difference (Post − Pre)
  1         20.83            100.87                 80.04
  2         19.72             98.58                 78.86
  3         19.64            109.09                 89.44
  4         20.09            121.83                101.74
  5         22.25            122.78                100.53
  6         20.83            111.41                 90.58
  7         21.31            103.96                 82.65
  8         22.50            121.81                 99.31
  9         21.17            127.85                106.68
 10         19.57            115.17                 95.60

As when calculating the t-statistic, to calculate d we are only interested in the difference
scores. The mean difference score is 92.54 and the standard deviation of the difference scores
is 9.77. We previously tested these data with both H₀: μ_d ≤ 0°C and H₀: μ_d ≤ 90°C: let's
just examine the effect size for H₀: μ_d ≤ 0°C, but keep in mind that effect size – like t's and
p-values – depends in part on the choice of null hypothesis. With H₀: μ_d = 0:

$$d = \frac{\bar{d} - \mu_d}{sd_d} = \frac{92.54 - 0}{9.77} = 9.48$$

The d for these data is 9.48, indicating a large effect.18
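The same check in R for the repeated-measures case, assuming prebake and postbake vectors typed in from the table above (these assignments aren't shown in the original code, but the names match those used in the Bayesian t-test examples later on this page):

prebake <- c(20.83, 19.72, 19.64, 20.09, 22.25, 20.83, 21.31, 22.50, 21.17, 19.57)
postbake <- c(100.87, 98.58, 109.09, 121.83, 122.78, 111.41, 103.96, 121.81, 127.85, 115.17)
mean(postbake - prebake)/sd(postbake - prebake)   #approximately 9.48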

11.1.11.1.3 Independent-groups d Example

Finally, for the independent-groups t-test example, we examined the following data:

Time to Toast (s)

Mark I    Mark II
 4.72       9.19
 7.40      11.44
 3.50       9.64
 3.85      12.09
 4.47      10.80
 4.09      12.71
 6.34      10.04
 3.30       9.06
 7.13       6.31
 4.99       9.44

The numerator of d for independent groups is the difference between the means x̄₁ − x̄₂ and
the hypothesized population-level difference μ₁ − μ₂. For these data, the null hypothesis was
μ₁ − μ₂ = 0, so for this d, the numerator is simply x̄₁ − x̄₂ = −5.09. The denominator is the
pooled standard deviation, which is the square root of the pooled variance; for these data, the
pooled standard deviation is sd_p = 1.66. Thus:

$$d = \frac{-5.09}{1.66} = -3.07$$

and d = |d| = 3.07 represents a large effect.
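And the independent-groups check, with the pooled variance computed from the standard formula (a sketch of mine; Mark.I and Mark.II are typed in from the table above):

Mark.I <- c(4.72, 7.40, 3.50, 3.85, 4.47, 4.09, 6.34, 3.30, 7.13, 4.99)
Mark.II <- c(9.19, 11.44, 9.64, 12.09, 10.80, 12.71, 10.04, 9.06, 6.31, 9.44)
n1 <- length(Mark.I); n2 <- length(Mark.II)
pooled.var <- ((n1-1)*var(Mark.I) + (n2-1)*var(Mark.II))/(n1+n2-2)
(mean(Mark.I) - mean(Mark.II))/sqrt(pooled.var)   #approximately -3.07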

11.1.11.2 A Note on Effect Size

The main difference between a t statistic and its corresponding d statistic is the sample size.
That’s by design: effect size measurement is meant to remove the impact of sample size from
assessing the magnitude of differences. Thus, it is possible to have things like statistically
significant effects with very small (< 0.2) sizes. Another possible outcome of that separation
is that the test statistic and the effect size statistic can be distinct to the point of disagreement: a
small, medium, or even large d value can be computed for data where there is no significant
effect. If there is no significant effect, then there is no effect size to measure: please don’t
report an effect size if the null hypothesis hasn’t been rejected.

11.1.12 Power of t-tests

11.1.12.1 What is power?

Figure 11.6: Love watching you dunk on Littlefinger, but for our purposes that’s not very
helpful, Cersei

Statistical power is the rate at which population-level similarities (as in a correlation) or
differences (as in a t-test) show up in a statistical test (which is, by definition, a sample-level
test). As noted in Classical and Bayesian Inference, in the Classical (Frequentist) framework,
there are two states of nature for a population-level effect – it either exists or does not – and
two possibilities for the result of a statistical test – either reject H₀ or continue to assume H₀.

                                        Is a Population-level Effect Present?
Decision Regarding H₀                   Yes           No
Reject H₀                               Hit           α Error
Continue to Assume H₀                   β Error       Correct H₀ Assumption

There are two types of errors we can make in this framework. The Type-I error, also known as
the α error, is a false alarm: it happens when there is no effect – similarity or difference –
present in the population-level data but we reject H₀ regardless. The Type-II error, also known
as the β error, is a miss: it happens when there is an effect – a similarity or a difference –
present in the population-level data but we continue to assume H₀ regardless.

The rate at which we make the α error is more-or-less the α rate.19 When we set the α rate –
most frequently it is set at α = 0.05, but more stringent rates like α = 0.01 or α = 0.001 are
also used – we are implicitly agreeing to a decision-making algorithm that will lead to false
alarms α% of the time given the assumptions of the statistical test we are using.

For example, if we are conducting a one-sample t-test with H₀: μ = 0, H₁: μ ≠ 0, df = 20,
and α = 0.05, we are saying that our null hypothesis is that our observed sample is drawn
from a normal distribution with μ = 0. If the mean of the population from which we are
sampling is, in fact, 0, and we were to run the same experiment some large, infinity-
approaching number of times, we would expect the resulting t-statistics to be distributed thus:

Figure 11.7: A Basic t-distribution with df = 20

With α = 0.05 and a two-tailed test, the rejection regions for the t-distribution with df = 20
are defined as anything less than t = −2.09 and anything greater than t = 2.09. If an observed
t is located in either of those regions, H₀ will be rejected. However, if the null hypothesis is
true, then exactly 5% of t-statistics calculated based on draws of n = 21 (that's df + 1) from
a normal distribution with a mean of 0 will naturally live in those rejection regions, as we can
see by superimposing line segments representing the boundaries of the rejection regions over
the null t-distribution in Figure 11.8:

Figure 11.8: Rejection Regions for a Basic t-distribution with df = 20

We can simulate the process of taking a large number of samples from a normal distribution
with μ = 0 with the rnorm() command in R. In addition to μ = 0, we can stipulate any value
for σ² because it will all cancel out with the calculation of the t-statistic. Let's take 1,000,000
samples of n = 21 each from a normal distribution with μ = 0, σ² = 1 and calculate a one-
sample t-statistic for each based on H₀: μ = 0 ($t = \frac{\bar{x}}{s/\sqrt{n}}$). With df = n − 1 = 20, the t-
values that define a lower-tail area of 2.5% and an upper-tail area of 2.5% are -2.09 and 2.09,
respectively. Taking repeated samples from the normal distribution defined by the null
hypothesis, we should expect 2.5% of the observed samples to produce t-statistics that are less
than or equal to -2.09 and another 2.5% to produce t-statistics that are greater than or equal to
2.09. If we take 1,000,000 samples, that means about 25,000 should be less than or equal to
-2.09 and another 25,000ish should be greater than or equal to 2.09. Let's see what we get!

n=21 #Number of observations per sample
k=1000000 #Number of samples
t.statistics<-rep(NA, k) #Vector to hold the t-statistics

for (i in 1:k){
  sample<-rnorm(n, mean=0, sd=1)
  t<-mean(sample)/(sd(sample)/sqrt(n))
  t.statistics[i]<-t
}
t.table<-cut(t.statistics, c(-Inf, qt(0.025, df=20), qt(0.975, df=20), Inf))
table(t.table)

## t.table
## (-Inf,-2.09] (-2.09,2.09] (2.09, Inf]
## 24907 950070 25023

That is all to say that the rates at which α-errors are made are basically fixed by the α-rate.

The rates at which β-errors are made are not fixed, and are only partly associated with α-
rates. In the example illustrated in Figure 11.8, any observed t-statistic that falls between the
rejection regions will lead to continued assumption of H₀. That would be true even if the
sample were drawn from a different distribution than the one specified in the null, that is, if the
null were false. That is the β error: when the null is false – as in, the sample comes from a
different population than the one specified in the null – but we continue to assume H₀ anyway.
The complement of the β error – the rate at which the null is false and H₀ is rejected in
favor of H₁ – is therefore 1 − β.

Power is 1 − β: the complement of the β-rate.

Power is the rate at which the null hypothesis is rejected given that the alternative is true.

To illustrate, please imagine that H₀ defines a null distribution with μ = 0 and σ² = 1 but that
the observed samples instead come from a different distribution – an alternative distribution
– with μ = 2 and σ² = 1. If this were the case, the central limit theorem tells us that the
alternative sampling distribution will be a normal distribution with a mean of 2 and a variance
of 1/n. We can represent the set of t-statistics we would expect from this distribution with a
regular t-distribution shifted two units to the right: an alternative t-distribution with t̄ = 2
and df = n − 1. Figure 11.9 presents both the null and alternative parent distributions (i.e., the
normals) and the null and alternative sampling distributions (i.e., the t's).
Figure 11.9: Null and Alternative Distributions of N (top) and t (bottom)

In this particular situation – with H₀: μ = 0 and H₁: μ ≠ 0, with df = 20, and with the t-values
that define the rejection regions being -2.09 and 2.09 – the proportion of the alternative t-
distribution that lives between the rejection regions is given by pt(2-2.09, df=20) = 0.46.
That means that we would expect approximately 46% of the t-statistics generated by sampling
from the alternative distribution to lead to continuing to assume H₀; thus, we expect to commit
the β-error about 46% of the time. We would also expect to correctly reject H₀ approximately
54% of the time: the power is therefore 0.54 (power is usually expressed as a decimal,
although expressing it as a percentage wouldn't be wrong). See Figure 11.10 for an illustration.

Figure 11.10: Visualization of β and Power for a t-test Where H₀: μ = 0 and the True
Population μ = 2

So the power for this particular situation is 0.54: there is a real population-level effect (in
this case, a population mean that is not equal to 0) and a 54% chance that H₀ will be rejected
when a sample is taken.

The minimum limit to power, given the stipulated α-rate, is α itself. That minimum would
occur if the alternative distribution completely overlapped the null distribution: the only place
from which a sample could be drawn that would lead to rejecting H₀ would be in the rejection
region. Of course, an alternative distribution can't be identical to the null distribution –
otherwise it wouldn't really be an alternative to the null – so the case of complete overlap is
really a theoretical, asymptotic limit. The maximum limit to power is 1, occurring if there were
no overlap whatsoever between the null and alternative distributions. That situation is also
theoretically impossible in the case of the t-test because t-distributions, like normal
distributions, extend forever in each direction, but given enough separation between null and
alternative distributions, the power can approach 1. Figure 11.11 illustrates both limits
in the context of a one-tailed test (just for the graphical simplicity afforded by a single-tailed
rejection region).

Figure 11.11: Minimum Power (α) and Maximum Power (≈ 1)

Figure 11.12: Settle down, Palpatine. Power has a lower limit of α and an upper limit of 1.

11.1.12.2 Power Analysis for t-tests


In the foregoing description of power, certain values were stipulated in order to estimate
power. Power analysis is a procedure for using stipulations on effect size and power to predict
required sample size, stipulating effect size and sample size to predict power, and/or
stipulating power and sample size to predict effect size. Typically, power analyses stipulate the
first pair: given a desired power level – 0.8 and 0.9 are popular choices – and an effect size –
which is predicated on the approximate effect size expected, usually based on the results of
previous similar experiments – to estimate the number of participants that will be required in
an experiment. Such analyses are particularly valuable to planning experiments and/or securing
research funding: if one is planning a study or asking a funding agency for resources, one
would like to be able to show a relatively high probability of finding significant effects (given
that population-level effects exist).

Power analyses for t-tests can be conducted by solving simultaneous equations: given a null
distribution with rejection regions defined by α and the effect size d, we can calculate the area
under the alternative distribution curve – the power – as a function of n. If you’d rather not –
and I wouldn’t blame you one bit – there are several software packages that can give
convenient power analyses with a minimum effort. For example, using the pwr package, we can
use the following code to find out how many participants we would expect to need to have
power of 0.8 given an effect size d = 0.5 (for a medium-sized effect) and an α-rate of 0.05 for
a two-tailed hypothesis on a one-sample t-test:

library(pwr)
pwr.t.test(n=NULL, d=0.5, sig.level=0.05, power=0.8, type="one.sample", alternative="two.sided")

##
## One-sample t test power calculation
##
## n = 33.36713
## d = 0.5
## sig.level = 0.05
## power = 0.8
## alternative = two.sided

Using the same pwr package, we can also plot a power curve indicating the relationship
between n and power by wrapping the pwr.t.test() command in plot():

plot(pwr.t.test(n=NULL, d=0.5, sig.level=0.05, power=0.8, type="one.sample", alternative="two.sided"))


11.1.12.3 Notes on Power

11.1.12.3.1 “Observed” or “Retrospective” Power

Occasionally, a scientific paper will report the “observed” or “retrospective” power of an
experiment that has been conducted. That means that based on the observed difference between
a sample mean (or a mean difference, or a difference of sample means) and the null hypothesis
μ, the variance (or pooled variance) of the data, the α-rate, and n, the authors have calculated

what the power should have been, assuming that an alternative distribution can be estimated
from the observed data. That is a bad practice, for at least a couple of reasons. First: it
assumes that the alternative distribution is something that can be estimated rather than
stipulated, putting aside that the point of power analysis is to stipulate various reasonable
conditions for an experiment. Second: citing the probability of an event after an event has
occurred isn’t terribly meaningful – it’s like losing a bet but afterwards saying that you should
have won because you had a 51% chance of winning. So, like reporting an effect size for a
non-significant effect: don’t do it.

11.1.12.3.2 Misses vs. True Null Effects

In the null-hypothesis testing framework, failing to reject the null hypothesis can happen for one
of two reasons:

1. There is a population-level effect but the scientific algorithm didn’t detect it, in other
words: a type-II error

2. There is no population-level effect, in other words: a correct continuation of assuming H₀.

A major flaw of frequentist null-hypothesis testing is that there is no way of knowing whether
reason 1 (a type-II error) or reason 2 (a correct assumption of H₀) is behind a failure to reject
the null. A non-significant statistical result could be the product of either a failed experiment or
a successful one: there's no way of knowing.

Thus, there is perfectly-warranted skepticism regarding null results: statistical analyses that
show neither differences nor similarities, which in the null-hypothesis testing framework take
the form of failure to reject H₀. Such results go unsubmitted and unpublished at rates that we

can’t possibly know precisely because they are neither submitted nor published (this is known
as the file-drawer effect). The problem is that sometimes null results are really important. A
famous example is the finding that the speed of light is unaffected by the rotation of the earth. In
psychological science, null results in cognitive testing between people of different racial,
ethnic, and/or gender identification are both scientifically and socially important; if we only
publish the relative few studies that find small differences (which themselves may be type-I
errors), those findings receive inordinate attention among what should be a sea of contradictory
findings.

In classical statistics, confidence intervals likely provide the best way of the available options
to support null results – confidence intervals that include 0 imply that 0 will be sampled in the
majority of repeated tests. Bayesian methods include more natural and effective ways of
producing support for the null: in the Bayesian framework, results including zero-points can be
treated as would be any other posterior distribution. One can create Bayesian credible
interval estimates that include values like 0 and can calculate Bayes Factors indicating support
for posterior models that include null results. So: point for Bayesians there.

11.2 Bayesian t-tests


We previously used Bayesian methods to create a posterior distribution on a proportion
parameter π using the Metropolis-Hastings algorithm. From that distribution, we could
calculate things like the probability that the parameter fell into a given range, or the highest-
density interval of the parameter.

We can do the same thing for mean data. For a relatively (that word is doing a lot of work
here) simple example, we can use the MH algorithm to produce a posterior distribution on the
mean freezer temperature from the one-sample t-test example above. (Note: this code also
calculates the posterior distribution on the variance, but I included that more as a way to let the
variance…vary).

#MCMC estimation of one-sample data (the freezer data, as defined above)
library(ggplot2)
library(ggthemes)

iterations<-1000000
means<-c(-2, rep(NA, iterations-1))
vars<-c(0.5, rep(NA, iterations-1))

for (i in 2:iterations){
  mean.prop<-runif(1, -5, 1)
  vars.prop<-runif(1, 0.00001, 2)
  u<-runif(1)
  r<-prod(dnorm(freezer.data, mean.prop, sqrt(vars.prop)))/
    prod(dnorm(freezer.data, means[i-1], sqrt(vars[i-1])))
  if (r>=1){
    means[i]<-mean.prop
    vars[i]<-vars.prop
  } else if (r > u){
    means[i]<-mean.prop
    vars[i]<-vars.prop
  } else {
    means[i]<-means[i-1]
    vars[i]<-vars[i-1]
  }
}

ggplot(data.frame(means), aes(means))+
  geom_histogram(binwidth=0.02)+
  theme_tufte(base_size=16, base_family="sans", ticks=FALSE)

Figure 11.13: Posterior Distribution of Mean Freezer Temperatures

This posterior distribution indicates that the probability that the mean freezer temperature is
less than 0 is 1, and that the 95% highest-density interval of the mean is
p(−2.59 ≤ x̄ ≤ −1.44) = 0.95.

But, Bayesian analyses requiring more complexity can be a problem, especially if we are
interested in calculating the Bayes factor. When we calculated the Bayes factor comparing two
binomial models that differed only in the specification of the π parameter, the change from the
prior odds to the posterior odds was given by the ratio of two binomial likelihood formulae:

$$B.F. = \frac{p(D|H_1)}{p(D|H_2)} = \frac{\frac{20!}{16!\,4!}(0.8)^{16}(0.2)^{4}}{\frac{20!}{16!\,4!}(0.5)^{16}(0.5)^{4}} = \frac{(0.8)^{16}(0.2)^{4}}{(0.5)^{16}(0.5)^{4}}$$

For more complex models than the binomial – even the normal distribution works out to be a
much more complex model than the binomial model – the Bayes factor is calculated by
integrating competing models over pre-defined parameter spaces, that is, areas where we might
reasonably expect the posterior parameters to be. Given a comparison between a model 1 and
a model 0, that looks basically like this:

$$B.F._{10} = \frac{\int p(D|\theta_{H_1})\,p(\theta_{H_1})\,d\theta_{H_1}}{\int p(D|\theta_{H_0})\,p(\theta_{H_0})\,d\theta_{H_0}}$$

where θ represents the model parameter(s). In addition to the difficulty of integrating
probability distributions, specifying the proper parameter space involves its own set of tricky
decisions.

Several R packages have recently been developed to make that process much, much easier. In
this section, I would like to highlight the BayesFactor package20, which does a phenomenal
job of putting complex Bayesian analyses into familiar and easy-to-use formats.

ttestBF() is a key function in the BayesFactor package. It produces results of Bayesian t-


tests using commands that are nearly identical to those used in R to produce results of classical
t-tests. Before we get to those blessedly familiar-looking commands, let’s take a brief look at

the theory behind the Bayesian t-test.

The key insight that makes the Bayesian t-test both computationally tractable and more familiar
to users of the classical version comes from Gönen et al (2005), who formulated t-test data in
a way similar to the regression-model form of the t-test. Taking for example an independent-
groups t-test, the observed data points y in each group are distributed as normals with a mean
determined by the grand mean of all the data μ plus or minus half the unstandardized effect
size σδ (where δ is the standardized effect size) and a variance equal to the variance of the data:

$$\text{Group 1: } y_{1i} \sim N\!\left(\mu + \frac{\sigma\delta}{2},\ \sigma^2\right)$$

$$\text{Group 2: } y_{2j} \sim N\!\left(\mu - \frac{\sigma\delta}{2},\ \sigma^2\right)$$

where i = 1, 2, ..., n₁ and j = 1, 2, ..., n₂.

This insight means that the null model (not to be confused with a null hypothesis because that's
the other kind of stats – this is just a model that says there is no difference between the groups)
M₀ is one where there is no effect size (δ = 0), and that can be compared to another model
M₁ where there is an effect size (δ ≠ 0). Moreover, a prior distribution can be stipulated for
δ and, based on the data, a posterior distribution can be derived based on that prior
distribution and the likelihood of the data given the prior.21

For the prior distribution, the Cauchy distribution – also awesomely known as The Witch of
Agnesi – is considered ideal for its shape and its flexibility. The Cauchy distribution has a
location parameter (x₀) and a scale parameter (formally γ, but the BayesFactor output calls it
r so we'll comply with that), but for our purposes the location parameter will be fixed at 0.
When the scale parameter r = 1, the resulting Cauchy distribution is the t-distribution with
df = 1. The ttestBF() command in the BayesFactor package allows any positive value for
r to be entered in the options: smaller values indicate prior probabilities that favor smaller
effect sizes; larger values indicate prior probabilities favoring larger effect sizes. Figure 11.14
illustrates the Cauchy distribution with the three categorical options for width of the prior
distribution in the ttestBF() command: the default “medium” which corresponds to
r = 1/√2 ≈ 0.71, “wide” which corresponds to r = 1, and “ultrawide” which corresponds
to r = √2 ≈ 1.41.

Figure 11.14: The Cauchy Distribution with Scale Parameters 1/√2 (for Medium Priors), 1 (for
Wide Priors), and √2 (for Ultra-Wide Priors)

The main output of the ttestBF() command is the Bayes factor between model 1 – the model
determined by the maximum combination of the prior hypothesis and likelihood function – and
model 0 – the model that assumes no effect size. To interpret Bayes factors, the following sets
of guidelines have been proposed – both are good – reposted here from the page on classical
and Bayesian inference:

Jeffreys (1961)

Bayes Factor     Interpretation
1 → 3.2          Barely Worth Mentioning
3.2 → 10         Substantial
10 → 31.6        Strong
31.6 → 100       Very Strong
> 100            Decisive

Kass & Raftery (1995)

Bayes Factor     Interpretation
1 → 3.2          Not Worth More Than a Bare Mention
3.2 → 10         Substantial
10 → 100         Strong
> 100            Decisive

The ttestBF() output also includes the value of r that was used to define the Cauchy
distribution for the prior and a margin-of-error estimate for the Bayes factor calculation (which
will be ~0 unless there are very few data points in the analysis).

To start, let’s conduct a Bayesian t-test on the example data we used for the one-sample t-test:

library(BayesFactor)

## Loading required package: coda

## Loading required package: Matrix

##
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':


##
## expand, pack, unpack

## ************
## Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact
Richard Morey (richarddmorey@gmail.com).
##
## Type BFManual() to open the manual.
## ************

ttestBF(c(freezer.data))

## Bayes factor analysis


## --------------
## [1] Alt., r=0.707 : 1701.953 ±0%
##
## Against denominator:
## Null, mu = 0
## ---
## Bayes factor type: BFoneSample, JZS

The result of the Bayes factor analysis, which uses a Cauchy prior on δ with a scale parameter
of r = 1/√2 (which is the default for ttestBF() and the value one would use if expecting a
medium-sized effect), indicates that the posterior odds in favor of the alternative model
increase the prior odds by a factor of 1,702; that is, the posterior model is about 1,702 times as
likely as the null model. That result agrees with the significant difference from 0 we got with
the frequentist t-test – the alternative is more likely than the null.

We can also wrap the ttestBF() results in the posterior() command – specifying in the
options the number of MCMC iterations we want to use – to produce posterior estimates of the
population mean, the population variance, the effect size (delta), and a parameter called g that
you do not need to worry about. To get summary statistics for those variables, we then wrap
posterior() in the summary() command. While the Bayes factor is probably the most
important output, the summary statistics on the posterior distributions help fill out the reporting
of our results.

summary(posterior(ttestBF(freezer.data), iterations=5000))

##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu -1.9338 0.2877 0.004068 0.004438
## sig2 0.7994 0.5505 0.007785 0.011190
## delta -2.4209 0.7219 0.010209 0.013686
## g 27.1980 192.7326 2.725651 3.067133
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu -2.4948 -2.1182 -1.9351 -1.7574 -1.345
## sig2 0.2773 0.4768 0.6607 0.9597 2.129
## delta -3.9121 -2.9096 -2.3910 -1.9103 -1.099
## g 0.5086 2.0731 4.4705 11.5516 130.982

If we desire visualizations of the posterior distributions, we can wrap the posterior()


command with plot() instead of summary (the traces indicate the different values of statistics
produced at each iteration of the MCMC process).

plot(posterior(ttestBF(freezer.data), iterations=5000))
As shown with the examples above, ttestBF() uses the same syntax as t.test(). It is
therefore straightforward to translate the commands for the classical repeated-measures t-test
into the Bayesian repeated-measures t-test (with the posterior() options in there for
completeness):

ttestBF(prebake, postbake, paired=TRUE, mu=90)

## t is large; approximation invoked.

## Bayes factor analysis


## --------------
## [1] Alt., r=0.707 : 5013929000 ±0%
##
## Against denominator:
## Null, mu = 90
## ---
## Bayes factor type: BFoneSample, JZS

summary(posterior(ttestBF(prebake, postbake, paired=TRUE, mu=90), iterations=5000))

## t is large; approximation invoked.

##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu -92.315 3.749 5.301e-02 5.301e-02
## sig2 140.810 90.368 1.278e+00 1.787e+00
## delta -8.642 2.175 3.076e-02 4.056e-02
## g 1469.930 82453.962 1.166e+03 1.166e+03
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu -99.783 -94.67 -92.342 -90.049 -84.81
## sig2 49.556 85.03 116.378 169.251 376.78
## delta -13.184 -10.05 -8.533 -7.083 -4.66
## g 6.995 25.28 53.540 133.545 1736.64

plot(posterior(ttestBF(prebake, postbake, paired=TRUE, mu=90), iterations=5000))

## t is large; approximation invoked.


And it is just as easy to translate the independent-groups t-test into the Bayesian independent-
groups t-test.

ind.groups.BF<-ttestBF(Mark.I, Mark.II, paired=FALSE)

ind.groups.BF

## Bayes factor analysis


## --------------
## [1] Alt., r=0.707 : 5682.522 ±0%
##
## Against denominator:
## Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

summary(posterior(ind.groups.BF, iterations=5000))

##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 7.517 4.092e-01 5.786e-03 5.786e-03
## beta (x - y) -4.842 8.401e-01 1.188e-02 1.319e-02
## sig2 3.303 1.319e+00 1.865e-02 2.275e-02
## delta -2.803 6.935e-01 9.808e-03 1.229e-02
## g 346.610 1.788e+04 2.528e+02 2.528e+02
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 6.6793 7.251 7.514 7.783 8.308
## beta (x - y) -6.5054 -5.390 -4.851 -4.319 -3.114
## sig2 1.6687 2.412 3.013 3.869 6.604
## delta -4.1766 -3.269 -2.799 -2.326 -1.461
## g 0.7926 2.727 6.012 15.278 171.300

plot(posterior(ind.groups.BF, iterations=5000))
1. I think it would be more appropriate to say that the limit of the variance of the sample
means as n goes to infinity is 0 and that the proper mathematical way to say it is not
σ²/∞ = 0 but rather

$$\lim_{n \to \infty} \frac{\sigma^2}{n} = 0$$

but I'm not much of a math guy.↩

2. The qualifier “finite” is used to exclude a sample of the entire population, because in that
(practically impossible) case the only possible sample mean is the population mean↩

3. I know, I glanced over the links above and 0°C is not cold enough. Let's just go with it for
the example↩

4. Don’t delete your raw data, though! We won’t need them for the rest of the calculations,
but that doesn’t mean they might not come in handy for a different analysis.↩

5. Although if there is unequal n between the two groups, it is more likely that the data
violate the homoscedasticity assumption↩

6. In practice, the six-step procedure is institutionally internalized – we collectively assume


it is happening – and that is reflected in the way we can throw data into an R command
and immediately get the results without laying out each of the steps. But, a good and
important feature of the six-step procedure is that setting the null and alternative
hypotheses, setting an α-rate, and putting down rules for rejecting the null come before
collecting data, doing calculations (which in the case of the t-test includes estimating the
variance of the parent distribution), and making a decision. Those steps should be a
priori decisions that are in no way influenced by the data or how we analyze them.
Otherwise, we might be tempted to change the rules of the game after knowing the
outcome.↩

7. In this section, we will focus only on one-way tests with two possible categories simply
because this is a page about differences between two things. It is an incredibly arbitrary
distinction and there probably is too little difference between χ² tests with two categories
and χ² tests with three or more categories to justify discussing them separately. But, the
distinction makes a lot more sense for the other tests being discussed in this chapter, so
we will revisit χ² tests again in Differences Between Three or More Things.↩

8. It is also common to say that such a table represents a crosstabulation (or crosstab) of the
responses.↩

9. The probabilities in the exact test follow a hypergeometric distribution, which gives the
probability of drawing s successes out of N trials without replacement.↩

10. If you’re curious, I very much agree: a hot dog is not a sandwich. While we’re here: I
understand how a straw has just one hole in a topological sense, in the sense in which we
use a straw it has two.↩

11. Frank Wilcoxon invented it, Mann and Whitney worked out the important details↩

12. The linked critical value table also includes critical values for a one-tailed test, which
involves multiplying the expected p-value by two. However, that procedure doesn’t really
square with the method of calculation of the U statistic, so I am not convinced by non-
two-tailed U tests.↩

13. Unfortunately, in addition to being the most powerful, it’s the biggest pain in the ass.↩

14. Seeing as the randomization test is also known as the permutation test, it is odd that we
use the combination formula instead of the permutation formula. We could use the
permutation formula to calculate the number of permutations of the data, but we would
end up with the same result, so we use the computationally easier of the two
approaches.↩

15. The R² type of effect size – proportions of variance explained by the effect – will be
revisited in effect-size statistics for ANOVA models.↩

16. Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic
press.↩
17. Cohen, J. (2013). Statistical power analysis for the behavioral sciences. Academic
press.↩

18. There may no more magnificent collection of incredibly large effect sizes than in the
pages of statistics textbooks. Please don’t expect to regularly see effect sizes this large in
the wild.↩

19. The qualifier more-or-less is there to remind us that α won’t be super-precise if the
assumptions of parametric tests are violated, particularly with small sample sizes.↩

20. Richard D. Morey and Jeffrey N. Rouder (2018). BayesFactor: Computation of Bayes
Factors for Common Designs. R package version 0.9.12-4.2.
https://CRAN.R-project.org/package=BayesFactor↩

21. The base rate p(D), as is often the case, cancels out.↩
12 Differences Between Three or More Things
12.1 Differences Between Things and Differences Within Things
Tufts University (“Tufts”) is an institution of higher learning in the Commonwealth of Massachusetts in
the United States of America. Tufts has three campuses: a main campus which covers parts of each of
the neighboring cities of Medford, MA and Somerville, MA; a medical center located in Boston, MA;
and a veterinary school located in Grafton, MA. Here’s a handy map:

Imagine that somebody asked you – given your access to the map pictured above – to describe the
distances between Tufts buildings. What if they asked you if they could comfortably walk or use a
wheelchair to get between any two of the buildings? Your answer likely would be that it depends: if
one wanted to travel between two buildings that are each on the Medford/Somerville campus, or each
on the Boston campus, or each on the Grafton campus, then those distances are small and easily
covered without motorized vehicles. If one wanted to travel between campuses, then, yeah, one would
need a car or at least a bicycle and probably a bottle of water, too.

We can conceive of the distances between the nine locations marked on the map as if they were just
nine unrelated places: some happen to be closer to each other, some happen to be more distant from
each other. Or, we can conceive of the distances as a function of campus membership: the distances are
small within each campus and the distances are relatively large between campuses. In that conception,
there are two sources of variance in the distances: within campus variance and between campus
variance, with the former much shorter than the latter. If one were trying to locate a Tufts building,
getting the campus correct is much more important than getting the location within each campus correct:
if you had an appointment at a building on the Medford/Somerville campus and you went to the wrong
Medford/Somerville campus building, there might be time to recover and find the right building; if you
went to, say, the Grafton campus, you should probably call, explain, apologize, and try to reschedule.

The example of campus buildings (hopefully) illustrates the basic logic of the Analysis of Variance,
more frequently known by the acronym ANOVA. We use ANOVA when there are three or more groups
being compared at a time: we could compare pairs of groups using t-tests, but this would A. inflate the
type-I error rate (the more tests you run, the more chances of false alarms),1 and B. fail to give a clear
analysis of the omnibus effect of a factor along which all of the groups differ.
As noted in the introduction, ANOVA is based on comparisons of sources of variance. For example,
the structure of the simplest ANOVA design – the one-way, between-groups ANOVA – creates a
comparison between the variance in the data that comes from between-groups differences and the
variance in the data that comes from within-groups differences: the question is whether the
differences in the data that are attributable to the group membership of each datapoint are generally
more important than the differences that just occur naturally (and are observable within each group).
The logic of ANOVA is represented visually in Figure 12.1 and Figure 12.2. Figure 12.1 is a
representation of what happens when the populations of interest are either the same or close to it:
there, the variance between data points in different groups is the same as the variance between data
points within each group, and there is no variance added by group membership.

Figure 12.1: Visual Representation of ANOVA Logic: No Difference Between Populations

In Figure 12.2, group membership makes a substantial contribution to the overall variance of the data:
the variance within each group is the same as in the situation depicted in Figure 12.1, but the
differences between the groups slide the populations up and down the x-axis, spreading out the data as
they go.
Figure 12.2: Visual Representation of ANOVA Logic: Significant Difference Between Populations

12.1.1 The F-ratio

The central inference of ANOVA is on the population-level between variance, labeled as $\sigma^2_{between}$ in
Figure 12.1 and Figure 12.2. The null hypothesis for the one-way between-groups ANOVA is:

$$H_0: \sigma^2_{between} = 0$$

and the alternative hypothesis is:

$$H_1: \sigma^2_{between} > 0$$

The overall variance in the observed data can be decomposed into the sum of the two sources of
variance: between and within.² If the null hypothesis is true and there is no contribution of between-
groups differences to the overall variance – which is what is meant by $\sigma^2_{between} = 0$ – then all of the
variance is attributable to the within-group differences: $\sigma^2_{between} + \sigma^2_{within} = \sigma^2_{within}$. If, however, the
null hypothesis is false and there is any contribution of between-groups differences to the overall
variance – $\sigma^2_{between} > 0$ – then there will be variance beyond what is attributable to within-groups
differences: $\sigma^2_{between} + \sigma^2_{within} > \sigma^2_{within}$.

Thus, the ANOVA compares $\sigma^2_{between} + \sigma^2_{within}$ with $\sigma^2_{within}$. More specifically, it compares
$n\sigma^2_{between} + \sigma^2_{within}$ with $\sigma^2_{within}$: the influence of $\sigma^2_{between}$ multiplies with every observation that it
impacts (e.g., the influence of between-groups differences causes more variation if there are n = 30
observations per group than if there are n = 3 observations per group). The way that ANOVA makes
this comparison is by evaluating ratios of between-groups variances and within-groups variances: at
the population level, if the ratio is equal to 1, then the two things are the same; if the ratio is greater
than 1, then the between-groups variance is at least a little bit bigger than 0. Of course, as we've
seen before with inference, if we estimate that ratio using sample data, any little difference isn't going
to impress us the way a hypothetical population-level little difference would: we're likely going to
need a sample statistic that is comfortably greater than 1.

When we estimate the between-groups-plus-within-groups variance and the within-groups variance using
sample data, we calculate what is known as the F-ratio, also known as the F-statistic, or just plain F.³
The cumulative likelihood of observed F-ratios is evaluated in terms of the F-distribution. When
we calculate an F-ratio, we are calculating the ratio between two variances: between variance and
within variance. As we know from the chapter on probability distributions, variances are modeled by
χ² distributions: χ² distributions comprise the squares of normally-distributed values, and given the
normality assumption, the numerator of the variance calculation $\sum(x - \bar{x})^2$ is a normally-distributed
set of variables (and dividing by the constant n − 1 doesn't disqualify it). Developed to model the
ratio of variances, the F-distribution is literally the ratio of χ² distributions. The F-distribution has
two sufficient statistics: $df_{numerator}$ and $df_{denominator}$, which correspond to the df of the χ²
distribution in the numerator and the df of the denominator χ² distribution by which it is divided to get
the F-distribution (see illustration in Figure 12.3).

Figure 12.3: Example Derivation of the F-statistic From a Numerator χ² Distribution with df = 3 and
a Denominator χ² Distribution with df = 11
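
If you'd like to see that relationship for yourself, here is a minimal simulation sketch (my own code, not part of the chapter, using the df values from Figure 12.3): the ratio of two χ² variables, each divided by its df, should behave just like draws from the corresponding F-distribution.

set.seed(207)
# ratio of two chi-square variables, each scaled by its own df
f.by.hand <- (rchisq(10000, df = 3) / 3) / (rchisq(10000, df = 11) / 11)
f.directly <- rf(10000, df1 = 3, df2 = 11)
# the quantiles of the two sets of values should be nearly identical
quantile(f.by.hand, probs = c(0.50, 0.95))
quantile(f.directly, probs = c(0.50, 0.95))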

The observed F-ratio is a sample-level estimate of the population-level ratio $\frac{n\sigma^2_{between} + \sigma^2_{within}}{\sigma^2_{within}}$. If the
sample data indicate no difference between the observed variance caused by a combination of
between-groups plus within-groups variance and the variance observed within groups, then F ≈ 1 and
we will continue to assume $H_0$. Otherwise, F >> 1, and we will reject $H_0$ in favor of the $H_1$ that
there is some contribution of the differences between groups to the overall variance in the data.

Because of the assumptions involved in constructing the F -distribution – that at the population-level
the residuals are normally distributed and the within-groups variance is the same across groups – the
assumptions of normality and homoscedasticity apply to all flavors of ANOVA. For some of those
flavors, there will be additional assumptions that we will discuss when we get to them, but normality
and homoscedasticity are universally assumed for classical ANOVAs.

Conceptually, the ANOVA logic applies to all flavors of ANOVA. The tricky part, of course, is how to
calculate the F -ratio, and that will differ across flavors of ANOVA. There are also a number of other
features of ANOVA-related analysis that we will discuss, including estimating effect sizes, making
more specific between-group comparisons post hoc, a couple of nonparametric alternatives, and the
Bayesian counterpart to the classical ANOVA (don’t get too excited: it’s just a “Bayesian ANOVA”).
The flavors of ANOVA are more frequently (but less colorfully) described as different ANOVA models
or ANOVA designs. We will start our ANOVA journey with the one-way between-groups ANOVA,
describing it and its related features before proceeding to more complex models.

Figure 12.4: The features of this ANOVA model include Bluetooth capability, a dishwasher safe
detachable stainless steel skirt and disks, cooking notifications, and a 2-year warranty.

12.2 Between-Groups ANOVA models


12.2.1 One-way

12.2.1.1 The Model

To use a one-way between-groups ANOVA is to conceive of dependent-variable scores as being the
product of a number of different elements. One is the grand mean μ, which is the baseline state-of-
nature average value in a population. Each number is then also influenced by some condition that
causes some numbers to be greater than others. In the case of the one-way between-groups model, the
conditions are the levels j of the factor α.⁴ For example, if the dependent variable is the width of
leaves on plants, the factor α in question is the amount of time exposed to sunlight, and the levels of
the factor are 8 hours/day, 10 hours/day, and 12 hours/day, then $\alpha_1 = 8$, $\alpha_2 = 10$, and $\alpha_3 = 12$, and
the between-groups difference is the difference caused by the j different levels of the between-groups
factor α. The last element is the within-groups variance, which falls into the more general categories
of error or residuals. It is the naturally-occurring variation within each condition. To continue the
example of the plants and the sunlight: not every plant that is exposed to the same sunlight for the same
duration will have leaves that are exactly the same width. In this model, the within-groups influence is
symbolized by $\epsilon_{ij}$, indicating that the error (ϵ) exists within each individual observation (i) and at each
level of the treatment (j).

In a between-groups design, each observation comes from an independent group, that is, each
observation comes from one and only one level of the treatment.

Could we account for at least some of the within-groups variation by identifying individual
differences? Sure, and that’s the point of within-groups designs, but it’s not possible unless we observe
each plant under different sunlight conditions, which is not part of a between-groups design. For more
on when we want to (or have to) use between-groups designs instead of within-groups designs, please
refer to the chapter on differences between two things.

Thus, each value y of the dependent variable is said to be determined by the sum of the grand mean μ,
the mean of the factor-level $\alpha_j$, and the individual contribution of error $\epsilon_{ij}$; and because y is influenced
by both the individual observation and the factor level, it gets the subscript ij:

$$y_{ij} = \mu + \alpha_j + \epsilon_{ij}$$

If you think that equation vaguely resembles a regression model, that’s because it is:

ANOVA is a special case of multiple regression where there is equal n per factor-level.

ANOVA is regression-based much as the t-test can be expressed as a regression analysis. In fact, while
we are making connections, the t-test is a special case of ANOVA. If, for example, you enter data with
two levels of a between-groups factor, you will get precisely the same results – with an F statistic
instead of a t statistic – as you would for an independent-groups t-test.5
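
Here is a minimal sketch of that equivalence (my own code with made-up data, not the chapter's): with two groups, the ANOVA F is the square of the pooled-variance t, and the p-values match exactly.

set.seed(3)
g1 <- rnorm(10)
g2 <- rnorm(10, mean = 1)
t.out <- t.test(g1, g2, var.equal = TRUE)   # pooled-variance (classic) t-test
a.out <- anova(lm(c(g1, g2) ~ factor(rep(1:2, each = 10))))
c(t.squared = unname(t.out$statistic)^2, F = a.out$"F value"[1])  # same value
c(t.p = t.out$p.value, F.p = a.out$"Pr(>F)"[1])                   # same p-value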

12.2.1.2 The ANOVA table

The results of ANOVA can be represented by an ANOVA table. The standard ANOVA table includes
the following elements:

1. Source (or Source of Variance, or SV): the leftmost column represents the decomposition of the
elements that influence values of the dependent variable (except the grand mean μ). For the one-
way between-groups ANOVA, the listed sources are typically between groups (or just between)
and within groups (or just within), with an entry for total variance often included. To increase the
generality of the table – which will be helpful because we are about to talk about a whole bunch
of ANOVA tables – we can call the between variance A to indicate that it is the variance
associated with factor A and we can call the within variance the error.

2. Degrees of Freedom (df ): the degrees of freedom associated with each source. These are the
degrees of freedom that are associated with the variance calculation (i.e., the variance
denominator) for each source of variance.
3. Sums of Squares (SS): The numerator of the calculation for the variance ($\sum(x - \bar{x})^2$) associated
with each source.

4. Mean Squares (MS): The estimate of the variance for each source. It is equal to the ratio of the SS
and df that are respectively associated with each source. Note: MS is not calculated for the
total because there is no reason whatsoever to care about the total MS.

5. F: The F-ratio. The F-ratio is the test statistic for the ANOVA. For the one-way between-groups
design (and also for the one-way within-groups design, but we'll get to that later), there is only one
F-ratio, and it is the ratio of the between-groups MS ($MS_A$) to the within-groups MS ($MS_e$).

In addition to the five columns described above, there is a sixth column that is extremely optional to
include:

6. Expected Mean Squares (EMS): a purely algebraic, non-numeric representation of the
population-level variances associated with each source. As noted earlier, the population-level
comparison represented with sample data by the F-statistic is $\frac{n\sigma^2_{between} + \sigma^2_{within}}{\sigma^2_{within}}$, which can
somewhat more simply be annotated $\frac{n\sigma^2_\alpha + \sigma^2_\epsilon}{\sigma^2_\epsilon}$. In other words, it is the ratio of the between-
groups variance – which is multiplied by the number of observations per group – plus the within-
groups variance, all divided by the within-groups variance. The EMS for each source lists the terms
that are represented by that source. For the between-groups variance – or factor α – the EMS is the
numerator of the population-level F-ratio: $n\sigma^2_\alpha + \sigma^2_\epsilon$. For the within-groups variance – or the
error (ϵ) – it is $\sigma^2_\epsilon$. The EMS are helpful for two things: keeping track of how to calculate the F-
ratio for a particular factor in a particular ANOVA model (admittedly, this isn't really an issue for
the one-way between-groups ANOVA since there is only one way to calculate an F-statistic and
it's relatively easy to remember), and calculating the effect-size statistic ω². Both of those things
can be accomplished without explicitly knowing the EMS for each source of variance, which is
why the EMS column of the ANOVA table is eminently skippable (but, if you do need to include
it – presumably at the behest of some evil stats professor – please keep in mind that there are no
numbers to be calculated for EMS: they are strictly algebraic formulae).

Here is the ANOVA table – with EMS – for the one-way between-groups ANOVA model with
formula guides for how to calculate the values that go in each cell:

| SV | $df$ | $SS$ | $MS$ | $F$ | $EMS$ |
|---|---|---|---|---|---|
| A | $j - 1$ | $n\sum(y_{\bullet j} - y_{\bullet\bullet})^2$ | $SS_A/df_A$ | $MS_A/MS_e$ | $n\sigma^2_\alpha + \sigma^2_\epsilon$ |
| Error | $j(n - 1)$ | $\sum(y_{ij} - y_{\bullet j})^2$ | $SS_e/df_e$ | | $\sigma^2_\epsilon$ |
| Total | $jn - 1$ | $\sum(y_{ij} - y_{\bullet\bullet})^2$ | | | |

In the table above, the bullet (∙) replaces the subscript i, j, or both to indicate that the y values are
averaged across each level of the subscript being replaced. $y_{\bullet j}$ indicates the average of all of the y
values in each level j averaged across all i observations, and $y_{\bullet\bullet}$ is the grand mean of all of the y values
averaged across all i observations in each of the j conditions. Thus, $SS_A$ is equal to the sum of the
squared differences between each group mean $y_{\bullet j}$ and the grand mean $y_{\bullet\bullet}$ times the number of
observations in each group n; $SS_e$ is equal to the sum of the sums of the squared differences between
each observation $y_{ij}$ and the mean $y_{\bullet j}$ of the group to which it belongs; and $SS_{total}$ is equal to the sum
of the squared differences between each observation $y_{ij}$ and the grand mean $y_{\bullet\bullet}$.
In the context of ANOVA, n always refers to the number of observations per group. If one needs to
refer to all of the observations (e.g. all the participants in an experiment that uses ANOVA to analyze
the data), N is preferred to (hopefully) avoid confusion. In the classic ANOVA design, n is always the
same between all groups: if n is not the same between groups, then technically the analysis is not an
ANOVA. That’s not a dealbreaker for statistical analysis, though: if n is unequal between groups, the
data can be analyzed in a regression context with the general linear model and the procedure will
return output that is pretty-much-identical to the ANOVA table. The handling of unequal n is beyond the
scope of this course, but I assure you, it’s not that bad, and you’ll probably get to it next semester.

12.2.1.3 Example

Please imagine that the following data are observed in a between-groups design:

| Condition 1 | Condition 2 | Condition 3 | Condition 4 |
|---|---|---|---|
| -3.10 | 7.28 | 0.12 | 8.18 |
| 0.18 | 3.06 | 5.51 | 9.05 |
| -0.72 | 4.74 | 5.72 | 11.21 |
| 0.09 | 5.29 | 5.93 | 7.31 |
| -1.66 | 7.88 | 6.56 | 8.83 |

Grand Mean = 4.57

12.2.1.3.0.1 Calculating df

There are j = 4 levels of factor α; thus, $df_A = j - 1 = 3$. There are n = 5 observations in each
group, so $df_e = j(n - 1) = 4(4) = 16$. The total df is $df_{total} = jn - 1 = (4)(5) - 1 = 19$. Note that
$df_A + df_e = df_{total}$, and that $df_{total} = N - 1$ (or the total number of observations minus 1). Both of
those facts will be good ways to check your df math as models become more complex.

12.2.1.3.0.2 Calculating $SS_A$

$SS_A$ is the sum of squared deviations between the level means and the grand mean multiplied by the
number of observations in each level. Another way of saying that is that for each observation, we take
the mean of the level to which that observation belongs, subtract the grand mean, square the difference,
and sum all of those squared differences. Because n is the same across groups by the ANOVA
definition, it makes no difference whether we multiply the squared differences between the group means and the
grand mean by n or add up the squared difference between the group mean and the grand mean for every observation; I
happen to find the latter more visually convincing for pedagogical purposes, as in the table below:

| Condition 1 | $(y_{\bullet 1} - y_{\bullet\bullet})^2$ | Condition 2 | $(y_{\bullet 2} - y_{\bullet\bullet})^2$ | Condition 3 | $(y_{\bullet 3} - y_{\bullet\bullet})^2$ | Condition 4 | $(y_{\bullet 4} - y_{\bullet\bullet})^2$ |
|---|---|---|---|---|---|---|---|
| -3.10 | 31.49 | 7.28 | 1.17 | 0.12 | 0.04 | 8.18 | 18.89 |
| 0.18 | 31.49 | 3.06 | 1.17 | 5.51 | 0.04 | 9.05 | 18.89 |
| -0.72 | 31.49 | 4.74 | 1.17 | 5.72 | 0.04 | 11.21 | 18.89 |
| 0.09 | 31.49 | 5.29 | 1.17 | 5.93 | 0.04 | 7.31 | 18.89 |
| -1.66 | 31.49 | 7.88 | 1.17 | 6.56 | 0.04 | 8.83 | 18.89 |

The sum of the values in the squared-deviation columns is 257.94: that is the value of $SS_A$.
12.2.1.3.0.3 Calculating $SS_e$

The $SS_e$ for this model – or, the within-groups variance – is the sum across levels of factor A of the
sums of the squared differences between each $y_{ij}$ value of the dependent variable and the mean $y_{\bullet j}$ of
the level to which it belongs:

| Condition 1 | $(y_{i1} - y_{\bullet 1})^2$ | Condition 2 | $(y_{i2} - y_{\bullet 2})^2$ | Condition 3 | $(y_{i3} - y_{\bullet 3})^2$ | Condition 4 | $(y_{i4} - y_{\bullet 4})^2$ |
|---|---|---|---|---|---|---|---|
| -3.10 | 4.24 | 7.28 | 2.66 | 0.12 | 21.60 | 8.18 | 0.54 |
| 0.18 | 1.49 | 3.06 | 6.71 | 5.51 | 0.55 | 9.05 | 0.02 |
| -0.72 | 0.10 | 4.74 | 0.83 | 5.72 | 0.91 | 11.21 | 5.26 |
| 0.09 | 1.28 | 5.29 | 0.13 | 5.93 | 1.35 | 7.31 | 2.58 |
| -1.66 | 0.38 | 7.88 | 4.97 | 6.56 | 3.21 | 8.83 | 0.01 |

The sum of the values in the squared-deviation columns is 58.82: that is $SS_e$.

12.2.1.3.0.4 Calculating $SS_{total}$

$SS_{total}$ is the total sums of squares of the dependent variable y:

$$SS_{total} = \sum_{i=1}^{N} (y_{ij} - y_{\bullet\bullet})^2$$

To calculate $SS_{total}$, we sum the squared differences between each observed value $y_{ij}$ and the grand
mean $y_{\bullet\bullet}$:

| Condition 1 | $(y_{i1} - y_{\bullet\bullet})^2$ | Condition 2 | $(y_{i2} - y_{\bullet\bullet})^2$ | Condition 3 | $(y_{i3} - y_{\bullet\bullet})^2$ | Condition 4 | $(y_{i4} - y_{\bullet\bullet})^2$ |
|---|---|---|---|---|---|---|---|
| -3.10 | 58.83 | 7.28 | 7.34 | 0.12 | 19.80 | 8.18 | 13.03 |
| 0.18 | 19.27 | 3.06 | 2.28 | 5.51 | 0.88 | 9.05 | 20.07 |
| -0.72 | 27.98 | 4.74 | 0.03 | 5.72 | 1.32 | 11.21 | 44.09 |
| 0.09 | 20.07 | 5.29 | 0.52 | 5.93 | 1.85 | 7.31 | 7.51 |
| -1.66 | 38.81 | 7.88 | 10.96 | 6.56 | 3.96 | 8.83 | 18.15 |

The sum of the squared-deviation values is 316.76, which is $SS_{total}$. It is also equal to $SS_A + SS_e$, which is
important for two reasons:

1. It means we did the math right, and

2. The total sums of squares of the dependent variable is completely broken down by the sums of
squares associated with the between-groups factor and the sums of squares associated with
within-groups variation, with no overlap between the two. The fact that the total variation is
exactly equal to the sum of the variation contributed by the two sources demonstrates the principle
of orthogonality: where two or more factors do not account for the same variance because they
are uncorrelated with each other.

Orthogonality is an important statistical concept that shows up in multiple applications, including one
of the post hoc tests to be discussed below and, critically, all ANOVA models. In every ANOVA model,
the variance contributed to the dependent variable will be broken down orthogonally between each of
the sources, meaning that $SS_{total}$ will always be precisely equal to the sum of the SS for all sources of
variance for each model.
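
Before moving on, here is a quick sketch (my own code, not part of the example) that verifies the hand calculations above; it also defines the condition vectors that the R commands below assume are in the workspace.

Condition1 <- c(-3.10, 0.18, -0.72, 0.09, -1.66)
Condition2 <- c(7.28, 3.06, 4.74, 5.29, 7.88)
Condition3 <- c(0.12, 5.51, 5.72, 5.93, 6.56)
Condition4 <- c(8.18, 9.05, 11.21, 7.31, 8.83)
DV <- c(Condition1, Condition2, Condition3, Condition4)
groups <- factor(rep(1:4, each = 5))
group.means <- ave(DV, groups)             # each observation's own group mean
grand.mean <- mean(DV)
SS.A <- sum((group.means - grand.mean)^2)  # between-groups SS: 257.94
SS.e <- sum((DV - group.means)^2)          # within-groups SS: 58.82
SS.total <- sum((DV - grand.mean)^2)       # total SS: 316.76 = SS.A + SS.e
c(SS.A, SS.e, SS.total)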

12.2.1.3.0.5 Calculating $MS_A$, $MS_e$, and F

As noted in the ANOVA table above, the mean squares for each source of variance is the ratio of the
sums of squares associated with that source divided by the degrees of freedom for that source. For the
example data:

$$MS_A = \frac{SS_A}{df_A} = \frac{257.94}{3} = 85.98$$

$$MS_e = \frac{SS_e}{df_e} = \frac{58.82}{16} = 3.68$$

The observed F-ratio is the ratio of $MS_A$ to $MS_e$, and is annotated with the degrees of freedom
associated with the numerator of that ratio ($df_{numerator}$, usually abbreviated $df_{num}$) and the degrees of
freedom associated with the denominator of that ratio ($df_{denominator}$, usually abbreviated $df_{denom}$) in
parentheses following the letter F:

$$F(3, 16) = \frac{MS_A}{MS_e} = \frac{85.98}{3.68} = 23.39$$

12.2.1.3.0.6 There is nothing to calculate for EMS

The EMS for this model are simply $n\sigma^2_\alpha + \sigma^2_\epsilon$ for factor A and $\sigma^2_\epsilon$ for the error.

Now we are ready to fill in our ANOVA table.

12.2.1.3.0.7 Example ANOVA Table and Statistical Inference

Here is the ANOVA table for the example data:

| SV | $df$ | $SS$ | $MS$ | $F$ | $EMS$ |
|---|---|---|---|---|---|
| A | 3 | 257.94 | 85.98 | 23.39 | $n\sigma^2_\alpha + \sigma^2_\epsilon$ |
| Error | 16 | 58.82 | 3.68 | | $\sigma^2_\epsilon$ |
| Total | 19 | 316.76 | | | |

The F-ratio is evaluated in the context of the F-distribution associated with the observed $df_{num}$ and
$df_{denom}$. For our example data, the F-ratio is 23.39: the cumulative likelihood of observing an F-
statistic of 23.39 or greater given $df_{num} = 3$ and $df_{denom} = 16$ – the area under the F-distribution
curve with those specific parameters at or beyond 23.39 – is:

pf(23.39, df1=3, df2=16, lower.tail=FALSE)

## [1] 4.30986e-06

Since $4.3 \times 10^{-6}$ is smaller than any reasonable value of α, we will reject the null hypothesis that
$\sigma^2_\alpha = 0$ in favor of the alternative hypothesis that $\sigma^2_\alpha > 0$: there is a significant effect of the between-
groups factor.

12.2.1.3.0.8 ANOVA in R

There are several packages to calculate ANOVA results in R, but for most purposes, the base command
aov() will do. The aov() command returns an object that can then be summarized with the summary()
command wrapper to give the elements of the traditional ANOVA table.

aov() expects a formula and data structure similar to what we used for linear regression models and
for tests of homoscedasticity. Thus, it is best to put the data into long format. Using the example data
from above:

DV<-c(Condition1, Condition2, Condition3, Condition4)


Factor.A<-c(rep("Condition 1", length(Condition1)),
rep("Condition 2", length(Condition2)),
rep("Condition 3", length(Condition3)),
rep("Condition 4", length(Condition4)))

one.way.between.groups.example.df<-data.frame(DV, Factor.A)

summary(aov(DV~Factor.A,
data=one.way.between.groups.example.df))

## Df Sum Sq Mean Sq F value Pr(>F)


## Factor.A 3 257.94 85.98 23.39 4.31e-06 ***
## Residuals 16 58.82 3.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

12.2.2 Effect Size Statistics

The effect-size statistics for ANOVA models are based on the same principle as R²: they are measures
of the proportion of the overall variance in the data that is explained by the factor(s) in the model (as
opposed to the error). In fact, one ANOVA-based effect-size statistic – η² – is precisely equal to R²
for the one-way between-groups model (remember: ANOVA is a special case of multiple regression).
Since they are proportions, the ANOVA-based effect-size statistics range from 0 to 1, with larger
values indicating increased predictive power for the factor(s) in the model relative to the error.

We here will discuss two ANOVA-based effect-size statistics: η² and ω².⁶ Estimates of η² and of ω²
tend to be extremely close: if you calculate both statistics for the same data, you are less likely to
observe that the two statistics disagree as to the general size of an effect than you are to see them be
separated by a couple percent, or tenths of a percent, or hundredths of a percent.

Of η² and ω², η² is vastly easier to calculate and, as noted just above, not far off from estimates of ω²
for the same data. η² is also more widely used and understood. The only disadvantage is that η² is a
biased estimator of effect size: it's consistently too big (one can test that with Monte Carlo simulations
and/or synthetic data – using things like the rnorm() command in R – to check the relationship between
the magnitude of a statistical estimate and what it should be given that the underlying distribution
characteristics are known). That bias is usually more pronounced for smaller sample sizes: if you
google "eta squared disadvantage," the first few results will be about the estimation bias of η² and the
problems it has with relatively small n. Again, it's not a gamechanging error, but if you are pitching
your results to an audience of ANOVA-based effect-size purists, they may be looking for you to use the
unbiased estimator ω².

In addition to being unbiased, the other advantage that ω² has is that it is equivalent to the statistic
known as the intraclass correlation (ICC), which is usually considered a measure of the psychometric
construct reliability. ANOVA-based effect-size statistics – including both η² and ω² – can also be
interpreted as measuring the reliability of a measure: if a factor or factors being measured make a big
difference in the scores, then assessing dependent variables based on those factors should return
reliable results over time. In the world of psychometrics – which I can tell you from lived experience
is exactly as exciting and glamorous as it sounds – some analysts report ICC statistics and some report
ω² statistics, and a surprisingly large proportion don't know that they are talking about the same thing.

12.2.2.1 Calculating η²

For the one-way between-groups ANOVA model, η² is a ratio of sums of squares: the sums of squares
associated with the factor $SS_A$ to the total sums of squares $SS_{total}$. It is thus the proportion of
observed sample-level variance associated with the factor:

$$\eta^2 = \frac{SS_A}{SS_{total}}$$

For the example data, the observed η² is:

$$\eta^2 = \frac{SS_A}{SS_{total}} = \frac{257.94}{316.76} = 0.81$$

which is a large effect (see the guidelines for interpreting ANOVA-based effect sizes below).
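
As a check on the η² = R² equivalence noted above, a one-line sketch (using the data frame built in the aov() example earlier):

summary(lm(DV ~ Factor.A, data = one.way.between.groups.example.df))$r.squared
# returns ~0.81, matching SS_A/SS_total = 257.94/316.76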

12.2.2.2 Calculating ω²

The ω² statistic is an estimate of the population-level variance that is explained by the factor(s) in a
model relative to the population-level error.⁷ The theoretical population-level variances are known as
population variance components: for the one-way between-groups ANOVA model, the two
components are the variance from the independent-factor variable $\sigma^2_\alpha$ and the variance associated with
error $\sigma^2_\epsilon$.

As noted above, the between-groups variance in the one-way model is classified as $n\sigma^2_\alpha + \sigma^2_\epsilon$, and is
estimated in the sample data by $MS_A$; the within-groups variance is classified as $\sigma^2_\epsilon$, and is estimated
in the sample data by $MS_e$: these are the terms listed for factor A and for the error, respectively, in the
EMS of the ANOVA table.

ω² is the ratio of the population variance component for factor A, $\sigma^2_\alpha$, divided by the sum of the
population variance components $\sigma^2_\alpha + \sigma^2_\epsilon$:

$$\omega^2 = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\epsilon}$$

As noted above, $MS_e$ is an estimator of $\sigma^2_\epsilon$ (σ gets a hat because it's an estimate):

$$\hat{\sigma}^2_\epsilon = MS_e$$

That takes care of one population variance component. $MS_A$ is an estimator of $n\sigma^2_\alpha + \sigma^2_\epsilon$: if we subtract
$MS_e$ (being the estimate of $\sigma^2_\epsilon$) from $MS_A$ and divide by n, we get the population variance component
$\sigma^2_\alpha$:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}$$

Please note that a simpler and probably more meaningful way to conceive of $\sigma^2_\alpha$ is that it is literally
the variance of the condition means. For the one-way between-groups ANOVA model, it is more
straightforward to simply calculate the variance of the condition means than to use the above formula,
but it gets moderately more tricky to apply that concept to more complex designs.

We're not quite done with $\hat{\sigma}^2_\alpha$ yet: there is one more consideration.

12.2.2.2.1 Random vs. Fixed Effects

The value of $\hat{\sigma}^2_\alpha$ as calculated above has for its denominator $df_A$, which is j − 1. Please recall that the
formula for a sample variance – $s^2 = \frac{\sum(x - \bar{x})^2}{n - 1}$ – also has an n − 1 term in the denominator. In the case

of a sample variance, that’s an indicator that the variance refers to a sample that is much smaller than
the population from which it comes. In the case of a population variance component, that indicates that
the number of levels j of the independent-variable factor A is a sample of levels much smaller than
the total number of possible levels of the factor. When the number of levels of the independent
variable represent just some of the possible levels of the independent variable, that IV is known as a
random-effects factor (the word “random” may be misleading: it doesn’t mean that factors were
chosen out of a hat or via a random number generator, just that there are many more possible levels that
could have been tested). For example, a health sciences researcher might be interested in the effect of
different U.S. hospitals on health outcomes. They may not be able to include every hospital in the
United States in the study, but rather 3 or 4 hospitals that are willing to participate in the study: in that
case, hospital would be a random effect with those 3 or 4 levels (and the within-groups variable
would be the patients in the hospitals).

A fixed-effect factor is one in which all (or, technically, very nearly all) possible levels of the
independent variables are examined. To recycle the example of studying the effect of hospitals on
patient outcomes, we might imagine that instead of hospitals in the United States, the independent
variable of interest were hospitals in one particular city, in which case 3 or 4 levels might represent
all possible levels of the variable. Please note that it is not the absolute quantity of levels that
determines whether an effect is random or fixed but the relative quantity of observed levels to the total
possible set of levels of interest.

When factor A is fixed, we have to replace the denominator j − 1 with j (to indicate that the variance
is a measure of all of the levels of interest of the IV). We can do that pretty easily by multiplying the
population variance component $\hat{\sigma}^2_\alpha$ by j − 1 – which is equal to $df_A$ – and then dividing the result by j
– which is equal to $df_A + 1$. We can accomplish that more simply by multiplying the estimate of $\sigma^2_\alpha$ by
the term $\frac{df_A}{df_A + 1}$.

So, for random effects in terms of Expected Mean Squares:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}$$

$\sigma^2_\alpha$, as noted above, is the variance of the means of the levels of factor A. If factor A is random, then
$\sigma^2_\alpha$ is, specifically, the sample variance of the means of the levels of factor A about the grand mean $y_{\bullet\bullet}$:

$$\hat{\sigma}^2_\alpha = \frac{\sum(y_{\bullet j} - y_{\bullet\bullet})^2}{j - 1}$$

and for fixed effects:

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}\left[\frac{df_A}{df_A + 1}\right]$$

which is equivalent to the population variance of the means of the levels of factor A about the grand
mean $y_{\bullet\bullet}$:

$$\hat{\sigma}^2_\alpha = \frac{\sum(y_{\bullet j} - y_{\bullet\bullet})^2}{j}$$

One more note on population variance components before we proceed to calculating ω² for the sample
data: because the numerator of $\hat{\sigma}^2_x$ (using x as a generic indicator of a factor that can be α, β, γ, etc.
for more complex models) includes a subtraction term (in the above equation, it's $MS_A - MS_e$), it is
possible to end up with a negative estimate of $\hat{\sigma}^2_x$ if the F-ratio for that factor is not significant. This
really isn't an issue for the one-way between-groups ANOVA design, because if $F_A$ is not significant
then there is no effect to report and, as noted in the chapter on differences between two things, we do
not report effect sizes when effects are not significant. But it can happen for population variance
components in factorial designs. If that is the case, we set all negative population variance components
to 0.

Using the example data, assuming that factor A is a random effect:

$$\hat{\sigma}^2_\epsilon = MS_e = 3.68$$

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n} = \frac{85.98 - 3.68}{5} = 16.46$$

and:

$$\omega^2 = \frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha + \hat{\sigma}^2_\epsilon} = \frac{16.46}{16.46 + 3.68} = 0.82$$

Assuming that factor A is a fixed effect:

$$\hat{\sigma}^2_\epsilon = MS_e = 3.68$$

$$\hat{\sigma}^2_\alpha = \frac{MS_A - MS_e}{n}\left[\frac{df_A}{df_A + 1}\right] = \left(\frac{85.98 - 3.68}{5}\right)\left(\frac{3}{4}\right) = 12.35$$

and:

$$\omega^2 = \frac{\hat{\sigma}^2_\alpha}{\hat{\sigma}^2_\alpha + \hat{\sigma}^2_\epsilon} = \frac{12.35}{12.35 + 3.68} = 0.77$$
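
Those calculations are easy to script. Here is a small helper sketch (my own code, plugging in the values from the ANOVA table above) that reproduces both versions:

MS.A <- 85.98; MS.e <- 3.68; n <- 5; df.A <- 3
var.alpha.random <- (MS.A - MS.e) / n                      # 16.46
var.alpha.fixed <- var.alpha.random * (df.A / (df.A + 1))  # 12.35
c(omega.sq.random = var.alpha.random / (var.alpha.random + MS.e),  # 0.82
  omega.sq.fixed  = var.alpha.fixed  / (var.alpha.fixed  + MS.e))  # 0.77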

12.2.2.3 Effect-size Guidelines for η² and ω²

As is the case for most classical effect-size guidelines, Cohen (2013) provided interpretations for η²
and ω². In his book Statistical Power Analysis for the Behavioral Sciences – which is the source for
all of Cohen's effect-size guidelines – Cohen claims that η² and ω² are measures of the same thing
(which is sort of true but also sort of not) and thus should have the same guidelines for what effects are
small, medium, and large (which more-or-less works out because η² and ω² are usually so similar in
magnitude):

| η² or ω² | Interpretation |
|---|---|
| 0.01 → 0.06 | Small |
| 0.06 → 0.14 | Medium |
| 0.14 → 1 | Large |

Thus, the η² value and the ω² values assuming either random or fixed effects for the sample data all fall
comfortably into the large effect range.⁸

12.2.3 Post hoc tests

Roughly translated, post hoc is Latin for after this. Post hoc tests are conducted after a significant
effect of a factor has been established to learn more about the nature of an effect, i.e.: which levels of
the factor are associated with higher or lower scores?

Post hoc tests can help tell the story of data beyond just a significant F -ratio in several ways. There is
a test that compares experimental conditions to a single control condition (the Dunnett common control
procedure). There are tests that compare the average(s) of one set of conditions to the average(s) of
another set of conditions (including orthogonal contrasts and Scheffé contrasts). There are also post
hoc tests that compare the differences between each possible pair of condition means (including Dunn-
Bonferroni tests, Tukey's HSD, and the Hayter-Fisher test). The choice of post hoc test depends on the
story you are trying to tell with the data.
12.2.3.1 Pairwise Comparisons

12.2.3.1.1 Dunn-Bonferroni Tests

The use of Dunn-Bonferroni tests, also known as applying the Dunn-Bonferroni correction, is a
catch-all term for any kind of statistical procedure where multiple hypotheses are conducted
simultaneously and the type-I error rate is reduced for each test of hypotheses in order to avoid
inflating the overall type-I error rate. For example, a researcher testing eight different simple
regression models using eight independent variables to separately predict the dependent variable – that
is, running eight different regression commands – might compare the p-values associated with each of the
eight models to an α that is 1/8th the size of the overall desired α-rate (otherwise, she would risk
committing eight times the type-I errors she would with a single model).

In the specific context of post hoc tests, the Dunn-Bonferroni test is an independent-samples t-test for
each pair of condition means. The only difference is that one takes the α-rate of one's preference – say,
α = 0.05, and divides it by the number of possible multiple comparisons that can be performed on the

condition means:
$$\alpha_{Dunn\text{-}Bonferroni} = \frac{\alpha_{overall}}{\#\text{ of pairwise comparisons}}$$

(it may be tempting to divide the overall α by the number of comparisons performed instead of the
number possible, but then one could just run one comparison with regular α and thus defeat the
purpose)

To use the example data from above, let's imagine that we are interested in applying the Dunn-
Bonferroni correction to a post hoc comparison of the means of Condition 1 and Condition 2. To do so,
we simply follow the same procedure as for the independent-samples t-test and adjust the α-rate.
Assuming we start with α = 0.05: since there are 4 groups in the data, there are $_4C_2 = 6$ possible
pairwise comparisons, so the Dunn-Bonferroni-adjusted α-rate is 0.05/6 = 0.0083.

t.test(Condition1, Condition2, paired=FALSE)

##
## Welch Two Sample t-test
##
## data: Condition1 and Condition2
## t = -6.2688, df = 7.1613, p-value = 0.0003798
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -9.204768 -4.179232
## sample estimates:
## mean of x mean of y
## -1.042 5.650

The observed p-value of 0.0003798 < 0.0083, so we can say that there is a significant difference
between the groups after applying the Dunn-Bonferroni correction.

Please also note that if you need to for any reason, you could also multiply the observed p-value by the
number of possible pairwise comparisons and compare the result to the original α. It works out to be
the same thing, a fact that is exploited by the base R command pairwise.t.test(), in which you can
use the p.adjust="bonferroni" option⁹ to conduct pairwise Dunn-Bonferroni tests on all of the conditions
in a set simultaneously:
DV<-c(Condition1, Condition2, Condition3, Condition4)
Factor.A<-c(rep("Condition 1", length(Condition1)),
rep("Condition 2", length(Condition2)),
rep("Condition 3", length(Condition3)),
rep("Condition 4", length(Condition4)))

pairwise.t.test(DV, Factor.A, p.adjust="bonferroni", pool.sd = FALSE, paired=FALSE)

##
## Pairwise comparisons using t tests with non-pooled SD
##
## data: DV and Factor.A
##
## Condition 1 Condition 2 Condition 3
## Condition 2 0.0023 - -
## Condition 3 0.0276 1.0000 -
## Condition 4 2.3e-05 0.1125 0.1222
##
## P value adjustment method: bonferroni

12.2.3.1.2 Tukey’s HSD

Tukey's Honestly Significant Difference (HSD) Test is a way of determining whether the difference
between two condition means is (honestly) significant. The test statistic for Tukey's HSD test is the
studentized range statistic q, where for two means $y_{\bullet 1}$ and $y_{\bullet 2}$:

$$q = \frac{y_{\bullet 1} - y_{\bullet 2}}{\sqrt{\frac{MS_e}{n}}}$$

(In this chapter, we are dealing exclusively with classical ANOVA models with equal n per group.
The q statistic for the HSD test can also handle cases of unequal n for post hoc tests of results from
generalized linear models by using the denominator:

$$\sqrt{\frac{MS_e}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

which is known as the Kramer correction. If using the Kramer correction, the more appropriate name
for the test is not the Tukey test but rather the Tukey-Kramer procedure.)

The critical value of q depends on the number of conditions and on $df_e$: those values can be found in
tables such as this one.

For any given pair of means, the observed value of q is given by the difference between those means
divided by $\sqrt{MS_e/n}$, where $MS_e$ comes from the ANOVA table and n is the number in each group:
if the observed q exceeds the critical q, then the difference between those means is honestly
significantly different.

When performing the Tukey HSD test by hand, it also can be helpful to start with the critical q and
calculate a critical HSD value such that any observed difference between means that exceeds the
critical HSD is itself honestly significant:

$$HSD_{crit} = q_{crit}\sqrt{\frac{MS_e}{n}}$$

But, it is unlikely that you will be calculating HSDs by hand. The base R command TukeyHSD() wraps
an aov() object – in the same way that we can wrap summary() around an aov() object to get a full-
ish ANOVA table – and returns results about the differences between means with confidence intervals
and p-values:

TukeyHSD(aov(DV~Factor.A,
data=one.way.between.groups.example.df))

## Tukey multiple comparisons of means


## 95% family-wise confidence level
##
## Fit: aov(formula = DV ~ Factor.A, data = one.way.between.groups.example.df)
##
## $Factor.A
## diff lwr upr p adj
## Condition 2-Condition 1 6.692 3.2225407 10.161459 0.0002473
## Condition 3-Condition 1 5.810 2.3405407 9.279459 0.0010334
## Condition 4-Condition 1 9.958 6.4885407 13.427459 0.0000022
## Condition 3-Condition 2 -0.882 -4.3514593 2.587459 0.8847332
## Condition 4-Condition 2 3.266 -0.2034593 6.735459 0.0687136
## Condition 4-Condition 3 4.148 0.6785407 7.617459 0.0165968

According to the results here, there are honestly significant differences (assuming α = 0.05) between
Condition 2 and Condition 1, between Condition 3 and Condition 1, between Condition 4 and
Condition 1, and between Condition 4 and Condition 3. All other differences between level means are
not honestly significant.

12.2.3.1.3 Hayter-Fisher Test

The Hayter-Fisher test is identical to the Tukey HSD, but with a smaller critical q value, which makes
this test more powerful than the Tukey HSD test. The critical value of q used for the Hayter-Fisher test
is the value listed in tables for k − 1 groups given the same $df_e$ (in the table linked above, it is the
critical q listed one column immediately to the left of the value one would use for Tukey's HSD).

The condition that comes with being able to use a smaller critical q value is that the Hayter-Fisher test
can only be used following a significant F test. However, since you shouldn't be reporting results of post
hoc tests following non-significant F-tests anyway, that's not really an issue. The issue with the
Hayter-Fisher test is that fewer people have heard about it: probably not a dealbreaker, but something
you might have to write in a response to Reviewer 2 if you do use it.
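
Rather than squinting at a paper table, you can get both critical values from base R's qtukey(). A small sketch (mine) for the example data's 4 groups and $df_e = 16$:

qtukey(0.95, nmeans = 4, df = 16)  # Tukey HSD critical q (k = 4): about 4.05
qtukey(0.95, nmeans = 3, df = 16)  # Hayter-Fisher critical q (k - 1 = 3): about 3.65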
12.2.3.2 Contrasts

A contrast is a comparison of multiple elements. In post hoc testing, just as a pairwise comparison
will return a test statistic like q in the Tukey HSD test or the Dunn-Bonferroni-corrected t, a contrast is
a single statistic – which can be tested for statistical significance – that helps us evaluate a comparison
that we have constructed. For example, please imagine that we have six experimental conditions and
we are interested in comparing the average of the means of the first three with the average of the means
of the second three. The resulting contrast would be a value equal to the differences of the averages
defined by the contrast. The two methods of contrasting discussed here – orthogonal contrasts and
Scheffé contrasts – differ mainly in how the statistical significance of the contrast values are
evaluated.

12.2.3.2.1 Orthogonal Contrasts

As noted above, orthogonality is the condition of being uncorrelated. Take for example the x- and y-
axes in the Cartesian Coordinate System: the two axes are at right angles to each other, and the x value
of a point in the plane tells us nothing about the y value of the point and vice versa. We need both
values to accurately locate a point. The earlier mention of orthogonality in this chapter referred to the
orthogonality of between-groups variance and within-groups variance: the sum of the two is precisely
equal to the total variance observed in the dependent variable, indicating that the two sources of
variance do not overlap, and each explain separate variability in the data.

Orthogonal contrasts employ this principle by making comparisons between condition means that
don't compare the same things more than once. Pairwise comparisons like the ones made in Tukey's
HSD test, the Hayter-Fisher test, and the Dunn-Bonferroni test are not orthogonal because the same
condition means appear in more than one of the comparisons: if there are three condition means in a
dataset and thus three possible pairwise comparisons, then each condition mean will appear in two of
the comparisons. (For this example, if we call the three condition means $C_1$, $C_2$, and $C_3$, the
three possible pairwise comparisons are:

1. $C_1$ vs. $C_2$
2. $C_1$ vs. $C_3$
3. $C_2$ vs. $C_3$

$C_1$ appears in comparisons (1) and (2), $C_2$ appears in comparisons (1) and (3), and $C_3$ appears in
comparisons (2) and (3).)
By contrast (ha, ha), orthogonal contrasts are constructed in such a way that each condition mean is
represented once and in a balanced way. This is a cleaner way of making comparisons and is a more
powerful way of making comparisons because it does not invite the extra statistical noise associated
with accounting for within-groups variance appearing multiple times.

To construct orthogonal contrasts, we choose positive and negative coefficients that are going to be
multiplied by the condition means. We multiply each condition mean by its respective coefficient and
take the sum of all of those terms to get the value of a contrast Ψ.

For a set of contrasts to be orthogonal, the sum across condition means of the products of the
coefficients must be 0 for each pair of contrasts.

For example, in a 4-group design like the one in the example data, the coefficients for one contrast $\Psi_1$
could be:

$$c_{1j} = \left[\frac{1}{2}, \frac{1}{2}, -\frac{1}{2}, -\frac{1}{2}\right]$$

That set of coefficients represents the average of the first two condition means (the sum of the first two
condition means divided by two) minus the average of the second two condition means (the sum of the
second two condition means divided by two, times negative 1).

A contrast $\Psi_2$ that is orthogonal to $\Psi_1$ would be:

$$c_{2j} = [0, 0, 1, -1]$$

which would represent the difference between the condition 3 mean and the condition 4 mean, ignoring
the means of conditions 1 and 2. $\Psi_1$ and $\Psi_2$ are orthogonal to each other because:

$$\left(\frac{1}{2}\right)(0) + \left(\frac{1}{2}\right)(0) + \left(-\frac{1}{2}\right)(1) + \left(-\frac{1}{2}\right)(-1) = 0$$

A third possible contrast $\Psi_3$ would have the coefficients:

$$c_{3j} = [1, -1, 0, 0]$$

which would represent the difference between the condition 1 mean and the condition 2 mean, ignoring
the means of conditions 3 and 4. $\Psi_1$ and $\Psi_3$ are orthogonal to each other because:

$$\left(\frac{1}{2}\right)(1) + \left(\frac{1}{2}\right)(-1) + \left(-\frac{1}{2}\right)(0) + \left(-\frac{1}{2}\right)(0) = 0$$

and $\Psi_2$ is orthogonal to $\Psi_3$ because:

$$(0)(1) + (0)(-1) + (1)(0) + (-1)(0) = 0$$

Thus, $\{\Psi_1, \Psi_2, \Psi_3\}$ is a set of orthogonal contrasts.
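
Checking orthogonality is just a matter of summing products of coefficients, as in this tiny sketch (mine):

c1 <- c(1/2, 1/2, -1/2, -1/2)
c2 <- c(0, 0, 1, -1)
c3 <- c(1, -1, 0, 0)
c(sum(c1 * c2), sum(c1 * c3), sum(c2 * c3))  # all 0: the set is orthogonal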

To calculate $\Psi_1$ using the example data:

$$\Psi_1 = \sum_j c_{1j} y_{\bullet j} = \left(\frac{1}{2}\right)(-1.04) + \left(\frac{1}{2}\right)(5.65) + \left(-\frac{1}{2}\right)(4.77) + \left(-\frac{1}{2}\right)(8.92) = -4.54$$

The significance of Ψ is tested using an F-statistic. The numerator of this F-statistic is Ψ²: contrasts
always have one degree of freedom, the sums of squares is the square of the contrast, and the mean
squares (which is required for an F-ratio numerator) is equal to the sums of squares whenever df = 1.
The denominator of the F-statistic for testing Ψ has df equal to $df_e$ from the ANOVA table, and its
value is given by $MS_e \sum \frac{c_j^2}{n_j}$, where $MS_e$ comes from the ANOVA table, $c_j$ is the contrast coefficient for
condition mean j, and $n_j$ is the number of observations in condition j (for a classic ANOVA model, n
is the same for all groups, so the subscript j isn't all that relevant):

$$F_{obs} = \frac{\Psi^2}{MS_e \sum \frac{c_j^2}{n_j}}$$

For $\Psi_1$ calculated above, the observed F-statistic therefore would be:

$$F_{obs} = \frac{\Psi_1^2}{3.68\left[\frac{(1/2)^2}{5} + \frac{(1/2)^2}{5} + \frac{(-1/2)^2}{5} + \frac{(-1/2)^2}{5}\right]} = \frac{(-4.54)^2}{0.74} = \frac{20.61}{0.74} = 27.85$$

The p-value for an F-ratio of 27.85 with $df_{num} = 1$ and $df_{denom} = 16$ is:

pf(27.85, df1=1, df2=16, lower.tail=FALSE)

## [1] 7.514161e-05

Unless our α-rate is something absurdly small, we can say that $\Psi_1$ – the contrast of the average of the
first two condition means with the average of the last two condition means – is statistically significant.
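
The same contrast arithmetic is straightforward to script. A small sketch (my own code, plugging in the condition means and ANOVA-table values from above):

cond.means <- c(-1.042, 5.650, 4.768, 8.916)
c1 <- c(1/2, 1/2, -1/2, -1/2)
MS.e <- 3.68; n <- 5; df.e <- 16
psi1 <- sum(c1 * cond.means)                         # -4.54
F.psi1 <- psi1^2 / (MS.e * sum(c1^2 / n))            # about 28
pf(F.psi1, df1 = 1, df2 = df.e, lower.tail = FALSE)  # about 7e-05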

As for the other contrasts in our orthogonal set:

$$\Psi_2 = \sum_j c_{2j} y_{\bullet j} = 0(-1.04) + 0(5.65) + (1)(4.77) + (-1)(8.92) = -4.15$$

$$F_{obs} = \frac{\Psi_2^2}{3.68\left[\frac{0^2}{5} + \frac{0^2}{5} + \frac{1^2}{5} + \frac{(-1)^2}{5}\right]} = \frac{(-4.15)^2}{1.47} = \frac{17.22}{1.47} = 11.71, \quad p = 0.003$$

$$\Psi_3 = \sum_j c_{3j} y_{\bullet j} = 1(-1.04) + (-1)(5.65) + 0(4.77) + 0(8.92) = -6.69$$

$$F_{obs} = \frac{\Psi_3^2}{3.68\left[\frac{1^2}{5} + \frac{(-1)^2}{5} + \frac{0^2}{5} + \frac{0^2}{5}\right]} = \frac{(-6.69)^2}{1.47} = \frac{44.76}{1.47} = 30.45, \quad p = 0.00005$$

12.2.3.3 Scheffé Contrasts

When using Scheffé contrasts, one does not have to be concerned with the orthogonality of any given
contrast with any other given contrast. That makes the Scheffé contrast ideal for making complex
comparisons (e.g., for a factor with 20 levels, one could easily calculate a Scheffé contrast comparing
the average of the means of the third, fourth, and seventh levels to the average of levels ten
through 18), and/or for making many contrast comparisons. The drawback of the Scheffé contrast is that
it is by design less powerful than orthogonal contrasts. That is the price of freedom (in this specific
context). Although power is sacrificed, it is true that if the F test for a factor is significant, then at least
one Scheffé contrast constructed from the level means of that factor will be significant.

The Ψ value for any given Scheffé contrast is found in the same way as for orthogonal contrasts, except
that any coefficients can be chosen without concern for the coefficients of other contrasts. Say we were
interested, for our sample data, in comparing the mean of condition 2 with the average of the other three
condition means. The Scheffé Ψ would therefore be:

$$\Psi_{Scheffé} = \frac{1}{3}(y_{\bullet 1}) - 1(y_{\bullet 2}) + \frac{1}{3}(y_{\bullet 3}) + \frac{1}{3}(y_{\bullet 4}) = \frac{-1.042}{3} - 5.65 + \frac{4.768}{3} + \frac{8.916}{3} = -1.44$$

The F-statistic formula for the Scheffé contrast is likewise the same as the F-statistic formula for orthogonal
contrasts:

$$F_{Scheffé} = \frac{\Psi^2}{MS_e \sum \frac{c_j^2}{n_j}}$$

For our Scheffé example:

$$F_{Scheffé} = \frac{(-1.44)^2}{3.68\left[\frac{(1/3)^2}{5} + \frac{(-1)^2}{5} + \frac{(1/3)^2}{5} + \frac{(1/3)^2}{5}\right]} = \frac{2.07}{0.98} = 2.11$$

The more stringent significance criterion for the Scheffé contrast is given by multiplying the critical F-statistic
that would be used for a similar orthogonal contrast – the F with $df_{num} = 1$ and $df_{denom} = df_e$ – by
the number of conditions minus one:

$$F_{crit}^{Scheffé} = (j - 1)F_{crit}(df_{num} = 1, df_{denom} = df_e)$$

So, to test our example $F_{Scheffé}$ of 2.11, we compare it to a critical value equal to (j − 1) times the F
that puts α in the upper tail of the F-distribution with $df_{num} = 1$ and $df_{denom} = df_e$:

alpha=0.05

(4-1)*qf(alpha, df1 = 1, df2 = 16, lower.tail=FALSE)

## [1] 13.482

The observed $F_{Scheffé}$ of 2.11 is less than the critical value of 13.482, so this particular Scheffé contrast is not
significant at the α = 0.05 level.

12.2.3.4 Dunnett Common Control Procedure


The last post hoc test we will discuss in this chapter¹⁰ is one that is useful for only one particular data
structure, but that one data structure is commonly used: it is the Dunnett Common Control Procedure,
used when there is a single control group in the data. The Dunnett test calculates a modified t-statistic
based on the differences between condition means and the $MS_e$ from the ANOVA. Given the control
condition and any one of the experimental conditions:

$$t_{Dunnett} = \frac{y_{\bullet control} - y_{\bullet experimental}}{\sqrt{\frac{2MS_e}{n}}}$$

Critical values of the Dunnett test can be found in tables. There is also an easy-to-use DunnettTest()
command in the DescTools package. To produce the results of the Dunnett Test, enter the data array for
each factor level as a list(): the DunnettTest() command will take the first array in the list to be
the control condition. For example, if we treat Condition 1 of our sample data as the control condition:

library(DescTools)

DunnettTest(list(Condition1, Condition2, Condition3, Condition4))

##
## Dunnett's test for comparing several treatments with a control :
## 95% family-wise confidence level
##
## $`1`
## diff lwr.ci upr.ci pval
## 2-1 6.692 3.548031 9.835969 0.00015 ***
## 3-1 5.810 2.666031 8.953969 0.00057 ***
## 4-1 9.958 6.814031 13.101969 9.6e-07 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

12.2.4 Power Analysis for the One-Way Between-Groups ANOVA

The basic concept behind power analysis for ANOVA is the same as it was for classical t-tests: how
many observations does an experiment require for us to expect to reject the null hypothesis at the
long-term rate defined by our desired power level? For the t-test, that calculation was relatively
simple: given a t-distribution representing sample means drawn from a parent distribution as defined
by the null hypothesis and a measure of alternative-hypothesis expectations, one can calculate the
proportion of expected sample means that would fall in the rejection region of that null-defined t-
distribution. With ANOVA, however, there are more parent distributions reflecting the different levels
of each factor, and those parent distributions may be related to each other in complex ways
(particularly in repeated-measures and factorial ANOVA designs).

There are multiple software packages designed to help with the potentially complex task of power
analysis for ANOVA designs. For the between-groups one-way ANOVA, we can use the base R
command power.anova.test(). The power.anova.test() command takes as its options the number
of groups, the within variance (given by $MS_e$), the between variance (given by $\sigma^2_\alpha$: the variance of the
group means), n, α, and the desired power. To run the power analysis, leave exactly one of those
options out and power.anova.test() will estimate the missing value. For example, imagine that we
wanted to run a study that produces results similar to what we see in the example data. There are 4
groups, α = 0.05, and we desire a power of 0.9. We can estimate the within-groups variance and
between-groups variance for the next study based on a rough estimate of what we would expect (we
could use the precise values from our example data, but we don't want to get too specific because that's
just one study: to be conservative, we can overestimate the within-groups variance and underestimate
the between-groups variance), and the output will recommend our n per group:

power.anova.test(groups=4, between.var=8, within.var = 8, sig.level = 0.05, power=0.9)

##
## Balanced one-way analysis of variance power calculation
##
## groups = 4
## n = 5.809905
## between.var = 8
## within.var = 8
## sig.level = 0.05
## power = 0.9
##
## NOTE: n is number in each group

If we expect our between-groups variance to be 8 (about half of what we observed in our sample data)
and our within-groups variance to be about 8 as well (about twice what we observed in our sample
data), R tells us that we would require 5.81 participants per group. However, we are smarter than R, and
we know that we can only recruit whole participants – we always round up the results for n from a
power analysis to the next whole number – so we should plan for n = 6 for power to equal 0.9.

There are R packages and other software platforms that can handle more complex ANOVA models.
However, if you can't find one that suits your particular needs, then power analysis can be
accomplished in an ad hoc manner by simulating parent distributions (using commands like rnorm())
that are related to each other in the ways one expects – in terms of things like the distance between the
distributions of levels of between-groups factors and/or correlations between distributions of levels of
repeated-measures factors – and then deriving n based on repeated sampling from those synthetic
datasets.
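
Here is a sketch of that ad hoc simulation approach (my own code; the hypothetical means and SD are chosen so that the between- and within-groups variances are roughly the 8 and 8 used above):

set.seed(1)
sim.power <- function(n, means = c(0, 2.2, 4.4, 6.6), sd = sqrt(8),
                      alpha = 0.05, reps = 2000) {
  hits <- replicate(reps, {
    dv <- rnorm(n * length(means), mean = rep(means, each = n), sd = sd)
    g <- factor(rep(seq_along(means), each = n))
    anova(lm(dv ~ g))["g", "Pr(>F)"] < alpha
  })
  mean(hits)  # proportion of rejections = estimated power
}
sim.power(n = 6)  # should land near the 0.9 target from power.anova.test() above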

12.3 Nonparametric Differences Between 3 or More Things


12.3.1 The χ Test of Statistical Independence with k ≥ 3/Extension of the
2

Median Test

The χ² test of statistical independence was covered in depth in the chapter on differences between two
things for k = 2 and 2 × 2 contingency tables. The exact same procedure is followed for k > 2 and for
contingency tables with more than 2 rows and/or columns. The one thing to keep in mind is that the df
for the χ² test is equal to df_rows × df_columns (in all of the examples from the differences-between-two-
things chapter, df = 1 because there were always k = 2 rows and/or columns). For example, the 2 × 4
table below has df = (2 − 1)(4 − 1) = 3.

Everything said above about the χ² test applies to the extension of the median test. It’s exactly the same
procedure, but with more than k = 2 groups.

We can’t use the sample data for the one-way between-groups ANOVA model to demonstrate these
tests because the sample is too small to ensure that f_e ≥ 5 for each cell. So, here is another group of
sample data:

Condition 1   Condition 2   Condition 3   Condition 4
      -5.41         -5.55          3.79          0.33
      -5.96          0.51          3.07          3.18
      -3.51         -2.10          0.94          4.05
      -9.54         -0.65          0.47          5.82
      -2.61          0.13          2.55          7.27
      -6.13          2.02          4.49          4.12
     -10.14          0.32          3.67          7.70
      -8.87          1.24          0.75          4.16
      -6.87         -2.57          1.59          6.84
      -7.00         -1.85          1.54          6.92

Overall median = 0.63

The median of all of the observed data is 0.63. We can construct a contingency table with Conditions
as the columns and each observed value’s status relative to the median (greater than or equal to vs.
less than) as the rows, populating the cells of the table with the counts of values that fall into each
cross-tabulated category (an R sketch of this step follows the table):

          Condition 1   Condition 2   Condition 3   Condition 4   totals
≥ 0.63         0             2             9             9          20
< 0.63        10             8             1             1          20
totals        10            10            10            10          40
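
A minimal sketch of building that contingency table in R, assuming the data are stored in vectors
named Condition1 through Condition4 matching the columns above:

# Combine the data, flag each value relative to the overall median, and cross-tabulate
scores <- c(Condition1, Condition2, Condition3, Condition4)
groups <- rep(c("Condition 1", "Condition 2", "Condition 3", "Condition 4"), each = 10)
status <- ifelse(scores >= median(scores), ">= median", "< median")
table(status, groups)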

We can then perform a χ² test on the contingency table for the extension of the median test:

chisq.test(matrix(c(0, 10, 2, 8, 9, 1, 9, 1), ncol=4))

##
## Pearson's Chi-squared test
##
## data: matrix(c(0, 10, 2, 8, 9, 1, 9, 1), ncol = 4)
## X-squared = 26.4, df = 3, p-value = 7.864e-06

The observed p-value is 7.9 × 10⁻⁶, and so is likely smaller than whatever α-rate we might choose.

Therefore, we can conclude that there is a significant difference between the conditions.

12.3.2 The Kruskal-Wallis One-Way Analysis of Variance

The closest nonparametric analogue to the classical one-way between-groups ANOVA is the Kruskal-
Wallis test. The Kruskal-Wallis test relies wholly on the ranks of the data, determining whether there is
a difference between groups based on the distribution of the ranks in each group.

We can use the example data from the one-way between-groups ANOVA to demonstrate the Kruskal-
Wallis test. The first step is to assign an overall rank (independent of group membership) to each
observed value of the dependent variable from smallest to largest, with each set of tied values
receiving the average of the ranks they would otherwise occupy (see the R sketch following the table):

Condition 1  ranks   Condition 2  ranks   Condition 3  ranks   Condition 4  ranks
      -3.1       1          7.28     14          0.12      5          8.18     17
      0.18       6          3.06      7          5.51     10          9.05     19
     -0.72       3          4.74      8          5.72     11         11.21     20
      0.09       4          5.29      9          5.93     12          7.31     15
     -1.66       2          7.88     16          6.56     13          8.83     18
   R₁ = 16               R₂ = 54               R₃ = 51               R₄ = 89
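
The ranking step can be handled in R by the rank() function, which assigns tied values the average of
their ranks by default. A minimal sketch, assuming the condition vectors from the ANOVA example:

scores <- c(Condition1, Condition2, Condition3, Condition4)
overall.ranks <- rank(scores)                     # overall ranks, ties averaged
tapply(overall.ranks, rep(1:4, each = 5), sum)    # the rank sums R_j per condition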

The sum of the ranks for each condition j is called R_j, as noted in the table above. The test statistic H
for the Kruskal-Wallis test is given by:


$$H = \frac{\left[\frac{12}{N(N+1)}\sum_{j=1}^{k}\frac{R_j^2}{n_j}\right] - 3(N+1)}{1 - \frac{\sum_{i=1}^{T}\left(t_i^3 - t_i\right)}{N^3 - N}}$$

where N is the total number of observations, n_j is the number of observations in group j (the group
sizes don’t have to be equal for the Kruskal-Wallis test), T is the number of tie clusters, and t_i is the
number of values clustered in tie i: if there are no tied ranks, then the denominator is 1.¹¹

The observed H for these data is:

$$H = \frac{\frac{12}{20(21)}(2738.8) - 3(21)}{1} = 15.25$$
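
As a check on the hand calculation – 2738.8 is the sum of R_j²/n_j from the rank sums in the table
above – here is a quick sketch in R:

R.j <- c(16, 54, 51, 89)   # rank sums per condition
n.j <- rep(5, 4)
N <- 20
(12 / (N * (N + 1))) * sum(R.j^2 / n.j) - 3 * (N + 1)   # H; no ties, so denominator = 1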

Observed values of H for relatively small samples can be compared to ancient-looking tables of
critical values based on the n per group. If – as in this case – the values are not included in the
table, we can take advantage of the fact that H can be modeled by a χ² distribution with df = k − 1.

In our case, that means that the p-value for the observed H = 15.25 is:

pchisq(15.25, df=3, lower.tail=FALSE)

## [1] 0.001615003

which would be significant if we chose α = 0.05 or α = 0.01.

Of course there is, as usual, a better way afforded us through the miracle of modern computing:

## Arrange the data into long format and add Condition labels

data<-c(Condition1, Condition2, Condition3, Condition4)


Condition<-c(rep("Condition1", 5), rep("Condition2", 5), rep("Condition3", 5), rep("Condition4", 5))
one.way.between.example.long<-data.frame(data, Condition)

kruskal.test(data~Condition, data=one.way.between.example.long)

##
## Kruskal-Wallis rank sum test
##
## data: data by Condition
## Kruskal-Wallis chi-squared = 15.251, df = 3, p-value = 0.001614
12.4 Bayesian One-Way Between-Groups ANOVA
Bayesian ANOVA works via the same principles as the Bayesian t-tests. There is a null model that
posits that there is no variance between groups: in that model, the groups’ distributions are normal
and sit directly atop one another. In the alternative model, there is some non-zero variance produced
by the factor(s) of interest, and the prior probabilities of those variance estimates are distributed by a
multivariate Cauchy distribution (a generalization of the univariate Cauchy distribution that served as
the prior for the Bayesian t-tests).

As with the Bayesian t-tests, the BayesFactor package provides a command structure that is
maximally similar to the commands used for the classical ANOVA. For the sample data, we can
calculate an ANOVA and use the posterior() wrapper to estimate posterior condition means,
standard deviations, and quantile estimates:

# The Bayes Factor software is very sensitive about variable types, so we have to
# convert the grouping variable to a factor first:

one.way.between.groups.example.df$Factor.A<-as.factor(one.way.between.groups.example.df$Factor.A)

# Now we can run the Bayesian ANOVA:

anovaBF(DV~Factor.A, data=one.way.between.groups.example.df)

## Bayes factor analysis


## --------------
## [1] Factor.A : 3425.795 ±0.01%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

posterior.data<-posterior(anovaBF(DV~Factor.A, data=one.way.between.groups.example.df), iterations=5000)
summary(posterior.data)

##
## Iterations = 1:5000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 5000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 4.5728 0.4674 0.006609 0.006609
## Factor.A-Condition 1 -5.1719 0.8918 0.012612 0.016256
## Factor.A-Condition 2 0.9865 0.7878 0.011141 0.011452
## Factor.A-Condition 3 0.1824 0.7929 0.011213 0.011734
## Factor.A-Condition 4 4.0029 0.8777 0.012413 0.015091
## sig2 4.5923 2.2721 0.032132 0.055177
## g_Factor.A 5.7630 8.7128 0.123218 0.162367
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 3.6411 4.2688 4.5773 4.8821 5.477
## Factor.A-Condition 1 -6.7866 -5.7710 -5.2048 -4.6443 -3.283
## Factor.A-Condition 2 -0.6042 0.4818 0.9911 1.5021 2.514
## Factor.A-Condition 3 -1.3928 -0.3121 0.1922 0.7023 1.741
## Factor.A-Condition 4 2.2114 3.4612 4.0392 4.5868 5.634
## sig2 2.1423 3.1978 4.1298 5.3586 9.919
## g_Factor.A 0.5707 1.8616 3.3765 6.3762 25.685

A useful feature offered by BayesFactor is the option to change the priors for an analysis via the
rscaleFixed option. The default prior width is “medium;” for situations where we may expect a larger
difference between groups, choosing “wide” or “ultrawide” will give greater support to alternative
models – but likely won’t give wildly different Bayes factor results – when the actual effect size is
large. The options make essentially no difference for our sample data, but for purposes of reviewing
the syntax of the prior options:

anovaBF(DV~Factor.A, data=one.way.between.groups.example.df, rscaleFixed="wide")

## Bayes factor analysis


## --------------
## [1] Factor.A : 3425.795 ±0.01%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

anovaBF(DV~Factor.A, data=one.way.between.groups.example.df, rscaleFixed="ultrawide")

## Bayes factor analysis


## --------------
## [1] Factor.A : 3425.795 ±0.01%
##
## Against denominator:
## Intercept only
## ---
## Bayes factor type: BFlinearModel, JZS

1. There are ways to conduct multiple pairwise tests that manage the overall α-rate when more than
two groups are being studied at a time. One way is to reduce the α-rate for each test in order to
keep the overall α-rate where one wants it (e.g., α = 0.05): this approach is generally known as
applying the Bonferroni correction but should be known as applying the Dunn-Bonferroni
correction to honor Olive Jean Dunn as well as Carlo Bonferroni. For example, with three
pairwise tests and a desired overall α = 0.05, each test would be run at α = 0.05/3 ≈ 0.0167. The
other way is through the use of post-hoc pairwise tests like Tukey’s Honestly Significant
Difference (HSD) test and the Hayter-Fisher test, which are discussed below.↩

2. This pair of null and alternative hypotheses may appear odd, because usually when the null
hypothesis includes the = sign, the alternative hypothesis features the ≠ sign, and when the
alternative has the > sign, the null has the ≤ sign. Technically, either pair of signs – = / ≠ or
≤ / > – would be correct to use, it’s just that because variances can’t be negative, σ² ≤ 0 is
somewhat less meaningful than σ² = 0, and σ² ≠ 0 is somewhat less meaningful than σ² > 0.↩

3. The F stands for Fisher but it was named for him, not by him, for what it’s worth.↩

4. The choice of the letter α is to indicate that it is the first (α being the Greek counterpart to A) of
possibly multiple factors: when we encounter designs with more than one factor, the next factors
will be called β, γ, etc.↩

5. In the SAS software platform, a t-test can even be run through the ANOVA procedure, which
returns an equivalent result.↩

6. To my knowledge, there are only three such effect-size statistics. The third, ϵ², is super-obscure:
possibly so obscure as to be for ANOVA-hipsters only. The good news is that estimates of ϵ² tend
to fall between estimates of η² and estimates of ω² for the same data, so it can safely be
ignored.↩

7. Population-level error may seem an odd term because we so frequently talk of error as coming
from sampling error – that is the whole basis for the standard error statistic. However, even on
the population level, there is variance that is not explained by the factor(s) that serve(s) as the
independent variable(s): the naturally-occurring width of the population-level distribution of a
variable.↩

8. Honestly, the effect-size statistics for the example data are absurdly large. There should be
another category above large called stats-class-example large. Or something like that.↩

9. Base R is not sufficiently respectful of the contributions of Olive Jean Dunn.↩

10. There are more post hoc tests, but the ones listed here (a) all control the α-rate at the overall
specified level and (b) are plenty.↩

11. The 12 is always just 12 and the 3 is always just 3, in case you were wondering.↩
