Professional Documents
Culture Documents
Chapter 1
Introduction
Stats 250 is a course that will develop your statistical thinking abilities. Regardless of your future path
with statistics (practitioner or consumer), it is important to know how data can be used to answer
research questions. What are the best ways to collect the data needed to answer a research question?
How can we summarize and display data in ways that allow the data to tell its story? What conclusions
can be made from the data?
As data science continues to grow, we now see data sets that are not collected to answer a research
question. In cases like these, the statistical investigation process may look something like the following:
1. Wrangle or import the data
2. Tidy the data2
3. Explore the data, providing graphical displays and numerical summaries
4. Use statistical analysis methods to draw inferences from the data
5. Formulate conclusions, communicate the results, and answer the research question
6. Reflect and look forward (point out limitations and suggest further studies)
Regardless of how the data were collected, the ideas and techniques that you learn in Stats 250 will
start you on your journey of statistical thinking. , It would be impossible to learn all statistical
techniques in one term, but we endeavor to teach you a few core data collection techniques, some
core data analysis ideas, and fundamental concepts of statistical inference.
1
Adapted from Tintle et al. (2021). Introduction to Statistical Investigations (2nd edition). John Wiley & Sons, Inc. and
Carnegie et al. Montana State Introductory Statistics with R.
2
“Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is
stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important
because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into
the right form for different functions.” (Wickham, H. and Grolemund, G. R for Data Science,
https://r4ds.had.co.nz/index.html)
Intro to Data, page 1
Section 1.1 Case Study – Dolphin Therapy
Swimming with dolphins can certainly be fun, but is it also therapeutic for patients suffering from
clinical depression? To investigate this possibility, researchers recruited 30 participants aged 18-65
with a clinical diagnosis of mild to moderate depression (Antonioli & Reveley, 2005) 3. The study
participants were required to discontinue use of any antidepressant drugs or psychotherapy four
weeks prior to the experiment and for the duration of the experiment. After we examine the data, we
will talk about the ethics of this study.
These 30 individuals went to an island off the coast of Honduras, where they were randomly assigned
to one of two treatment groups. Both groups engaged in the same amount of swimming and
snorkeling each day (the outdoor nature program), but one group did so in the presence of bottlenose
dolphins while the other (control) group did not.
Each person’s level of depression was evaluated at the beginning of the study and then again at the
end. In the dolphin therapy group, 10 of the 15 participants showed substantial improvement, while in
the control group, 3 of 15 participants showed substantial improvement.
The results from the study are summarized in the contingency table below:
Showed substantial Did not show substantial
Treatment Total
improvement improvement
Dolphin therapy 10 5 15
Control group 3 12 15
Total 13 17 30
b. What proportion of study participants in the control group showed substantial improvement?
3
Antonioli, C., & Reveley, M. (2005). “Randomized controlled trial of animal facilitated therapy with dolphins in the
treatment of depression.” British Medical Journal, 331(7527), 1231 – 1234.
Intro to Data, page 2
d. What concerns, if any, do you have with the study as it was described above?
yearBuil bedroom
t exterior livingArea bathFull bathHalf s fireplaces garageCars salePrice
1960 BrkFace 1656 1 0 3 2 2 215000
1961 VinylSd 896 1 0 2 0 1 105000
1958 WdSdng 1329 1 1 3 0 1 172000
1968 BrkFace 2110 2 1 3 2 2 244000
1997 VinylSd 1629 2 1 3 1 2 189900
1998 VinylSd 1604 2 1 3 1 2 195500
1999 VinylSd 1804 2 1 3 1 2 189000
1993 HdBoard 1655 2 1 3 1 2 175900
1992 HdBoard 1187 2 0 3 0 2 185000
1998 VinylSd 1465 2 1 3 1 2 180400
1990 HdBoard 1341 1 1 2 1 2 171500
2003 CemntBd 3279 3 1 4 1 3 538000
1988 WdSdng 1752 2 0 4 0 2 164000
1951 VinylSd 864 1 0 2 0 2 141000
1978 Plywood 2073 2 0 3 2 2 210000
2000 VinylSd 1674 2 1 3 0 2 216000
1970 WdSdng 1004 1 0 2 1 2 149000
1971 VinylSd 1078 1 1 3 1 2 149900
1968 VinylSd 1056 1 0 3 1 1 142000
1970 Plywood 882 1 0 2 0 2 126000
4
Data courtesy of Dean De Cock, retrieved from https://doi.org/10.1080/10691898.2011.11889627. The larger data set
contains 2930 observations.
Each row in the table represents one single-family home (sold between 2006 and 2010 in Ames, Iowa),
or case (or observational unit). The columns represent characteristics, called variables, for each of the
homes. The variables are explained below:
variable description
yearBuilt Original construction year
exteriorMateria Primary exterior covering on house ( Brkface indicates Brick Face,
l CemntBd indicates Cement Board, Hdboard indicates Hard Board,
Plywood, VinylSd indicates Vinyl Siding, WdSdng indicates Wood Siding)
livingArea Living area (square feet)
bathFull Number of full baths
bathHalf Number of half baths
bedrooms Number of bedrooms
fireplaces Number of fireplaces
garageCars Size of garage in car capacity
Intro to Data, page 4
salePrice Sales price (USD)
For example, the first row (repeated below) represents a single-family home built in 1960 that has
Brick Face as its exterior material, 1656 square feet of living space, 1 full bathroom, no half bathrooms,
3 bedrooms, 2 fireplaces, a 2-car garage, and sold for $215,000.
yearBuil bedroom
t exterior livingArea bathFull bathHalf s fireplaces garageCars salePrice
1960 BrkFace 1656 1 0 3 2 2 215000
The full table represents a data matrix, which is a common way to organize raw, unprocessed data.
Data matrices are a convenient way to record and store data. Another observation can easily be added
as a new row at the bottom of the matrix, or another column could be added to represent a new
variable recorded for each case.
Types of Variables
Examine the variables in the ames50 data set. What similarities do you see among the variables? What
differences do you see? All variables are either numerical or categorical.
Numerical variables (also called quantitative or measurement variables) take on a wide range
of numerical values, and it is sensible to do math (e.g., addition, subtraction, averaging) with
numerical variables.
Categorical variables (also called qualitative variables) place an individual or item into one of
several groups or categories, which are called levels.
Different types of variables provide different kinds of information. The variable type will guide what
kinds of summaries (graphs/numerical) are appropriate.
Numerical and categorical variables can be broken down further are shown in the following diagram
and as explained below.
Note: While it’s possible to turn numerical variables into categorical ones by creative grouping, it is not
possible to go the other direction and change categorical variables into numerical ones, even if
sometimes a categorical variable is coded using numbers. Responses to the question “What is your
marital status?” might be coded in a dataset as 1-Single, 2-Married/Partnered, 3-Separated/Divorced,
or 4-Widow, but the variable would still be a categorical variable.
yearBuilt bedrooms
exteriorMateria fireplaces
l
livingArea garageCars
bathFull salePrice
bathHalf
Each research question is about a population, the entire group we are interested in learning about. For
example, question 2 is about all undergraduate students at the University of Michigan. Typically, we
cannot answer these questions definitively because we would need to observe every case in the
population. This usually takes too long, costs too much, and—for some research questions—actually
destroys the item in the process of measurement (e.g., the breaking strength of a wire rope).
Instead of measuring every item in a population, we take a sample, a subset of the cases that is often a
small fraction of the overall population. For instance, we might speak to 20–30 (or some other number)
University of Michigan alumni and ask them how long it took them to complete their undergraduate
degrees; their responses could be used to provide an estimate of the average time to complete a
degree for the overall population of undergraduate students.
Each of these conclusions appears to be based on anecdotal evidence! Anecdotal evidence is typically
composed of unusual cases that are recalled based on their striking characteristics. Instead of using
these unusual cases to draw conclusions about a population, we should examine a sample of many
cases from this population and be cautious about making inferences.
5
Images on this page courtesy of Kari Lock Morgan
Intro to Data, page 8
Two Primary Forms of Data Collection
There are two primary types of data collection: observational studies and experiments. Observational
studies (covered in Section 1.4 of the textbook) refer to instances where researchers collect data in a
way that does not directly interfere with how the data arise. Experiments (covered in Section 1.5 of
the textbook), on the other hand, refer to instances in which researchers directly influence the process
by which data arise. Usually, this involves assigning study participants to one or more treatments. Soon
we will learn key distinguishing factors between observational studies and experiments.
The researchers suspected darkness during sleep influences vision later in life, so darkness can be
considered the explanatory variable of the study and vision can be considered the response variable.
Identifying explanatory & response variables
To identify the explanatory variable in a pair of variables, identify which of the two is suspected of
affecting the other.
Note: You may find in your field that the explanatory variable is called the independent variable (or the
predictor) and that the response variable is called the dependent variable (or outcome). In Stats 250,
we avoid using the terms “independent” and “dependent” so that the ideas don’t get confused with
the ideas that a pair of variables may be independent or dependent.
In the study above, researchers recorded whether each child selected to be in the study slept with or
without a night light. They returned to each child years later and found that those who slept with
night-lights were much more likely to have developed myopia (nearsightedness) and need glasses.
This is called an observational study. Generally, data in observational studies are collected only by
monitoring what occurs, while experiments require the primary explanatory variable in a study be
assigned to each subject by the researchers.
6
Kaneshi, Y. et al. Influence of light exposure at nighttime on sleep development and body growth of preterm
infants. Sci. Rep. 6, 21680; doi: 10.1038/srep21680 (2016).
Intro to Data, page 9
Making causal conclusions based on experiments is often reasonable, depending on how the
explanatory variable was assigned. However, making the same causal conclusions based on
observational data can be difficult, but not impossible. While causal inference is an active area of
statistical research, it will not be a focus of our course. Most observational studies we encounter in
Stats 250 are generally only sufficient to show associations.
Although there are many different types of observational studies, we focus here on a few main ways of
collecting data, along with their benefits and drawbacks, each of which has a good chance of creating a
representative sample.
Simple random sampling is the best way of ensuring that your sample is representative of the
population it is chosen from. Sampling this way allows us to generalize our results from the smaller
sample to the larger population. Note: It is possible for a SRS to not be representative of the
population, but a non-representative sample would be extremely rare. To quote Paul Velleman, “rare
things happen, but they don’t happen to me.”
The image to the right7 shows a simple random sample of 4
people selected at random. Here, individuals 2, 5, 8, and 10
were chosen. Note that, in a simple random sample, each of
the 12 people had the same chance of being chosen and
each sample of size 4 had the same chance of being the
chosen sample.
Example
Consider the cumulative grade point averages (GPAs) of
University of Michigan undergraduate students. To conduct
a simple random sample, we might write the cumulative
GPA for each of the approximately 30,000 undergraduate students at Michigan on a scrap of paper and
randomly jumble them in a bag. (We would need a huge bag!) Thereafter, we could blindly pull a
sample of n of these paper scraps out of the bag.
Stratified Sampling
In stratified sampling, a “divide-and-conquer” sampling strategy, the population is divided into
nonoverlapping groups called strata. The strata are chosen such that each group is similar with respect
7
Source: Dan Kernler, https://faculty.elgin.edu/dkernler/statistics/ch01/1-4.html
Intro to Data, page 10
to the outcome of interest. Thereafter, a simple random sample (SRS) is taken from each stratum. This
method works best when there is a lot of variability between each stratum, but not much variability
within each stratum.
The following images shows a stratified sample of 4 people. Here, the individuals are divided into 3
strata: one with the 3 blue people, one with the 6 red people, and the other with the 3 green people.
Then, we take a random sample of 1/3 of each stratum—we randomly sample 1 blue person, 2 red
people, and 1 green person. We still have 4 individuals in our sample and we have made sure to have
representation from each stratum.
Example
Instead of randomly sampling from our population of University of Michigan undergraduate students
indiscriminately, we could categorize (or stratify) the students according to their class rank.
Alternatively, we could stratify the students according to their college (e.g., LSA, Art and Design,
Engineering, Kinesiology). Ideally, we’d like to choose a way of stratifying the population such that all
the observations in each stratum are similar with respect to the outcome of interest. Do you think it
would be smarter in to stratify by class rank or by college in our GPA example?
Convenience Samples
A convenience sample refers to samples that are obtained by measuring whatever or whoever is
available to be measured. While it’s possible to get interesting information from a convenience sample,
they are rarely (if ever) representative of a larger population.
Example
A psychology professor wants to test a new behavioral theory so they tell their large class of 300
students that if they participate in an experiment, they will get extra credit. This sample would only be
able to tell the professor how college students who want extra credit behave, and would not
necessarily be applicable to other groups of people.
Example
You want to find out how often high school students text or email while driving. You take a sample of
high school students at local Pioneer High School and find that 35% of those sampled say that they
have texted or emailed while driving. Do you think that this sample is representative of the population
of all high school students who drive? Why or why not?
Selection bias occurs if the method for selecting the participants produces a sample that does not
represent the population of interest.
Nonresponse bias occurs when a representative sample is chosen for a survey, but a subset cannot be
contacted or does not respond.
Response bias occurs when participants respond differently from how they truly feel. The way
questions are worded, the way the interviewer behaves, as well as many other factors might lead an
individual to provide false information.
Example
In order to find out how Ann Arbor residents feel about animal cruelty, volunteers at the Humane Society
of Huron Valley put together a list of all donors to the shelter. They then called a random sample of people
from the list. The results might suffer from what type of bias?
Sampling bias is a type of selection bias that occurs if the method for selecting the participants causes
some individuals in the population to be more or less likely to be included in the sample than others.
Example
Here’s part of the survey methodology section of a recent Gallup poll: “results for this Gallup poll are
based on telephone interviews conducted July 30-Aug. 12, 2020, with a random sample of 1,031
adults, aged 18 and older, living in all 50 U.S. states and the District of Columbia… Landline and cellular
8
Bock, Velleman, De Veaux, and Bullard, Stats 5e (Boston, MA: Pearson Education, Inc., 2019), p. 279.
Intro to Data, page 12
telephone numbers are selected using random-digit-dial methods.” 9 These methods are much better
than they were in the past when telephone interviews were only conducted with people who had
landline phones (and only in the contiguous United States). Think about populations that will be missed
in telephone surveys…
It may be tempting to conclude that sleeping with the lights causes myopia. In fact, it turns out
teenagers with myopia are quite likely to have myopic parents, who in turn were more likely to leave a
light on so that they could see when tending to their infant children throughout the night.
In this case, whether the parents have myopia is an example of a confounding variable.
Confounding variables are variables that are associated with both the explanatory and response
variables.
Confounding variables get in the way of being able to make causal conclusions about the relationship
between explanatory and response variables. In the above example, we can’t say night light use causes
myopia because there’s an alternative explanation for that association: parental myopia.
While one method to justify making causal conclusions from observational studies is to exhaust the
search for confounding variables, there is no guarantee that all confounding variables can be examined
or measured. This is the main reason why it is so difficult to demonstrate a causal relationship with an
observational study. Where feasible, researchers will do an experiment to overcome this obstacle.
Example: Can You Find the Confounding Variable?
Each of the examples below describes an observational study that [incorrectly] attempts to
demonstrate a causal relationship between an explanatory and response variable. First, identify the
explanatory and response variables, and then come up with at least one possible confounding variable.
a. An observational study tracked sunscreen use and melanoma (skin cancer) diagnoses, and it
later found that increases in sunscreen use led to an increased risk of skin cancer.
Explanatory:
Response:
Confounding:
9
https://news.gallup.com/poll/317567/public-reengages-election-early-pandemic-dip.aspx
Intro to Data, page 13
b. A study examined how external clues influence student performance. Undergraduate students
were randomly assigned to one of four different forms for their midterm exam. Form 1 was
printed on blue paper and contained difficult questions, while Form 2 was also printed on blue
paper but contained simple questions. Form 3 was printed on red paper, with difficult
questions, and Form 4 was printed on red paper with simple questions. The researchers were
interested in the impact that color and type of question had on exam score (out of 100 points).
Suppose we learned that the students in the “blue paper” group performed better on average
over those in the “red paper” group, but that the “blue paper” group were mostly upper-
classmen and the “red paper” group were mostly first- and second-year students.
Explanatory:
Response:
Confounding:
Why Conduct Observational Studies if It’s Hard to Prove Causal Relationships with Them?
In both of the cases above, we can see clear associations between the variables being studied, but we
can’t claim these relationships are causal. It’s worth thinking of causal relationships as describing an
asymmetrical (one-way) relationship between two variables, and associative relationships as describing
symmetrical (two-way) relationships between variables.
For instance, consider two variables of interest to public officials: the percentage of the population that
is homeless and the crime rate. We could claim these variables share an association and point out they
tend to be high in the same places and low in the same places. However, you now know that such an
association does not necessarily mean one causes the other. Furthermore, it’s not clear which causes
which! To claim that homelessness causes crime or that crime causes homelessness would be very
different claims to justify.10 A confounding variable, such as the unemployment rate, could be at play.
This is not to say that it is impossible to draw causal conclusions from observational data. Some major
milestones in medicine, for example, have been achieved thanks to observational studies! Often, it’s
not feasible or ethical to conduct randomized experiments to make causal conclusions, but we know
10
It’s also important to think deeply about the consequences of an analysis which would prove that homelessness causes
crime. That wouldn’t imply that all homeless people are criminals, but a response might be to criminalize homelessness,
which would be unjust. Statisticians must remember that data often represent people and that our work has real human
consequences.
Intro to Data, page 14
that human papillomavirus (HPV) causes cervical cancer11 and that smoking causes lung cancer12. This is
not because researchers gave people HPV or forced them to smoke: we know these things thanks to
careful use of observational data.
We won’t be focusing on causal inference from observational studies in this course, but it’s an active
area of statistical research and is commonly used in the social sciences.
Causal Relationships
In short, observational studies are good at showing that a relationship exists. It is difficult to use data
from observational studies to show why a relationship exists.
11
Walboomers, J.M.M., Jacobs, M.V., Manos, M.M., Bosch, F.X., Kummer, J.A., Shah, K.V., Snijders, P.J.F., Peto, J., Meijer,
C.J.L.M. and Muñoz, N. (1999), Human papillomavirus is a necessary cause of invasive cervical cancer worldwide. J. Pathol.,
189: 12-19. doi:10.1002/(SICI)1096-9896(199909)189:1<12::AID-PATH431>3.0.CO;2-F
12
Parascandola, M., Weed, D.L. & Dasgupta, A. Two Surgeon General's reports on smoking and cancer: a historical
investigation of the practice of causal inference. Emerg Themes Epidemiol 3, 1 (2006). https://doi.org/10.1186/1742-7622-
3-1
Intro to Data, page 15
The Four Principles of Experimental Design
1. Controlling: It’s generally important to have a comparison group in a study in order to monitor
how the treatment group performs relative to something you already understand. This control
group lets you understand what the effect of the treatment is: do fewer students given the
antidepressant relapse compared to students who weren’t given the antidepressant?
The control group is also designed to reduce or eliminate the effects of any other variables that
might influence the result. In the dolphin study on page 1, the control group was still brought to
the island near Honduras so they would be able to say that it was the dolphins, not the tropical
environment, that caused the difference in symptom improvement. In our vaping example, how
might we use controlling to make sure it’s the antidepressants that cause lower relapse rates?
It is common for both researcher and subject expectations to influence the results of an
experiment. How might this appear in our example?
To control for this effect, researchers could give the control group a placebo, an inert pill that
does not contain antidepressant medication. Since the students in both the treatment group
and the control group would both believe they might be receiving medication, the placebo
effect would be spread evenly among the groups. Studies where study participants do not know
which group they are in are called blind.
To control for researcher expectations, studies are sometimes conducted double-blind. In the
context of this example, the medical professional evaluating student responses would also not
know who was in the control group and who was in the treatment group.
2. Randomization: Part of the reason it’s hard to make causal conclusions from observational
studies that it’s hard to account for every possible confounding variable. Researchers
randomize the assignment of treatments to each of its cases to account for confounding
variables that cannot be controlled (or that they don’t know they should control). For example,
some students might be more likely to relapse into vaping because individuals in their friend
network also vape. Randomizing students into the treatment and control groups tends to even
out these differences and produce groups that are comparable. (This is what gets us causal
conclusions.)
Note: Randomized controlled experiments, while not perfect, are the “gold standard” for causal
inference.
Aside: Designing new kinds of experiments is a really active research area in Statistics these days,
because scientists are asking more questions that more traditional experiments can’t answer well. One
example is the sequential, multiple-assignment randomized trial (SMART), which helps researchers
answer questions about how to develop sequences of treatments which can adapt to patients’
changing needs over time.13 There’s also work being done on randomized trials to help develop mobile
health (“mHealth”) apps.
13
Almirall D., Nahum-Shani, I., Sherwood, N.E., Murphy S.A. (2014). Introduction to SMART Designs for the Development of
Adaptive Interventions: With Application to Weight Loss Research. Translational Behavioral Medicine, 4:260-274. DOI:
10.1007/s13142-014-0265-0
Intro to Data, page 17
Controlling:
Randomization:
Replication:
Blocking:
I took a random sample of 20 students who were seniors in high school in Michigan. Let’s take a look at
the values for the travelTime and texts variables for these 20 students.
Student travelTim texts Student travelTim texts
e e
1 6 150 11 13 45
2 20 8 12 3 20
3 10 7 13 15 20
4 3 0 14 30 30
5 5 30 15 3 20
6 17 30 16 7 25
7 10 40 17 10 30
8 12 54 18 6 200
9 10 4 19 7 150
10 3 20 20 15 5
It’s difficult to look at a set of data, even one this small, to figure out the story the data set is telling us.
c. What if the data value for the student who travels 30 minutes to school was incorrectly entered
as 60 minutes instead of 30 minutes? How would the mean change? The median?
Mean: the numerator in the calculation of the mean would be 235, so the mean would be
235
x= =11.75 minutes
20
Intro to Data, page 20
Median: since the middle two values are the same as they were without the data entry error,
the median is still 10 minutes
Notes: The mean is sensitive to extreme observations (can change dramatically due to a few extreme
observations). The median is resistant/robust to extreme observations.
Describing Variability
Midterm exams are returned, and the “average” was reported as 76 points out of 100 points. You
received a score of 88 points. Under which of the following scenarios did you perform better?
Often what is missing when the central tendency of something is reported is a corresponding measure
of variability or “spread” that describes how tightly or loosely clustered the observations in the data set
are clustered around that measure of central tendency. A measure of variability is perhaps the most
important quantity in statistical analysis. Here we discuss several measures of variation, each useful in
some situations, each with some limitations.
Note: We will stay away from the word “spread” when we talk about variability in class because the
word has many meanings, some of which cause confusion for students learning statistics. 14 The authors
(Kaplan, Rogness, and Fisher) have an excellent example of the issues with the word “spread”:
Said the statistician from west Texas…
Howdy! Welcome to my spread and make yourself at home. Go ahead and spread out your papers and
help yourself to all the food – it's quite a spread , huh? Be sure to try some of that blueberry spread ;
just spread a little on a cracker. Yum! And take a look at that fancy tablecloth; my grandma made
that spread for me. Once you're all settled in we'll open up that spread sheet and see if we can't figure
out the spread of those data.
Only the last instance of the word “spread” is actually related to the idea of variability.
14
Kaplan, J.J., Rogness, N.T. and Fisher, D.G. (2012), Lexical ambiguity: making a case against spread . Teaching Statistics,
34: 56-60. doi:10.1111/j.1467-9639.2011.00477.x
Intro to Data, page 21
Range and IQR
One way to describe the variability in travel times would be to compute the range. Statisticians tend to
think of the range as the difference between the maximum and minimum values:
Range = maximum – minimum
Calculate the range for the travel times to school for our 20 Michigan high school seniors.
Range = maximum – minimum = 30 – 3 = 27 minutes
The range is easy to calculate, but there’s a catch… Since the range consists of the minimum and
maximum values, it is impacted greatly by extreme points in the data set.
What would happen to the range if the data value for the student who travels 30 minutes to school
was incorrectly entered as 60 minutes instead of 30 minutes?
The range only uses two observations to describe the variation in an entire data set, regardless of the
sample size. There are obviously situations where it will not do a particularly good job, especially when
the maximum or minimum observed values are extreme relative to the other data.
Another measure of variation, called the interquartile range (IQR), tries to address this issue. To
understand how the IQR works, we must first introduce the idea of percentiles.
Percentiles
The pth percentile is the value such that p% of the observations fall at or below that value.
Some common percentiles are the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the
75th percentile). The median is the second quartile (the 50th percentile), but we typically don’t call it
anything other than the median.
Whereas the range describes the variability for 100% of the data, the IQR describes the variability of
the middle 50% of the data.
Five-Number Summary
Sometimes we will choose to summarize our data in a five-number summary, which includes the
minimum, Q1, the median, Q3, and the maximum. Together these five values give a quick overview of
the data set. It’s helpful to display the five-number summary in a table. Here is the five-number
summary for the travelTime variable, which we can pull directly from the R output (or we can use
the values we calculated by hand):
Even though the IQR is not sensitive to extreme values, it still suffers from the same problem as the
range: it only uses two of the data points from the data set. We want all of our data to contribute to
the measure of variability, not just the two data points that happen to be at the 25 th and 75th
percentiles.
It might be helpful if we define a deviation before we talk about the standard deviation. A deviation
tells us how much an observation departs from the mean:
deviation = observation – mean = x−x
Intro to Data, page 23
It seems like it would make sense for us to talk about the average deviation, but that’s actually
problematic. Why?
As a workaround, we commonly square the deviations before adding them all up. This is, somewhat
unimaginatively, called the sum of squares, and will always be positive. Problematically, the sum of
squares will increase with every additional observation.
A simple solution would be to take the sum of squares and divide it by the number of observations to
find a mean squared deviation. This works well if we are dealing with population data, but consistently
underestimates the population variance when we use sample data.
Because we most often deal with sample data, when calculating the mean squared deviation we
instead divide the sum of squares by n−1. This is called the variance of the data. If we take the square
root of the variance, we call that the standard deviation. In this course, we emphasize the standard
deviation over the variance since the standard deviation is in the original units of the data (making it
much friendlier to work with and understand).
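The whole chain can be sketched in R with the same kind of made-up vector; var() and sd() do all of this for us:

```r
x <- c(2, 4, 9)
n <- length(x)

ss <- sum((x - mean(x))^2)  # sum of squares: 9 + 1 + 16 = 26
ss / (n - 1)                # variance: 26 / 2 = 13
sqrt(ss / (n - 1))          # standard deviation: about 3.61, back in the original units
var(x)                      # 13, matching the hand calculation
sd(x)                       # about 3.61
```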
Calculating the range by hand is trivial; calculating the IQR by hand takes a little more work; calculating
the standard deviation by hand is cruel, especially for a large data set. We will never ask you to
calculate the standard deviation by hand! Here’s the one line we need to get the standard deviation of
the travel times for our 20 students:
sd(travelTime)
[1] 6.812334
We won’t focus on a precisely worded interpretation of the standard deviation. It’s more important to
us that you understand what the standard deviation is. The standard deviation is our measure of
variability (the “wiggle room”) we use when we talk about how data are distributed. In this example,
we might say something like, “Travel times for our 20 high school seniors in Michigan average 10.25
minutes give or take about 6.81 minutes.”
s = 0 means…
As with food and fine wines, there are natural pairings between measures of center and variability:
Median and IQR
Mean and standard deviation
a. What are the observational units (cases) in this scenario? What is the outcome variable?
b. Arrange these professors in order from smallest to largest standard deviation of their ratings.
Example
The table below asks you to compare the standard deviations of two data sets. Without doing any
calculations, choose one of the four statements below to describe the relationship between the data
sets compared.
i. The quantity in column A is greater.
ii. The quantity in column B is greater.
iii. The two quantities are equal.
iv. The relationship cannot be determined from the given information.
While dot plots are simple, they are not a good choice
when we have either a large amount of discrete data or a
small sample of a continuous variable. Look at this dot plot
of the total rooms in a larger sample of 2,930 homes in Ames
(a large number of discrete observations). This is not a
particularly useful plot! We have so many homes that it’s not
possible to fit all the dots in the plot (there are 844 homes
with 6 rooms in this larger sample, and only 1 home with 2
rooms). The scale of the y-axis is very limiting in this case.
Now consider a dot plot of the living areas for a random sample of 50 homes (a small sample of a
continuous variable). In this plot, there are only two stacked dots because all of the living areas (except two) take on unique
values. Looking at this plot might give us some sense of the variability of the data (notice that cluster of
dots around 1600 ft² and how the dots get more spaced out towards the edges of the plot), but we can
do better.
Let’s look at the values of livingArea for our random sample of 50 homes. The living areas (in square
feet) for these homes (in increasing order) are:
864 1078 1341 1535 1629 1698 1752 1947 2110 2599
882 1097 1374 1564 1655 1704 1804 1960 2110 2622
896 1187 1430 1566 1656 1720 1822 2035 2250 2696
1004 1324 1465 1595 1659 1733 1839 2073 2270 3238
1056 1329 1468 1604 1674 1744 1845 2084 2475 3279
How might we go about creating a graphical summary of the living area of the homes? These
measurements do vary. How do they vary? What is the range of values? What is the pattern of
variation?
The following are a frequency table and a histogram for these data:
Summary Table

Class Interval   Frequency (or count)
 500 – 1000        3
1000 – 1500       12
1500 – 2000       22
2000 – 2500        8
2500 – 3000        3
3000 – 3500        2
Total             50
Note: each bar (or “bin”) represents a class, and the base of the bar covers the class. The table and
histogram above show the distribution of the numerical variable livingArea; that is, they show which
values the variable takes and how often each value occurs.
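These counts and the histogram can be reproduced in R; the values below are the 50 living areas listed earlier:

```r
# The 50 living areas (sq ft) from the random sample of homes
livingArea <- c( 864,  882,  896, 1004, 1056, 1078, 1097, 1187, 1324, 1329,
                1341, 1374, 1430, 1465, 1468, 1535, 1564, 1566, 1595, 1604,
                1629, 1655, 1656, 1659, 1674, 1698, 1704, 1720, 1733, 1744,
                1752, 1804, 1822, 1839, 1845, 1947, 1960, 2035, 2073, 2084,
                2110, 2110, 2250, 2270, 2475, 2599, 2622, 2696, 3238, 3279)

# Bin into the class intervals from the summary table and count
table(cut(livingArea, breaks = seq(500, 3500, by = 500)))  # 3 12 22 8 3 2

# Histogram using the same class intervals
hist(livingArea, breaks = seq(500, 3500, by = 500))
```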
Robust Statistics
Consider the summary statistics we’ve explored so far this chapter: mean, median, standard deviation,
range, IQR. The median and IQR are statistics that are robust to outliers, while the mean, range, and
standard deviation are sensitive to outliers.
Consider the following histograms. For each histogram, determine whether you would expect the
mean and median values to be approximately equal, for the mean > median, or for the mean < median.
15 From Lock, Lock, Lock Morgan, Lock, and Lock’s Statistics: Unlocking the Power of Data (2nd edition),
John Wiley & Sons, 2017.
Example
Use the following histogram and summary statistics to describe the gross sales (in millions, USD) for
the 50 top-ranked movies in 2019.
Box Plots
In general, a box plot is a data visualization of the five-number summary. A box plot is comprised of a
box (shocking, we know), a line marking the location of the median, and “whiskers.” Box plots can be
vertical or horizontal (the default in R produces a vertical box plot):
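In R a box plot is a single call; a sketch with a hypothetical numeric vector:

```r
times <- c(2, 5, 5, 8, 10, 12, 15, 20, 25, 30)  # hypothetical data

boxplot(times)                     # vertical box plot (the default)
boxplot(times, horizontal = TRUE)  # horizontal version
```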
Notes:
Sometimes points flagged as potential outliers in a box plot are not really outliers. When the
distribution of the data is skewed one direction or the other, points flagged as potential outliers
may really just be part of the long tail. Be sure to examine a histogram in addition to the box
plot before declaring the potential outliers to be outliers.
Be careful when using a box plot to determine the shape of a distribution. Box plots tell us
nothing about the number of modes in a distribution. Here are two identical box plots with
identical five-number summaries (min = 0, Q1 = 25, Q2 = 50, Q3 = 75, max = 100), but very
different shapes!
A recent study examines the relationship between class start times, sleep, circadian preference,
alcohol use, academic performance, and other variables in college students. The data were obtained
from a sample of n = 253 students who completed skills tests to measure cognitive function, completed
a survey that asked many questions about attitudes and habits, and kept a sleep diary to record time
and quality of sleep over a two-week period. Below are some of the recorded variables.
16 Exercise 1.37, ISRS
Variable Coding
gender 0 = female, 1 = male
classYear Year in school, 1 = first year, …, 4 = senior
larkOwl Early riser or night owl? Lark, Neither, Owl
classesMissed Number of classes missed in a semester
earlyClass 0 = no early class, 1 = early class
anxietyScore Measure of amount of anxiety
Does the proportion of students who have an early class differ across the class year for these students?
This question asks about the association/relationship between two categorical variables. To answer it,
we’ll need a contingency table, where the categories for one variable are listed across the rows and
the categories for the second variable are listed across the columns. Each cell of the table contains the
(relative) frequency of cases that are in the joint categories of a given row and column.
Consider the two-way table below, which displays the relationship between earlyClass and classYear.

                         Class Year
               First year  Sophomore  Junior  Senior  Total
Early   Yes            39         64      33      32    168
Class?  No              8         31      21      25     85
        Total          47         95      54      57    253
Note that the solutions to (c) and (d) are talking about the same 64 students! The questions “what
proportion of sophomores have an early class?” and “what proportion of students with an early class
are sophomores?” sound similar but are asking different questions.17
17 One way I remember this: The proportion of U.S. Senators who identify as female and the proportion of
people who identify as female who are U.S. senators are clearly not the same! The first is upsettingly
small (27%), the second is understandably very small (27 out of about 167 million).
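A sketch in R that rebuilds the two-way table and computes both proportions (every count is taken from the table above):

```r
# Counts from the earlyClass-by-classYear table
early <- matrix(c(39, 64, 33, 32,
                   8, 31, 21, 25),
                nrow = 2, byrow = TRUE,
                dimnames = list(earlyClass = c("Yes", "No"),
                                classYear  = c("First year", "Sophomore", "Junior", "Senior")))

prop.table(early, margin = 2)["Yes", "Sophomore"]  # of sophomores, proportion with an early class: 64/95
prop.table(early, margin = 1)["Yes", "Sophomore"]  # of early-class students, proportion who are sophomores: 64/168
```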
Bar plots (or Bar Graphs or Bar Charts)
Several types of graphs can display the relationship between two categorical variables. One is a
segmented or side-by-side bar plot and the other is a mosaic plot. We will not discuss mosaic plots in
Stats 250, but feel free to read about them in the book. Let’s consider both a segmented (or stacked)
bar plot and a side-by-side bar plot for displaying the data in our two-way table.
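Both displays come straight from barplot() applied to a matrix of counts; a sketch, with the counts taken from the two-way table above:

```r
# Counts from the earlyClass-by-classYear table
early <- matrix(c(39, 64, 33, 32,
                   8, 31, 21, 25),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Early: Yes", "Early: No"),
                                c("First year", "Sophomore", "Junior", "Senior")))

barplot(early, legend.text = TRUE)                 # segmented (stacked) bar plot
barplot(early, beside = TRUE, legend.text = TRUE)  # side-by-side bar plot
```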
Consider each of the following questions. After answering each, consider which graphical display does
a better job at helping you toward your answer.
a. Were more juniors or more seniors present in the sample of n=253 students? Which plot does a
better job of showing this?
b. Do a greater number of first-year or sophomore students have early classes at least once per week?
Which graph is more helpful to answer this question?
c. Do a greater percentage of first-year or sophomore students have early classes at least once per
week? Which graph is more helpful to answer this question?
d. Which of these two plots provides a better way of analyzing the relationship between earlyClass
and classYear? Explain.