
Introduction to Data and Data Collection

Chapter 1

Introduction
Stats 250 is a course that will develop your statistical thinking abilities. Regardless of your future path
with statistics (practitioner or consumer), it is important to know how data can be used to answer
research questions. What are the best ways to collect the data needed to answer a research question?
How can we summarize and display data in ways that allow the data to tell its story? What conclusions
can be made from the data?

Traditionally, the statistical investigation process can be summarized as follows:1


1. Ask a research question
2. Design a study and collect data
3. Explore the data, providing graphical displays and numerical summaries
4. Use statistical analysis methods to draw inferences from the data
5. Formulate conclusions, communicate the results, and answer the research question
6. Reflect and look forward (point out limitations and suggest further studies)

As data science continues to grow, we now see data sets that are not collected to answer a research
question. In cases like these, the statistical investigation process may look something like the following:
1. Wrangle or import the data
2. Tidy the data2
3. Explore the data, providing graphical displays and numerical summaries
4. Use statistical analysis methods to draw inferences from the data
5. Formulate conclusions, communicate the results, and answer the research question
6. Reflect and look forward (point out limitations and suggest further studies)

Regardless of how the data were collected, the ideas and techniques that you learn in Stats 250 will
start you on your journey of statistical thinking. It would be impossible to learn all statistical
techniques in one term, but we endeavor to teach you a few core data collection techniques, some
core data analysis ideas, and fundamental concepts of statistical inference.

We begin with a short case study.

1 Adapted from Tintle et al. (2021). Introduction to Statistical Investigations (2nd edition). John Wiley & Sons, Inc., and
Carnegie et al. Montana State Introductory Statistics with R.
2 “Tidying your data means storing it in a consistent form that matches the semantics of the dataset with the way it is
stored. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important
because the consistent structure lets you focus your struggle on questions about the data, not fighting to get the data into
the right form for different functions.” (Wickham, H. and Grolemund, G. R for Data Science,
https://r4ds.had.co.nz/index.html)
Intro to Data, page 1
Section 1.1 Case Study – Dolphin Therapy
Swimming with dolphins can certainly be fun, but is it also therapeutic for patients suffering from
clinical depression? To investigate this possibility, researchers recruited 30 participants aged 18-65
with a clinical diagnosis of mild to moderate depression (Antonioli & Reveley, 2005).3 The study
participants were required to discontinue use of any antidepressant drugs or psychotherapy four
weeks prior to the experiment and for the duration of the experiment. After we examine the data, we
will talk about the ethics of this study.

These 30 individuals went to an island off the coast of Honduras, where they were randomly assigned
to one of two treatment groups. Both groups engaged in the same amount of swimming and
snorkeling each day (the outdoor nature program), but one group did so in the presence of bottlenose
dolphins while the other (control) group did not.

Each person’s level of depression was evaluated at the beginning of the study and then again at the
end. In the dolphin therapy group, 10 of the 15 participants showed substantial improvement, while in
the control group, 3 of 15 participants showed substantial improvement.

The results from the study are summarized in the contingency table below:

Treatment          Showed substantial improvement   Did not show substantial improvement   Total
Dolphin therapy                  10                                   5                      15
Control group                     3                                  12                      15
Total                            13                                  17                      30

a. What proportion of study participants receiving dolphin therapy showed substantial improvement?

b. What proportion of study participants in the control group showed substantial improvement?

c. Does the dolphin therapy appear to be more effective?

3 Antonioli, C., & Reveley, M. (2005). “Randomized controlled trial of animal facilitated therapy with dolphins in the
treatment of depression.” British Medical Journal, 331(7527), 1231 – 1234.
d. What concerns, if any, do you have with the study as it was described above?
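For questions (a) and (b), the group proportions can be computed directly from the table counts; here is a minimal Python sketch:

```python
# Counts from the contingency table above.
dolphin_improved, dolphin_total = 10, 15
control_improved, control_total = 3, 15

# Proportion showing substantial improvement in each group.
p_dolphin = dolphin_improved / dolphin_total
p_control = control_improved / control_total

print(round(p_dolphin, 3))               # 0.667
print(round(p_control, 3))               # 0.2
print(round(p_dolphin - p_control, 3))   # 0.467
```

The improvement rate in the dolphin therapy group (about 67%) is more than three times that of the control group (20%), which bears on question (c); whether a difference this large could plausibly arise by chance alone is the kind of question statistical inference will address later in the course.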

Section 1.2 Data Basics


Data sets don’t come to us all neatly summarized so that we can immediately see the story that the
data tell us. We need to organize our data so that we can effectively describe the data. Data can be
messy! Cleaning data is both an art and a science. We won’t get into data cleaning in this course, but
you may wind up doing it in the future. In this section, we talk about one structure for organizing data
and introduce some terminology that will be used throughout the course.

Observations, Variables, and Data Matrices


The table below displays all rows of a data set concerning 50 single-family homes that sold between
2006 and 2010 in Ames, Iowa. These observations will be referred to as the ames50 data set, and they
are a random sample from a larger data set.4

yearBuilt exterior livingArea bathFull bathHalf bedrooms fireplaces garageCars salePrice
1960 BrkFace 1656 1 0 3 2 2 215000
1961 VinylSd 896 1 0 2 0 1 105000
1958 WdSdng 1329 1 1 3 0 1 172000
1968 BrkFace 2110 2 1 3 2 2 244000
1997 VinylSd 1629 2 1 3 1 2 189900
1998 VinylSd 1604 2 1 3 1 2 195500
1999 VinylSd 1804 2 1 3 1 2 189000
1993 HdBoard 1655 2 1 3 1 2 175900
1992 HdBoard 1187 2 0 3 0 2 185000
1998 VinylSd 1465 2 1 3 1 2 180400
1990 HdBoard 1341 1 1 2 1 2 171500
2003 CemntBd 3279 3 1 4 1 3 538000
1988 WdSdng 1752 2 0 4 0 2 164000
1951 VinylSd 864 1 0 2 0 2 141000
1978 Plywood 2073 2 0 3 2 2 210000
2000 VinylSd 1674 2 1 3 0 2 216000
1970 WdSdng 1004 1 0 2 1 2 149000
1971 VinylSd 1078 1 1 3 1 2 149900
1968 VinylSd 1056 1 0 3 1 1 142000
1970 Plywood 882 1 0 2 0 2 126000
4 Data courtesy of Dean De Cock, retrieved from https://doi.org/10.1080/10691898.2011.11889627. The larger data set
contains 2930 observations.

2007 VinylSd 1704 2 0 3 1 3 306000
2005 VinylSd 1698 2 0 3 1 3 275000
2005 CemntBd 1822 2 0 3 1 3 259000
2004 VinylSd 1535 2 0 3 0 2 214000
2003 VinylSd 2696 2 1 3 2 3 500000
2002 VinylSd 2250 2 1 3 1 3 320000
2005 VinylSd 1324 2 0 3 0 2 199500
2004 VinylSd 1374 2 1 3 1 2 184500
2002 VinylSd 1960 2 1 4 2 2 216500
2004 VinylSd 1733 2 1 3 1 2 185088
2000 VinylSd 1430 2 1 3 1 2 180000
2001 VinylSd 2035 2 1 3 1 2 222500
1999 VinylSd 2599 2 1 4 1 3 333168
1998 VinylSd 2475 2 1 4 1 3 355000
1996 VinylSd 1720 2 0 3 1 2 260400
1994 HdBoard 2622 2 1 3 2 2 325000
1999 VinylSd 2270 2 1 4 1 3 290000
1998 VinylSd 1839 2 1 4 0 2 221000
1995 VinylSd 3238 2 1 4 1 3 410000
2005 VinylSd 1595 2 0 2 1 3 221500
2008 VinylSd 1566 2 0 3 0 2 262500
2004 VinylSd 1947 2 1 3 1 2 254900
2005 VinylSd 1468 2 0 2 1 2 271500
2004 VinylSd 2084 2 1 4 0 2 233000
2004 VinylSd 1659 2 1 3 0 2 181000
1994 VinylSd 2110 2 1 3 2 2 205000
1992 WdSdng 1845 2 1 3 1 2 189000
1993 HdBoard 1744 2 1 3 0 2 194500
1984 HdBoard 1097 2 0 3 0 2 152000
1980 HdBoard 1564 2 1 3 1 2 171000

Each row in the table represents one single-family home (sold between 2006 and 2010 in Ames, Iowa),
or case (or observational unit). The columns represent characteristics, called variables, for each of the
homes. The variables are explained below:
variable          description
yearBuilt         Original construction year
exteriorMaterial  Primary exterior covering on house (BrkFace indicates Brick Face,
                  CemntBd indicates Cement Board, HdBoard indicates Hard Board,
                  Plywood, VinylSd indicates Vinyl Siding, WdSdng indicates Wood Siding)
livingArea        Living area (square feet)
bathFull          Number of full baths
bathHalf          Number of half baths
bedrooms          Number of bedrooms
fireplaces        Number of fireplaces
garageCars        Size of garage in car capacity
salePrice         Sale price (USD)

For example, the first row (repeated below) represents a single-family home built in 1960 that has
Brick Face as its exterior material, 1656 square feet of living space, 1 full bathroom, no half bathrooms,
3 bedrooms, 2 fireplaces, a 2-car garage, and sold for $215,000.
yearBuilt exterior livingArea bathFull bathHalf bedrooms fireplaces garageCars salePrice
1960 BrkFace 1656 1 0 3 2 2 215000
The full table represents a data matrix, which is a common way to organize raw, unprocessed data.
Data matrices are a convenient way to record and store data. Another observation can easily be added
as a new row at the bottom of the matrix, or another column could be added to represent a new
variable recorded for each case.
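The row/column structure of a data matrix can be sketched in plain Python; the three rows below are the first three homes shown above, and pricePerSqFt is a hypothetical derived variable added purely for illustration:

```python
# A tiny data matrix in the spirit of the ames50 data set, stored as a
# list of rows; each row (dict) is one case, and each key is a variable.
ames = [
    {"yearBuilt": 1960, "exterior": "BrkFace", "livingArea": 1656, "salePrice": 215000},
    {"yearBuilt": 1961, "exterior": "VinylSd", "livingArea": 896,  "salePrice": 105000},
    {"yearBuilt": 1958, "exterior": "WdSdng",  "livingArea": 1329, "salePrice": 172000},
]

# Adding an observation = appending a new row at the bottom:
ames.append({"yearBuilt": 1968, "exterior": "BrkFace",
             "livingArea": 2110, "salePrice": 244000})

# Adding a variable = recording a new value for every case
# (price per square foot, a hypothetical derived variable):
for home in ames:
    home["pricePerSqFt"] = home["salePrice"] / home["livingArea"]

print(len(ames), "cases,", len(ames[0]), "variables")  # 4 cases, 5 variables
```

In practice, a data matrix like this is usually handled with a data-frame library (e.g., data frames in R), but the idea is the same: rows are cases, columns are variables.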

Types of Variables
Examine the variables in the ames50 data set. What similarities do you see among the variables? What
differences do you see? All variables are either numerical or categorical.
• Numerical variables (also called quantitative or measurement variables) take on a wide range of numerical values, and it is sensible to do math (e.g., addition, subtraction, averaging) with numerical variables.
• Categorical variables (also called qualitative variables) place an individual or item into one of several groups or categories, which are called levels.
Different types of variables provide different kinds of information. The variable type will guide what
kinds of summaries (graphs/numerical) are appropriate.

Numerical and categorical variables can be broken down further, as shown in the following diagram and explained below.

Numerical: Discrete and Continuous


Numerical variables can be divided into two groups: discrete and continuous. A quantitative variable is
discrete when it can only take numerical values with jumps. The number of dogs in a household is an
example of a discrete variable. A quantitative variable is continuous when it can take on any value in
an interval or collection of intervals. The daily temperature in Ann Arbor is an example of a continuous
variable. Note: We often treat variables such as income as continuous even though they take on
distinct values.

Categorical: Nominal and Ordinal


Categorical variables can be divided into two groups: nominal and ordinal. The brand of soda you might
order with a take-out meal {Coke, Diet Coke, Sprite, etc.} is an example of a nominal variable. The size
of the drink you order {Small, Medium, Large} is also a category, but there is a logical order in which
they should be placed: a small drink holds less than a medium, which holds less than a large. As a consequence,
the size of the drink you order is considered an ordinal variable. We don’t need to differentiate
between nominal and ordinal variables for our work in Stats 250, so don’t get yourself caught up in a
nominal/ordinal debate.

Numerical Variables: Grouping Them Up


Numerical variables can also have different subtypes, but interestingly, it is sometimes possible (or
even preferable!) to group up ranges of a numerical variable and treat it as an ordinal, or even a
nominal, variable. For instance, age is often reported as 18-24, 25-34, 35-44, etc. which would make it
an ordinal variable. Age groups like {infant, child, teenager, adult, elderly} could be treated as a
nominal or ordinal variable.

Note: While it’s possible to turn numerical variables into categorical ones by creative grouping, it is not
possible to go the other direction and change categorical variables into numerical ones, even if
sometimes a categorical variable is coded using numbers. Responses to the question “What is your
marital status?” might be coded in a dataset as 1-Single, 2-Married/Partnered, 3-Separated/Divorced,
or 4-Widow, but the variable would still be a categorical variable.
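A sketch of this kind of grouping in Python, using the age brackets mentioned above (the cut points for the "under 18" and "45+" brackets are assumptions for illustration):

```python
def age_group(age):
    """Map an exact (numerical) age to an ordinal age bracket."""
    if age < 18:
        return "under 18"
    elif age <= 24:
        return "18-24"
    elif age <= 34:
        return "25-34"
    elif age <= 44:
        return "35-44"
    else:
        return "45+"

ages = [19, 22, 31, 40, 67]
print([age_group(a) for a in ages])
# ['18-24', '18-24', '25-34', '35-44', '45+']

# Going the other direction does not work: a coded categorical variable
# like marital status (1 = Single, 2 = Married/Partnered, ...) is still
# categorical -- the numbers are labels, and arithmetic on them is meaningless.
marital_labels = {1: "Single", 2: "Married/Partnered",
                  3: "Separated/Divorced", 4: "Widow"}
```

Note that once ages are grouped this way, information is lost: from the bracket "25-34" alone you can no longer recover the exact age.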

Identify the type of each variable in the ames50 dataset:


variable name      variable type        variable name      variable type
yearBuilt                               bedrooms
exteriorMaterial                        fireplaces
livingArea                              garageCars
bathFull                                salePrice
bathHalf

Example: More Variable Typing


For each variable listed below, determine whether it is categorical or numerical. If it is numerical,
indicate whether it is discrete or continuous. The answers will be posted on Canvas.
• Whether or not you were born in Michigan
• Day of the week on which you were born
• Distance (in miles) you currently are from where you were born
• Number of the original seven Harry Potter books you have read
• Hand you use to write
• Amount of sleep (in hours) you have gotten in the past 24 hours
• Whether you slept at least 7 hours in the past 24 hours
Note: The size of a sample (typically denoted as n), while it is a number, is not a variable. A variable is
something that you can measure on each subject in the sample, while sample size is a characteristic of
the study as a whole.

Section 1.3 Data Collection Principles


Consider each of the following questions:
1. How is global warming influencing the health of coral reefs in the Pacific Ocean?
2. Over the last 5 years, what is the average time to complete a degree for University of Michigan
undergraduate students?
3. Do students who self-identify as ‘night owls’ have higher GPAs than those who consider
themselves ‘morning larks’?
4. Does having a pet around lessen the anxiety of those diagnosed with generalized anxiety
disorder?
5. Is gestation length associated with life expectancy among mammalian animal species?
6. Does exposure to light pollution at night influence weight gain?

Each research question is about a population, the entire group we are interested in learning about. For
example, question 2 is about all undergraduate students at the University of Michigan. Typically, we
cannot answer these questions definitively because we would need to observe every case in the
population. This usually takes too long, costs too much, and—for some research questions—actually
destroys the item in the process of measurement (e.g., the breaking strength of a wire rope).

Instead of measuring every item in a population, we take a sample, a subset of the cases that is often a
small fraction of the overall population. For instance, we might speak to 20–30 (or some other number)
University of Michigan alumni and ask them how long it took them to complete their undergraduate
degrees; their responses could be used to provide an estimate of the average time to complete a
degree for the overall population of undergraduate students.



Example: Bad samples
Below are two conclusions corresponding to the first two research questions posed above. Each
conclusion is based on sample data, but there is a problem—can you spot it?
1. Last year was the warmest year on record at a local Hawaiian beach resort, and the nearby
coral reef was completely dead. Thus, rising temperatures must be damaging the health of coral
reefs in the Pacific.
2. I met two University of Michigan students who took more than 7 years to graduate from
Michigan, so it must take longer to graduate at Michigan than at other universities.

Each of these conclusions appears to be based on anecdotal evidence! Anecdotal evidence is typically
composed of unusual cases that are recalled based on their striking characteristics. Instead of using
these unusual cases to draw conclusions about a population, we should examine a sample of many
cases from this population and be cautious about making inferences.

Sampling from a Population


Suppose we would like to estimate the average amount of time University of Michigan undergraduates
spent studying for final exams last term. Our population consists of all UM undergraduate students
enrolled in classes last term. What is the best way to take a sample of 500 UM undergraduate students
enrolled in classes last term? Think about how you would do this.

In order to draw inferences about a population (e.g., all UM undergraduates enrolled in classes last term) from a sample (e.g., 500 undergraduate students enrolled in classes last term), we want to be reasonably sure that the sample is representative of the entire population.

The best way to ensure a sample is representative of the population from which it was drawn is to ensure all the observations that comprise it were selected randomly. Representative samples allow us to generalize the results from the sample to the population. To the right is the big picture5 of what we are doing in statistics.

5 Images on this page courtesy of Kari Lock Morgan
Two Primary Forms of Data Collection
There are two primary types of data collection: observational studies and experiments. Observational
studies (covered in Section 1.4 of the textbook) refer to instances where researchers collect data in a
way that does not directly interfere with how the data arise. Experiments (covered in Section 1.5 of
the textbook), on the other hand, refer to instances in which researchers directly influence the process
by which data arise. Usually, this involves assigning study participants to one or more treatments. Soon
we will learn key distinguishing factors between observational studies and experiments.

Section 1.4 Observational Studies & Sampling Strategies


Explanatory and Response Variables
Often, observational studies are interested in looking at the relationship between two or more
variables. Consider a recent study published in the journal Nature,6 where researchers were interested
in infant sleeping habits and their effect on later health developments. One question of interest was:
Are infants raised without exposure to total darkness more likely to suffer from myopia
(nearsightedness) later in life?

The researchers suspected darkness during sleep influences vision later in life, so darkness can be
considered the explanatory variable of the study and vision can be considered the response variable.
Identifying explanatory & response variables
To identify the explanatory variable in a pair of variables, identify which of the two is suspected of
affecting the other.

Note: You may find in your field that the explanatory variable is called the independent variable (or the
predictor) and that the response variable is called the dependent variable (or outcome). In Stats 250,
we avoid using the terms “independent” and “dependent” so that the ideas don’t get confused with
the ideas that a pair of variables may be independent or dependent.

In the study above, researchers recorded whether each child selected to be in the study slept with or
without a night light. They returned to each child years later and found that those who slept with
night-lights were much more likely to have developed myopia (nearsightedness) and need glasses.

This is called an observational study. Generally, data in observational studies are collected only by
monitoring what occurs, while experiments require the primary explanatory variable in a study be
assigned to each subject by the researchers.
6 Kaneshi, Y. et al. Influence of light exposure at nighttime on sleep development and body growth of preterm infants. Sci. Rep. 6, 21680; doi: 10.1038/srep21680 (2016).
Making causal conclusions based on experiments is often reasonable, depending on how the
explanatory variable was assigned. However, making the same causal conclusions based on
observational data can be difficult, but not impossible. While causal inference is an active area of
statistical research, it will not be a focus of our course. Most observational studies we encounter in
Stats 250 are generally only sufficient to show associations.

Although there are many different types of observational studies, we focus here on a few main ways of
collecting data, along with their benefits and drawbacks, each of which has a good chance of creating a
representative sample.

Simple Random Sampling


A simple random sample (often abbreviated as SRS) of n observations from a population is one in
which each possible sample of that size has the same chance of being the sample that is selected. This
also means that every member/case in the population has an equal chance of being included and there
is no implied connection between the members/cases in the sample.

Simple random sampling is the best way of ensuring that your sample is representative of the
population it is chosen from. Sampling this way allows us to generalize our results from the smaller
sample to the larger population. Note: It is possible for a SRS to not be representative of the
population, but a non-representative sample would be extremely rare. To quote Paul Velleman, “rare
things happen, but they don’t happen to me.”
The image to the right7 shows a simple random sample of 4
people selected at random. Here, individuals 2, 5, 8, and 10
were chosen. Note that, in a simple random sample, each of
the 12 people had the same chance of being chosen and
each sample of size 4 had the same chance of being the
chosen sample.

Example
Consider the cumulative grade point averages (GPAs) of
University of Michigan undergraduate students. To conduct
a simple random sample, we might write the cumulative
GPA for each of the approximately 30,000 undergraduate students at Michigan on a scrap of paper and
randomly jumble them in a bag. (We would need a huge bag!) Thereafter, we could blindly pull a
sample of n of these paper scraps out of the bag.
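The bag-of-scraps procedure can be sketched with Python's random module; the GPA values below are simulated stand-ins for illustration, not real student data:

```python
import random

random.seed(250)  # for reproducibility

# A simulated "bag" of 30,000 GPA scraps (hypothetical values in [2.0, 4.0]).
population = [round(random.uniform(2.0, 4.0), 2) for _ in range(30000)]

# Blindly pull n = 500 scraps out of the bag: random.sample draws without
# replacement, and every possible sample of size 500 is equally likely,
# which is exactly the definition of a simple random sample.
srs = random.sample(population, 500)

print(len(srs))                  # 500
print(sum(srs) / len(srs))       # the sample mean estimates the population mean
```

The same idea scales to any population you can list: number the members, then let the computer do the "jumbling in a bag."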

Stratified Sampling
In stratified sampling, a “divide-and-conquer” sampling strategy, the population is divided into
nonoverlapping groups called strata. The strata are chosen such that each group is similar with respect
7 Source: Dan Kernler, https://faculty.elgin.edu/dkernler/statistics/ch01/1-4.html
to the outcome of interest. Thereafter, a simple random sample (SRS) is taken from each stratum. This
method works best when there is a lot of variability between each stratum, but not much variability
within each stratum.

The following image shows a stratified sample of 4 people. Here, the individuals are divided into 3
strata: one with the 3 blue people, one with the 6 red people, and the other with the 3 green people.
Then, we take a random sample of 1/3 of each stratum—we randomly sample 1 blue person, 2 red
people, and 1 green person. We still have 4 individuals in our sample and we have made sure to have
representation from each stratum.
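The 12-person illustration can be sketched in Python; the color labels below are placeholders for the individuals in each stratum:

```python
import random

random.seed(2)  # for reproducibility

# The population divided into three nonoverlapping strata.
strata = {
    "blue":  ["B1", "B2", "B3"],
    "red":   ["R1", "R2", "R3", "R4", "R5", "R6"],
    "green": ["G1", "G2", "G3"],
}

# Take a simple random sample of one third of each stratum:
# 1 blue person, 2 red people, and 1 green person.
sample = []
for name, members in strata.items():
    k = len(members) // 3
    sample.extend(random.sample(members, k))

print(len(sample))   # 4 individuals, with every stratum represented
```

Sampling each stratum at the same rate keeps the sample's composition matched to the population's, while the within-stratum draws remain simple random samples.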

Example
Instead of randomly sampling from our population of University of Michigan undergraduate students
indiscriminately, we could categorize (or stratify) the students according to their class rank.
Alternatively, we could stratify the students according to their college (e.g., LSA, Art and Design,
Engineering, Kinesiology). Ideally, we’d like to choose a way of stratifying the population such that all
the observations in each stratum are similar with respect to the outcome of interest. Do you think it
would be smarter to stratify by class rank or by college in our GPA example?

Convenience Samples
A convenience sample is a sample obtained by measuring whatever or whoever is available to be
measured. While it’s possible to get interesting information from a convenience sample, such samples
are rarely (if ever) representative of a larger population.

Example
A psychology professor wants to test a new behavioral theory so they tell their large class of 300
students that if they participate in an experiment, they will get extra credit. This sample would only be
able to tell the professor how college students who want extra credit behave, and would not
necessarily be applicable to other groups of people.



Bias: How Sampling Can Go Wrong
While it is unlikely that the results of our sample will represent the population perfectly, we do want
our surveys to be unbiased. Results based on a survey are biased if the method used to obtain those
results would consistently produce values that are either too high or too low. “Bias is the bane of
sampling—the one thing above all to avoid.”8

Example
You want to find out how often high school students text or email while driving. You take a sample of
high school students at local Pioneer High School and find that 35% of those sampled say that they
have texted or emailed while driving. Do you think that this sample is representative of the population
of all high school students who drive? Why or why not?

Selection bias occurs if the method for selecting the participants produces a sample that does not
represent the population of interest.
Nonresponse bias occurs when a representative sample is chosen for a survey, but a subset cannot be
contacted or does not respond.
Response bias occurs when participants respond differently from how they truly feel. The way
questions are worded, the way the interviewer behaves, as well as many other factors might lead an
individual to provide false information.

Example
In order to find out how Ann Arbor residents feel about animal cruelty, volunteers at the Humane Society
of Huron Valley put together a list of all donors to the shelter. They then called a random sample of people
from the list. The results might suffer from what type of bias?

Sampling bias is a type of selection bias that occurs if the method for selecting the participants causes
some individuals in the population to be more or less likely to be included in the sample than others.

Example
Here’s part of the survey methodology section of a recent Gallup poll: “results for this Gallup poll are
based on telephone interviews conducted July 30-Aug. 12, 2020, with a random sample of 1,031
adults, aged 18 and older, living in all 50 U.S. states and the District of Columbia… Landline and cellular

8 Bock, Velleman, De Veaux, and Bullard, Stats 5e (Boston, MA: Pearson Education, Inc., 2019), p. 279.
telephone numbers are selected using random-digit-dial methods.” 9 These methods are much better
than they were in the past when telephone interviews were only conducted with people who had
landline phones (and only in the contiguous United States). Think about populations that will be missed
in telephone surveys…

The Limits of Observational Studies


Recall the Nature study where the research question was:
Are infants raised without exposure to total darkness more likely to suffer from myopia
(nearsightedness) later in life?

It may be tempting to conclude that sleeping with the lights on causes myopia. In fact, it turns out
teenagers with myopia are quite likely to have myopic parents, who in turn were more likely to leave a
light on so that they could see when tending to their infant children throughout the night.
In this case, whether the parents have myopia is an example of a confounding variable.

Confounding variables are variables that are associated with both the explanatory and response
variables.
Confounding variables get in the way of being able to make causal conclusions about the relationship
between explanatory and response variables. In the above example, we can’t say night light use causes
myopia because there’s an alternative explanation for that association: parental myopia.

While one method to justify making causal conclusions from observational studies is to exhaust the
search for confounding variables, there is no guarantee that all confounding variables can be examined
or measured. This is the main reason why it is so difficult to demonstrate a causal relationship with an
observational study. Where feasible, researchers will do an experiment to overcome this obstacle.
Example: Can You Find the Confounding Variable?
Each of the examples below describes an observational study that [incorrectly] attempts to
demonstrate a causal relationship between an explanatory and response variable. First, identify the
explanatory and response variables, and then come up with at least one possible confounding variable.
a. An observational study tracked sunscreen use and melanoma (skin cancer) diagnoses, and it
later found that increases in sunscreen use led to an increased risk of skin cancer.
Explanatory:

Response:

Confounding:

9 https://news.gallup.com/poll/317567/public-reengages-election-early-pandemic-dip.aspx
b. A study examined how external clues influence student performance. Undergraduate students
were randomly assigned to one of four different forms for their midterm exam. Form 1 was
printed on blue paper and contained difficult questions, while Form 2 was also printed on blue
paper but contained simple questions. Form 3 was printed on red paper, with difficult
questions, and Form 4 was printed on red paper with simple questions. The researchers were
interested in the impact that color and type of question had on exam score (out of 100 points).
Suppose we learned that the students in the “blue paper” group performed better on average
over those in the “red paper” group, but that the “blue paper” group were mostly upperclassmen
and the “red paper” group were mostly first- and second-year students.
Explanatory:

Response:

Confounding:

Why Conduct Observational Studies if It’s Hard to Prove Causal Relationships with Them?
In both of the cases above, we can see clear associations between the variables being studied, but we
can’t claim these relationships are causal. It’s worth thinking of causal relationships as describing an
asymmetrical (one-way) relationship between two variables, and associative relationships as describing
symmetrical (two-way) relationships between variables.

For instance, consider two variables of interest to public officials: the percentage of the population that
is homeless and the crime rate. We could claim these variables share an association and point out they
tend to be high in the same places and low in the same places. However, you now know that such an
association does not necessarily mean one causes the other. Furthermore, it’s not clear which causes
which! To claim that homelessness causes crime or that crime causes homelessness would be very
different claims to justify.10 A confounding variable, such as the unemployment rate, could be at play.

This is not to say that it is impossible to draw causal conclusions from observational data. Some major
milestones in medicine, for example, have been achieved thanks to observational studies! Often, it’s
not feasible or ethical to conduct randomized experiments to make causal conclusions, but we know

10 It’s also important to think deeply about the consequences of an analysis which would prove that homelessness causes
crime. That wouldn’t imply that all homeless people are criminals, but a response might be to criminalize homelessness,
which would be unjust. Statisticians must remember that data often represent people and that our work has real human
consequences.
that human papillomavirus (HPV) causes cervical cancer11 and that smoking causes lung cancer12. This is
not because researchers gave people HPV or forced them to smoke: we know these things thanks to
careful use of observational data.

We won’t be focusing on causal inference from observational studies in this course, but it’s an active
area of statistical research and is commonly used in the social sciences.

Causal Relationships
In short, observational studies are good at showing that a relationship exists. It is difficult to use data
from observational studies to show why a relationship exists.

Section 1.5 Experiments


Establishing cause-and-effect relationships is central to science. But it is difficult to establish cause-and-
effect relationships with observational studies (using the tools we learn in this course) because of the
possible confounding variables influencing the associations among collected data.
Randomized controlled experiments can demonstrate a causal connection between variables. Instead
of gathering data from simple observations (as in observational studies), researchers randomly assign
treatments to different cases and observe the treatment’s effect.

Example: Do Antidepressants Help Curb Nicotine Addiction?


Recent studies suggest that vaping, while it does not involve the same harmful chemicals contained in
many cigarettes, causes users to consume much higher dosages of nicotine. Suppose you have 100
college students who would like to quit vaping, and you suspect that administering an antidepressant
daily might lessen withdrawal symptoms and help them to quit. You could prescribe them a daily
dosage of the antidepressant and record the percentage of the sample who relapse back into old
habits within a month of starting the program. Of course, to determine whether the antidepressants
help individuals stop vaping, you would need to compare your results to the percentage of cases where
the students relapsed without taking antidepressants. It might be wise to split the 100 students into
two groups: those who are administered the antidepressant and those who are not. This separation
would be considered a control in the experimental design and represents one of the four main
principles upon which most experiments are conducted.

11 Walboomers, J.M.M., Jacobs, M.V., Manos, M.M., Bosch, F.X., Kummer, J.A., Shah, K.V., Snijders, P.J.F., Peto, J., Meijer, C.J.L.M. and Muñoz, N. (1999), Human papillomavirus is a necessary cause of invasive cervical cancer worldwide. J. Pathol., 189: 12-19. doi:10.1002/(SICI)1096-9896(199909)189:1<12::AID-PATH431>3.0.CO;2-F
12 Parascandola, M., Weed, D.L. & Dasgupta, A. Two Surgeon General's reports on smoking and cancer: a historical investigation of the practice of causal inference. Emerg Themes Epidemiol 3, 1 (2006). https://doi.org/10.1186/1742-7622-3-1
The Four Principles of Experimental Design
1. Controlling: It’s generally important to have a comparison group in a study in order to monitor
how the treatment group performs relative to something you already understand. This control
group lets you understand what the effect of the treatment is: do fewer students given the
antidepressant relapse compared to students who weren’t given the antidepressant?

The control group is also designed to reduce or eliminate the effects of any other variables that
might influence the result. In the dolphin study on page 1, the control group was still brought to
the island near Honduras so the researchers could say that it was the dolphins, not the tropical
environment, that caused the difference in symptom improvement. In our vaping example, how
might we use controlling to make sure it’s the antidepressants that cause lower relapse rates?

It is common for both researcher and subject expectations to influence the results of an
experiment. How might this appear in our example?

To control for this effect, researchers could give the control group a placebo, an inert pill that
does not contain antidepressant medication. Since the students in both the treatment group
and the control group would believe they might be receiving medication, the placebo
effect would be spread evenly among the groups. Studies where study participants do not know
which group they are in are called blind.
To control for researcher expectations, studies are sometimes conducted double-blind. In the
context of this example, the medical professional evaluating student responses would also not
know who was in the control group and who was in the treatment group.

2. Randomization: Part of the reason it’s hard to make causal conclusions from observational
studies is that it’s hard to account for every possible confounding variable. Researchers
randomize the assignment of treatments to each of the cases to account for confounding
variables that cannot be controlled (or that they don’t know they should control). For example,
some students might be more likely to relapse into vaping because individuals in their friend
network also vape. Randomizing students into the treatment and control groups tends to even
out these differences and produce groups that are comparable. (This is what gets us causal
conclusions.)



3. Replication: The more cases the researchers have in their experiment, the more accurately they
can estimate the effect [or lack of effect] the explanatory variable has on the response. Our
suggested study involves 100 college students evenly split into a control group of size n = 50 and
a treatment group of size n = 50. Having a sample size of 200 college students [more replicates]
with 100 students in each group would provide more accurate estimates. Additionally, scientists also
replicate an entire study to verify earlier findings.

4. Blocking: Earlier, we mentioned that one variable that might influence whether our college
students relapse in vaping behavior was whether individuals in their friend network also vape,
and that we could randomize students’ assignments into the control and treatment groups so
that this confounding variable would hopefully be evened out in the end. Blocking tries to
address this same problem through the opposite means. Instead of randomizing students
directly into the control and treatment groups, we might first block them into two categories –
those with friends who vape and those without – and then randomly assign students in each of
these categories to the control and treatment groups. Since we think that having friends who
vape could impact whether or not a student relapses, we want to make sure that there are
approximately equal numbers of students with and without friends who vape in the control and
treatment groups. This blocking will help us make sure this variable is not a confounder!
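The blocked randomization described above can be sketched in code. This is a minimal illustration (in Python; the course itself uses R, and the roster, the `friends_vape` flag, and the 40/60 split are all made-up assumptions, not data from the study):

```python
import random

random.seed(2024)  # fixed seed so the assignment below is reproducible

# Hypothetical roster: 100 students; suppose 40 have friends who vape
students = [{"id": i, "friends_vape": i < 40} for i in range(100)]

def block_randomize(roster, block_key):
    """Within each block, shuffle the members and split them evenly
    into treatment and control groups."""
    assignment = {}
    blocks = {}
    for s in roster:
        blocks.setdefault(s[block_key], []).append(s)
    for members in blocks.values():
        random.shuffle(members)          # random order within the block
        half = len(members) // 2
        for s in members[:half]:
            assignment[s["id"]] = "treatment"
        for s in members[half:]:
            assignment[s["id"]] = "control"
    return assignment

groups = block_randomize(students, "friends_vape")
```

By construction, each block contributes equally to both groups, so the treatment and control groups end up with approximately equal numbers of students with and without friends who vape.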

Note: Randomized controlled experiments, while not perfect, are the “gold standard” for causal
inference.

Aside: Designing new kinds of experiments is an active research area in statistics, because scientists
are asking more questions that traditional experiments can’t answer well. One
example is the sequential, multiple-assignment randomized trial (SMART), which helps researchers
answer questions about how to develop sequences of treatments which can adapt to patients’
changing needs over time.13 There’s also work being done on randomized trials to help develop mobile
health (“mHealth”) apps.

Example: Dolphin therapy


Refer back to the description of the dolphin therapy study and identify the elements of experimental
design. If any elements appear to be missing, describe briefly how they could be incorporated.

13 Almirall D., Nahum-Shani, I., Sherwood, N.E., Murphy, S.A. (2014). Introduction to SMART Designs for the Development of Adaptive Interventions: With Application to Weight Loss Research. Translational Behavioral Medicine, 4:260-274. DOI: 10.1007/s13142-014-0265-0
Controlling:

Randomization:

Replication:

Blocking:

Sections 1.6 and 1.7 Summarizing Data


We have discussed the difference between types of variables, along with various ideas we should take
into consideration when collecting data. Next, we focus on how to organize and summarize the data so
that we can begin to see the story the data tell.

Parameters and Statistics


How our data are collected determines both the language and the notation for any summary values.
Suppose we were interested in gathering data from the population of University of Michigan students.
Summary values calculated from populations are called parameters. Examples are:
 The average age of all students at University of Michigan
 The proportion of all University of Michigan students who have access to high-speed Internet
Summary values calculated from samples are called statistics. Examples are:
 The average age of Stats 250 students
 The proportion of Stats 250 students who have access to high-speed Internet

We use sample statistics to estimate population parameters.

Example: Resting Pulse Rate


Each year, as part of a statistics education program, school children in Australia participate in the
Census at School program by filling out a questionnaire. On the questionnaire, one of the questions
asks: “What is your resting pulse rate?” A sample of 200 students reported an average resting pulse
rate of 75.4 beats per minute. Based on this information,
a. What is the population of interest?

b. What is the parameter of interest?

c. What is the sample?

d. What is the statistic?



Summarizing Numerical Data
The most typical numerical variables represent measurements or counts. For example, let’s consider
two variables from the Census at School program:
variable description
travelTime Time (in minutes) to travel to school by car
texts Number of text messages sent yesterday

I took a random sample of 20 students who were seniors in high school in Michigan. Let’s take a look at
the values for the travelTime and texts variables for these 20 students.
Student   travelTime   texts      Student   travelTime   texts
   1           6        150         11          13         45
   2          20          8         12           3         20
   3          10          7         13          15         20
   4           3          0         14          30         30
   5           5         30         15           3         20
   6          17         30         16           7         25
   7          10         40         17          10         30
   8          12         54         18           6        200
   9          10          4         19           7        150
  10           3         20         20          15          5
It’s difficult to look at a set of data, even one this small, to figure out the story the data set is telling us.

Our first step is to make histograms for each of the variables.



We will describe these histograms later in the chapter. For now, let’s determine how we can
summarize the data.

Describing the Central Tendency of a Data Set


The first step in summarizing a set of numerical data is to describe the “typical” values for the data.
Measures of central tendency acknowledge that each observation of a numerical variable might be
different, but that they also have some “center” that represents them all together. There are two basic
measures of central tendency:
 Mean -- The sample mean of a numerical variable is the sum of all of the observations divided
by the number of observations: the numerical average value
x̄ = (x₁ + x₂ + … + xₙ)/n = (1/n) Σᵢ₌₁ⁿ xᵢ
where x₁, x₂, …, xₙ represent the n observed values. We read the symbol x̄ as “x bar.”
 Median -- the middle value when the data are arranged from smallest to largest. When there is
an even number of data points, the median is the average of the two values in the middle.

Example: Travel Time


Consider the travel time by car (in minutes) for our 20 high school seniors in Michigan. Here are the
data values:
6, 20, 10, 3, 5, 17, 10, 12, 10, 3, 13, 3, 15, 30, 3, 7, 10, 6, 7, 15
a. Compute the mean travel time.
x̄ = (x₁ + x₂ + ⋯ + x₂₀)/n = (6 + 20 + … + 15)/20 = 205/20 = 10.25 minutes
Note: Don’t round the mean to a whole number: it does not represent a possible value of the variable, so it
should stay as a decimal number.

b. Compute the median travel time.


Note: To compute the median, first put the data values in increasing order. Here are the
ordered data values:
3, 3, 3, 3, 5, 6, 6, 7, 7, 10, 10, 10, 10, 12, 13, 15, 15, 17, 20, 30
Since there are 20 values, the median is the average of the 10th and 11th values:
(10 + 10)/2 = 20/2 = 10 minutes
Note: As with the mean, the median can take on a value other than what is possible for the variable: we
shouldn’t round it to a whole number.

c. What if the data value for the student who travels 30 minutes to school was incorrectly entered
as 60 minutes instead of 30 minutes? How would the mean change? The median?
Mean: the numerator in the calculation of the mean would be 235, so the mean would be
x̄ = 235/20 = 11.75 minutes
Median: since the middle two values are the same as they were without the data entry error,
the median is still 10 minutes

Notes: The mean is sensitive to extreme observations (can change dramatically due to a few extreme
observations). The median is resistant/robust to extreme observations.

Describing Variability
Midterm exams are returned, and the “average” was reported as 76 points out of 100 points. You
received a score of 88 points. Under which of the following scenarios did you perform better?

Often what is missing when the central tendency of something is reported is a corresponding measure
of variability or “spread” that describes how tightly or loosely the observations in the data set are
clustered around that measure of central tendency. A measure of variability is perhaps the most
important quantity in statistical analysis. Here we discuss several measures of variation, each useful in
some situations, each with some limitations.

Note: We will stay away from the word “spread” when we talk about variability in class because the
word has many meanings, some of which cause confusion for students learning statistics. 14 The authors
(Kaplan, Rogness, and Fisher) have an excellent example of the issues with the word “spread”:
Said the statistician from west Texas…
Howdy! Welcome to my spread and make yourself at home. Go ahead and spread out your papers and
help yourself to all the food – it's quite a spread, huh? Be sure to try some of that blueberry spread;
just spread a little on a cracker. Yum! And take a look at that fancy tablecloth; my grandma made
that spread for me. Once you're all settled in we'll open up that spreadsheet and see if we can't figure
out the spread of those data.
Only the last instance of the word “spread” is actually related to the idea of variability.

14 Kaplan, J.J., Rogness, N.T. and Fisher, D.G. (2012), Lexical ambiguity: making a case against spread. Teaching Statistics, 34: 56-60. doi:10.1111/j.1467-9639.2011.00477.x
Range and IQR
One way to describe the variability in travel times would be to compute the range. Statisticians tend to
think of the range as the difference between the maximum and minimum values:
Range = maximum – minimum

Calculate the range for the travel times to school for our 20 Michigan high school seniors.
Range = maximum – minimum = 30 – 3 = 27 minutes
The range is easy to calculate, but there’s a catch… Since the range consists of the minimum and
maximum values, it is impacted greatly by extreme points in the data set.

What would happen to the range if the data value for the student who travels 30 minutes to school
was incorrectly entered as 60 minutes instead of 30 minutes?

The range only uses two observations to describe the variation in an entire data set, regardless of the
sample size. There are obviously situations where it will not do a particularly good job, especially when
the maximum or minimum observed values are extreme relative to the other data.

Another measure of variation, called the interquartile range (IQR), tries to address this issue. To
understand how the IQR works, we must first introduce the idea of percentiles.

Percentiles
The pth percentile is the value such that p% of the observations fall at or below that value.

Some common percentiles are the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the
75th percentile). The median is the second quartile (the 50th percentile), but we typically don’t call it
anything other than the median.

The interquartile range (IQR) is calculated as


IQR = Q3 – Q1

Whereas the range describes the variability for 100% of the data, the IQR describes the variability of
the middle 50% of the data.

Example: Travel Time


We’ve already found the 50th percentile for the travel time variable in our data set. Now find the 25th
and 75th percentiles. Afterward, compute the corresponding IQR. Here again are the ordered data
values:
3, 3, 3, 3, 5, 6, 6, 7, 7, 10, 10, 10, 10, 12, 13, 15, 15, 17, 20, 30



What would happen to the IQR if the data value for the student who travels 30 minutes to school
was incorrectly entered as 60 minutes instead of 30 minutes?
Note: There are a few different ways to calculate the quartiles for a data set. As a result, the values we
calculate for the quartiles (and, thus, the IQR) may be different if we are using technology instead of
calculating the values by hand. For example, here’s a summary of the travelTime variable from R:
summary(travelTime)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.00 5.75 10.00 10.25 13.50 30.00
Don’t worry about these differences—they’re small enough that we still get the same sense of our
data.
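For reference, the quartiles R reports can be reproduced outside of R as well. The sketch below uses Python’s statistics module for illustration; method="inclusive" happens to match R’s default quartile rule for this data set:

```python
from statistics import quantiles

# Ordered travel times for the 20 students
travel = sorted([6, 20, 10, 3, 5, 17, 10, 12, 10, 3, 13, 3, 15, 30, 3, 7, 10, 6, 7, 15])

# n=4 asks for quartile cut points; "inclusive" interpolates like R's default
q1, q2, q3 = quantiles(travel, n=4, method="inclusive")
print(q1, q2, q3)          # 5.75 10.0 13.5 -- same as the R summary above
print("IQR =", q3 - q1)    # IQR = 7.75
```

Switching the method (or computing quartiles by hand) gives Q1 = 5.5 and Q3 = 14 instead, which is exactly the kind of small difference mentioned above.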

Five-Number Summary
Sometimes we will choose to summarize our data in a five-number summary, which includes the
minimum, Q1, the median, Q3, and the maximum. Together these five values give a quick overview of
the data set. It’s helpful to display the five-number summary in a table. Here is the five-number
summary for the travelTime variable, which we can pull directly from the R output (or we can use
the values we calculated by hand):

Minimum Q1 Median Q3 Maximum


3 5.75 (or 5.5) 10 13.5 (or 14) 30

Even though the IQR is not sensitive to extreme values, it still suffers from the same problem as the
range: it only uses two of the data points from the data set. We want all of our data to contribute to
the measure of variability, not just the two data points that happen to be at the 25th and 75th
percentiles.

Variance and Standard Deviation


What is the best way to involve every single data point in our data set in a calculation of the variation
of that data set? When the mean is used as the measure of center for a data set, we use the standard
deviation as the measure of variability in the data set.

It might be helpful if we define a deviation before we talk about the standard deviation. A deviation
tells us how much an observation departs from the mean:
deviation = observation − mean = x − x̄
It seems like it would make sense for us to talk about the average deviation, but that’s actually
problematic. Why?

As a workaround, we commonly square the deviations before adding them all up. This is, somewhat
unimaginatively, called the sum of squares, and will always be positive. Problematically, the sum of
squares will increase with every additional observation.

A simple solution would be to take the sum of squares and divide it by the number of observations to
find a mean squared deviation. This works well if we are dealing with population data, but consistently
underestimates the population variance when we use sample data.
Because we most often deal with sample data, when calculating the mean squared deviation we
instead divide the sum of squares by n−1. This is called the variance of the data. If we take the square
root of the variance, we call that the standard deviation. In this course, we emphasize the standard
deviation over the variance since the standard deviation is in the original units of the data (making it
much friendlier to work with and understand).

Here are the formulas for variance and standard deviation:


Variance: s² = (sum of squared deviations)/(n − 1) = Σ(x − x̄)²/(n − 1)

Standard deviation (SD): s = +√variance = +√[ Σ(x − x̄)²/(n − 1) ]

Example: Travel Time


Consider again the travel times to school for our 20 Michigan high school seniors. We found the mean
travel time earlier: x̄ = 10.25 minutes.
Student   Travel time, x   Deviation, (x − x̄)      Student   Travel time, x   Deviation, (x − x̄)
   1            6                                      11           13               2.75
   2           20                                      12            3              –7.25
   3           10             –0.25                    13           15               4.75
   4            3             –7.25                    14           30              19.75
   5            5             –5.25                    15            3              –7.25
   6           17              6.75                    16            7              –3.25
   7           10             –0.25                    17           10              –0.25
   8           12              1.75                    18            6              –4.25
   9           10             –0.25                    19            7              –3.25
  10            3             –7.25                    20           15               4.75



Calculate the standard deviation for these data.

Calculating the range by hand is trivial; calculating the IQR by hand takes a little more work; calculating
the standard deviation by hand is cruel, especially for a large data set. We will never ask you to
calculate the standard deviation by hand! Here’s the one line we need to get the standard deviation of
the travel times for our 20 students:
sd(travelTime)
[1] 6.812334
We won’t focus on a precisely worded interpretation of the standard deviation. It’s more important to
us that you understand what the standard deviation is. The standard deviation is our measure of
variability (the “wiggle room”) we use when we talk about how data are distributed. In this example,
we might say something like, “Travel times for our 20 high school seniors in Michigan average 10.25
minutes give or take about 6.81 minutes.”
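If you’re curious what that one line of R is doing under the hood, the formula above can be sketched step by step (here in Python for illustration): sum the squared deviations, divide by n − 1, and take the square root.

```python
from math import sqrt
from statistics import stdev

# Travel times (minutes) for the 20 students
travel = [6, 20, 10, 3, 5, 17, 10, 12, 10, 3, 13, 3, 15, 30, 3, 7, 10, 6, 7, 15]

n = len(travel)
xbar = sum(travel) / n                          # 10.25, the sample mean
ss = sum((x - xbar) ** 2 for x in travel)       # sum of squared deviations
s = sqrt(ss / (n - 1))                          # divide by n - 1, then take the root

print(round(s, 6))                              # 6.812334 -- matches sd(travelTime) in R
print(round(stdev(travel), 6) == round(s, 6))   # the library's stdev agrees
```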

A few notes about the standard deviation:


 The standard deviation is the square root of the variance and describes how close the data are
to the mean using the units in which the data are recorded.

 s = 0 means…

 Like the mean, s is sensitive to extreme observations.

Relationship Between the Center and the Variability of a Distribution


The median and IQR are both quite resistant to extreme values in a distribution since they both
concentrate on the center of the distribution and ignore anything in the tails. The mean and the
standard deviation are not resistant to extreme values.

As with food and fine wines, there are natural pairings between measures of center and variability:
 Median and IQR
 Mean and standard deviation



(More on this when we look at the overall shapes of distributions.)

Notation for Parameters vs. Statistics


We use different notation to indicate whether our summary values are calculated from population data
and are therefore parameters or if they are from sample data and are statistics. It will be helpful for
you to notice and pay attention to the notation!

Measure               Parameter notation   Statistic notation
mean                  μ                    x̄
variance              σ²                   s²
standard deviation    σ                    s



Example
At the end of each semester, you have a chance to evaluate your courses and your instructors with
Teaching Evaluations. A few weeks after grades are posted, instructors get to see their results. The
dotplots below show student ratings (on a scale of 1–5) for the question “Overall the instructor was an
excellent teacher” for four hypothetical professors (professors A–D).

a. What are the observational units (cases) in this scenario? What is the outcome variable?

b. Arrange these professors in order from smallest to largest standard deviation of their ratings.

Example
The table below asks you to compare the standard deviations of two data sets. Without doing any
calculations, choose one of the four statements below to describe the relationship between the data
sets compared.
i. The quantity in column A is greater.
ii. The quantity in column B is greater.
iii. The two quantities are equal.
iv. The relationship cannot be determined from the given information.

Statement Column A Column B


__________ The standard deviation of {0.2, 0.4, 0.6, 0.8}    The standard deviation of {2, 4, 6, 8}

__________ The standard deviation of {1, 3, 5, 7, 9}         The standard deviation of {3, 5, 7, 9, 11}

__________ The standard deviation of {1, 3, 5, 7, 9}         The standard deviation of {1, 3, 5, 7, 9, 9}

__________ The standard deviation of {1, 3, 5, 7, 9}         The standard deviation of {1, 3, 5, 5, 7, 9}

Graphical Summaries of Numerical Data
We’ve now looked at two primary measures of central tendency {mean, median} and three primary
measures of variability {range, interquartile range, standard deviation}. These numerical summaries
are helpful in describing the characteristics of a data set, but they pale in comparison to the summaries
we’ll discuss next – graphical displays!

Histograms and Dot Plots


The most basic graphical display of numerical data is the dot plot, which
represents each observation in a data set using a single dot plotted along
the x-axis. To the right is a dot plot of our travel time data.

Although dot plots do a good job displaying the observed travel


times (they show the exact value for each student), you can
imagine that they would start to get pretty cluttered if we
attempted to display every single observation in a larger data
set. Recall our ames50 data set about a random sample of 50
single-family homes that sold between 2006 and 2010 in Ames,
Iowa. Many other variables were recorded for each home. To
the right is a dot plot corresponding to the reported total
number of rooms for our n = 50 homes.

While dot plots are simple, they are not good solutions when we
have either a large amount of discrete data or small samples of
continuous variables. Look at this dot plot of the total rooms in
a larger sample of 2,930 homes in Ames (a large number of
discrete observations). This is not a particularly useful plot! We
have so many homes that it’s not possible to fit all the dots in
the plot (there are 844 homes with 6 rooms in this larger
sample, and only 1 home with 2 rooms). The scale of the y-axis
is very limiting in this case.



Now let’s look at another messy dot plot, this time of a small sample of continuous observations:

In this plot, there are only two stacked dots because all of the living areas (except two) take on unique
values. Looking at this plot might give us some sense of the variability of the data (notice that cluster of
dots around 1600 ft² and how the dots get more spaced out towards the edges of the plot), but we can
do better.

Let’s look at the values of livingArea for our random sample of 50 homes. The living areas (in square
feet) for these homes (in increasing order) are:
864 1078 1341 1535 1629 1698 1752 1947 2110 2599
882 1097 1374 1564 1655 1704 1804 1960 2110 2622
896 1187 1430 1566 1656 1720 1822 2035 2250 2696
1004 1324 1465 1595 1659 1733 1839 2073 2270 3238
1056 1329 1468 1604 1674 1744 1845 2084 2475 3279

How might we go about creating a graphical summary of the living area of the homes? These
measurements do vary. How do they vary? What is the range of values? What is the pattern of
variation?

The following are a frequency table and a histogram for these data:

Summary Table
Class Interval   Frequency (or count)
 500 – 1000              3
1000 – 1500             12
1500 – 2000             22
2000 – 2500              8
2500 – 3000              3
3000 – 3500              2
Total                   50

Note: each bar (or “bin”) represents a class, and the base of the bar covers the class. The table and
histogram above show the distribution of the numerical variable livingArea; that is, they provide
information about the values that were observed in the sample and the overall pattern of how often
the possible values occur.
How to Interpret Histograms
How can we summarize what a histogram is telling us about the distribution of our quantitative
variable?
When examining the distribution of data, we need to describe four aspects:
1. Shape:
o Modes: A distribution is unimodal if it has one mode, bimodal if it has two modes, and
multimodal if it has three or more modes.
o Symmetry:
 A distribution is symmetric if it appears to be mirrored at its midpoint.
 A distribution is right-skewed (positively skewed) if lower values of the variable are
more common with fewer and fewer observations having larger values of the variable.
The right “tail” of the distribution is longer than the left tail.
 A distribution is left-skewed (negatively skewed) if higher values of the variable are
more common with fewer and fewer observations having smaller values of the variable.
The left “tail” of the distribution is longer than the right tail.
2. Center (e.g., mean or median)
3. Variability (e.g., standard deviation or IQR)
4. Outliers (unusual observations) An outlier is a data point that does not seem to be consistent
with the bulk of the data. Outliers should not be discarded without justification. Good data
measuring and recording processes are important here.

Robust Statistics
Consider the summary statistics we’ve explored so far this chapter: mean, median, standard deviation,
range, IQR. The median and IQR are statistics that are robust to outliers, while the mean, range, and
standard deviation are sensitive to outliers.

Consider the following histograms. For each histogram, determine whether you would expect the
mean and median values to be approximately equal, for the mean > median, or for the mean < median.



The bottom line: When analyzing histograms remember that the mean follows the skew of the
histogram. You can remember this if you remember what one of my friends in graduate school said,
“The alligator has a mean tail, so the mean is in the tail.”
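The rule of thumb is easy to check with a toy data set (hypothetical numbers, sketched here in Python): a long right tail pulls the mean above the median.

```python
from statistics import mean, median

# A small, made-up right-skewed data set: most values low, a long right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]

print(mean(right_skewed), median(right_skewed))   # 5 3 -- the mean is pulled into the tail
```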

Example: Describing the Shape of Data with Histograms


Here are histograms for eight fictitious data sets 15 (A through H). Use these histograms to answer the
questions that follow. Additionally, state whether the mean is likely to be smaller than the median,
about equal to the median, or larger than the median.
a. Which histograms are skewed to the left?

b. Which histograms are skewed to the right?

c. Which histograms are approximately symmetric?

d. Which histograms are approximately symmetric bell-shaped?

Relationship between the Shape, Center, and Variability of a Distribution


When should you use the mean and standard deviation versus using the median and IQR?
 Median and IQR
Use the median and IQR when the distribution is skewed (either to the left or the right). These
measures are resistant (robust) to the few extreme values in skewed distributions and will
better represent the center and variability.

 Mean and standard deviation


Use the mean and standard deviation when the distribution is roughly symmetric. While you
could use the median and IQR for symmetric distributions, the mean and standard deviation
have some nice properties that make them preferable.

15 From Lock, Lock, Lock Morgan, Lock, and Lock’s Statistics: Unlocking the Power of Data (2nd edition), John Wiley & Sons, 2017.
Example
Use the following histogram and summary statistics to describe the gross sales (in millions, USD) for
the 50 top-ranked movies in 2019.

Box Plots
In general, a box plot is a data visualization of the five-number summary. A box plot consists of a
box (shocking, we know), a line marking the location of the median, and “whiskers.” Box plots can be
vertical or horizontal (the default in R produces a vertical box plot):



Anatomy of a box plot
 The heavy line near the middle of the box in a box plot indicates the median.
 The lighter lines at the bottom and top of the box represent Q1 and Q3, respectively.
 In a standard box plot (left), the lower whisker goes all the way down to the minimum value,
and the upper whisker goes all the way up to the maximum value.
 In a modified box plot (right, default in R), potential outliers are identified and shown as points
separate from the whiskers. Potential outliers lie more than 1.5×IQR below Q1 or more than
1.5×IQR above Q3. When there are potential outliers, the whiskers are drawn to the most
extreme point that is not a potential outlier. (For the travel time data, the upper whisker is
drawn to 20 minutes, and the maximum travel time of 30 minutes is flagged as a potential
outlier.)
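The 1.5×IQR rule is easy to compute directly for the travel time data (a sketch in Python for illustration; R’s boxplot applies the same fences):

```python
from statistics import quantiles

# Ordered travel times for the 20 students
travel = sorted([6, 20, 10, 3, 5, 17, 10, 12, 10, 3, 13, 3, 15, 30, 3, 7, 10, 6, 7, 15])

q1, _, q3 = quantiles(travel, n=4, method="inclusive")   # 5.75 and 13.5, as in R
iqr = q3 - q1                                            # 7.75
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # -5.875 and 25.125

outliers = [x for x in travel if x < lo_fence or x > hi_fence]
whisker_hi = max(x for x in travel if x <= hi_fence)

print(outliers)     # [30] -- flagged as a potential outlier
print(whisker_hi)   # 20  -- the upper whisker stops here
```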

Notes:
 Sometimes points flagged as potential outliers in a box plot are not really outliers. When the
distribution of the data is skewed one direction or the other, points flagged as potential outliers
may really just be part of the long tail. Be sure to examine a histogram in addition to the box
plot before declaring the potential outliers to be outliers.
 Be careful when using a box plot to determine the shape of a distribution. Box plots tell us
nothing about the number of modes in a distribution. Here are two identical box plots with
identical 5-number summaries (min = 0, Q1 = 25, Q2 = 50, Q3 = 75, max = 100), but very
different shapes!



Example
Match the histograms below with their respective box plots.16

Relationships Between Categorical Variables


So far, we’ve explored various numerical and graphical ways to summarize numerical data in terms of
central tendency and variation. We now turn to categorical data, which can also be organized and
analyzed. Remember that our methods will need to be slightly different here, because categorical
variables – by their nature – cannot be summed or averaged.

A recent study examines the relationship between class start times, sleep, circadian preference,
alcohol use, academic performance, and other variables in college students. The data were obtained

16 Exercise 1.37, ISRS
from a sample of n=253 students who completed skills tests to measure cognitive function, completed
a survey that asked many questions about attitudes and habits, and kept a sleep diary to record time
and quality of sleep over a two-week period. Below are some of the recorded variables.
Variable Coding
gender 0 = female, 1 = male
classYear Year in school, 1 = first year, …, 4 = senior
larkOwl Early riser or night owl? Lark, Neither, Owl
classesMissed Number of classes missed in a semester
earlyClass 0 = no early class, 1 = early class
anxietyScore Measure of amount of anxiety

Does the proportion of students who have an early class differ across the class year for these students?
This question asks about the association/relationship between two categorical variables. To answer it,
we’ll need a contingency table, where the categories for one variable are listed across the rows and
the categories for the second variable are listed across the columns. Each cell of the table contains the
(relative) frequency of cases that are in the joint categories of a given row and column.

Consider the two-way table below, which displays the relationship between earlyClass and classYear.

                          Class Year
                First year  Sophomore  Junior  Senior  Total
Early    Yes        39          64       33      32     168
Class?   No          8          31       21      25      85
Total               47          95       54      57     253

a. What proportion of the students surveyed have an early class?

b. What proportion of first-year students have an early class?

c. What proportion of sophomores have an early class?

d. What proportion of students with an early class are sophomores?

Note that the solutions to (c) and (d) are talking about the same 64 students! The questions “what
proportion of sophomores have an early class?” and “what proportion of students with an early class
are sophomores?” sound similar but are asking different questions.17
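The four proportions come straight from the table's counts; here is a quick sketch of the arithmetic (in Python rather than R, with variable names of our own choosing):

```python
# Counts from the two-way table of earlyClass by classYear.
table = {
    "Yes": {"First year": 39, "Sophomore": 64, "Junior": 33, "Senior": 32},
    "No":  {"First year": 8,  "Sophomore": 31, "Junior": 21, "Senior": 25},
}

total = sum(sum(row.values()) for row in table.values())            # 253 students
early_total = sum(table["Yes"].values())                            # 168 with an early class
soph_total = table["Yes"]["Sophomore"] + table["No"]["Sophomore"]   # 95 sophomores

p_early = early_total / total                                       # (a) 168/253
p_early_given_firstyear = table["Yes"]["First year"] / (39 + 8)     # (b) 39/47
p_early_given_soph = table["Yes"]["Sophomore"] / soph_total         # (c) 64/95
p_soph_given_early = table["Yes"]["Sophomore"] / early_total        # (d) 64/168
```

Notice that (c) and (d) share the numerator 64 but use different denominators: (c) divides by the 95 sophomores, while (d) divides by the 168 students with an early class.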

17 One way I remember this: The proportion of U.S. Senators who identify as female and the proportion of people who identify as female who are U.S. Senators are clearly not the same! The first is upsettingly small (27%), the second is understandably very small (27 out of about 167 million).
Bar plots (or Bar Graphs or Bar Charts)
Several types of graphs can display the relationship between two categorical variables. Two common choices are bar plots (segmented or side-by-side) and mosaic plots. We will not discuss mosaic plots in Stats 250, but feel free to read about them in the book. Let's consider both a segmented (or stacked) bar plot and a side-by-side bar plot for displaying the data in our two-way table.

Consider each of the following questions. After answering each, consider which graphical display does
a better job at helping you toward your answer.
a. Were more juniors or more seniors present in the sample of n=253 students? Which plot does a
better job of showing this?

b. Do a greater number of first-year or sophomore students have early classes at least once per week?
Which graph is more helpful to answer this question?

c. Do a greater percentage of first-year or sophomore students have early classes at least once per
week? Which graph is more helpful to answer this question?

d. Which of these two plots provides a better way of analyzing the relationship between earlyClass
and classYear? Explain.
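For reference, here is a sketch of how the two plots could be produced from the table's counts. (The course's plots are made in R; this sketch uses Python with matplotlib, and the file name is our own choice.)

```python
# Sketch: segmented (stacked) vs. side-by-side bar plots for the
# earlyClass-by-classYear counts. Illustrative only; the course uses R.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders to a file
import matplotlib.pyplot as plt
import numpy as np

years = ["First year", "Sophomore", "Junior", "Senior"]
early_yes = np.array([39, 64, 33, 32])
early_no = np.array([8, 31, 21, 25])
x = np.arange(len(years))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Segmented (stacked): one bar per class year, split by early-class status
ax1.bar(x, early_yes, label="Early class")
ax1.bar(x, early_no, bottom=early_yes, label="No early class")
ax1.set_title("Segmented (stacked)")

# Side-by-side: two bars per class year
w = 0.35
ax2.bar(x - w / 2, early_yes, width=w, label="Early class")
ax2.bar(x + w / 2, early_no, width=w, label="No early class")
ax2.set_title("Side-by-side")

for ax in (ax1, ax2):
    ax.set_xticks(x)
    ax.set_xticklabels(years)
    ax.legend()

fig.savefig("earlyclass_barplots.png")
```

To compare percentages rather than counts (as in question (c)), you would divide each pair of counts by its class-year total before plotting, so every stacked bar has height 1.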



Wait, What about Pie Charts?
It turns out that pie charts are controversial. Generally, statisticians don't like them, since they can make it hard to get a good sense of how the sizes of different groups compare. This is especially true when categories have nearly identical counts or proportions. Take a look at the pie chart of class year below, next to a bar chart of the same data. Which do you think conveys the distribution of class year better? Does the pie chart convey that there are slightly fewer juniors than seniors in the sample, for instance?

