Professional Documents
Culture Documents
STATISTICS
THE ART AND SCIENCE OF LEARNING FROM DATA
FIFTH EDITION
Alan Agresti
University of Florida
Christine Franklin
University of Georgia
Bernhard Klingenberg
Williams College and New College of Florida
The rights of Alan Agresti, Christine A. Franklin, and Bernhard Klingenberg to be identified as the authors of this work have
been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Authorized adaptation from the United States edition, entitled Statistics: The Art and Science of Learning from Data, 5th
Edition, ISBN 978-0-13-646876-9 by Alan Agresti, Christine A. Franklin, and Bernhard Klingenberg published by Pearson
Education © 2021.
PEARSON, ALWAYS LEARNING, and MYLAB are exclusive trademarks owned by Pearson Education, Inc. or its
affiliates in the U.S. and/or other countries.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the
publisher or a license permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd,
Saffron House, 6–10 Kirby Street, London EC1N 8TS. For information regarding permissions, request forms and the
appropriate contacts within the Pearson Education Global Rights & Permissions department, please visit
www.pearsoned.com/permissions/.
All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in
the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any
affiliation with or endorsement of this book by such owners. For information regarding permissions, request forms, and the
appropriate contacts within the Pearson Education Global Rights and Permissions department, please visit
www.pearsoned.com/permissions/.
Attributions of third-party content appear on page 921, which constitutes an extension of this
copyright page.
This eBook is a standalone product and may or may not include all assets that were part of the print version. It also does not
provide access to other Pearson digital products like MyLab and Mastering. The publisher reserves the right to remove any
material in this eBook at any time.
Preface 10
1.3 Organizing Data, Statistical Software, and the New Field 3.2 The Relationship Between Two Quantitative
of Data Science 42 Variables 146
Part Two
P
robability, Probability Distributions,
and Sampling Distributions
Chapter 5 Probability in Our Daily 5.3 Conditional Probability 275
Part Four
Extended Statistical Methods and Models for
Analyzing Categorical and Quantitative Variables
Chapter 11 Categorical Data Chapter Summary 643
11.4 Using Residuals to Reveal the Pattern of 12.2 Inference About Model Parameters and the
Association 631 Relationship 663
11.5 Fisher’s Exact and Permutation Tests 635 12.3 Describing the Strength of the Relationship 670
12.4 How the Data Vary Around the Regression Line 681 Chapter 14 Comparing Groups: Analysis
12.5 Exponential Regression: A Model for Nonlinearity 693 of Variance Methods 760
Chapter Summary 698
14.1 One-Way ANOVA: Comparing Several Means 761
Chapter Exercises 701
14.2 Estimating Differences in Groups for a Single
Factor 772
Chapter 13 Multiple Regression 707 14.3 Two-Way ANOVA: Exploring Two Factors and Their
13.1 Using Several Variables to Predict a Response 708 Interaction 781
2
13.2 Extending the Correlation Coefficient and R for Multiple Chapter Summary 795
Regression 714 Chapter Exercises 797
13.3 Inferences Using Multiple Regression 720
13.4 Checking a Regression Model Using Residual Chapter 15 Nonparametric Statistics 804
Plots 731
15.1 Compare Two Groups by Ranking 805
13.5 Regression and Categorical Predictors 737
15.2 Nonparametric Methods for Several Groups and for
13.6 Modeling a Categorical Response: Logistic Dependent Samples 816
Regression 743
Chapter Summary 827
Chapter Summary 752
Chapter Exercises 829
Chapter Exercises 754
Appendix A-833
Answers A-839
Index I-865
Index of Applications I-872
Credits C-877
• The Times Series app plots a simple time series and can
add a smooth or linear trend.
• The Random Numbers app generates random numbers
(with or without replacement) and simulates flipping Screenshot from the Normal Distribution app
(potentially biased) coins.
• The three Sampling Distribution apps generate sampling
• The Mean vs. Median app allows users to add or delete distributions of the sample proportion or the sample
points by clicking in a graph and observing the effect of mean (for both continuous and discrete populations).
outliers or skew on these two statistics. Sliders for the sample size or population parameters help
• The Scatterplots & Correlation and the Explore Lin- with discovering and exploring the Central Limit Theo-
ear Regression apps allow users to add or delete points rem interactively. In addition to the uniform, skewed,
from a scatterplot and observe how the correlation co- bell-shaped, or bimodal population distributions, several
efficient or the regression line are affected. The Guess real population data sets (such as the distances traveled
the Correlation app lets users guess the correlation for by all taxis rides in NYC on Dec. 31) are preloaded.
Screenshot from the Sampling Distribution for the Sample Mean app
10
Our Approach
In 2005 (and updated in 2016), the American Statistical Association (ASA) en-
dorsed guidelines and recommendations for the introductory statistics course as
described in the report, “Guidelines for Assessment and Instruction in Statistics
Education (GAISE) for the College Introductory Course” (www.amstat.org/
education/gaise). The report states that the overreaching goal of all introductory
statistics courses is to produce statistically educated students, which means that
students should develop statistical literacy and the ability to think statistically.
The report gives six key recommendations for the college introductory course:
1. Teach statistical thinking.
• Teach statistics as an investigative process of problem solving and decision
making.
• Give students experience with multivariable thinking.
2. Focus on conceptual understanding.
3. Integrate real data with a context and purpose.
4. Foster active learning.
5. Use technology to explore concepts and analyze data.
6. Use assessments to improve and evaluate student learning.
We wholeheartedly support these recommendations and our textbook takes
every opportunity to implement these guidelines.
Student Support
To draw students to important material we highlight key definitions, and provide
summaries in blue boxes throughout the book. In addition, we have four types of
margin notes:
In Words • In Words: This feature explains, in plain language, the definitions and symbolic
To find the 95% confidence notation found in the body of the text (which, for technical accuracy, must be
interval, you take the sample more formal).
proportion and add and subtract • Caution: These margin boxes alert students to areas to which they need to pay
1.96 standard errors. special attention, particularly where they are prone to make mistakes or incor-
rect assumptions.
• Recall: As the student progresses through the book, concepts are presented that
Recall depend on information learned in previous chapters. The Recall margin boxes
From Section 8.2, for using a confidence direct the reader back to a previous presentation in the text to review and rein-
interval to estimate a proportion p, the
force concepts and methods already covered.
standard error of a sample proportion
pn is • Did You Know: These margin boxes provide information that helps with the
pn 11 - pn 2 contextual understanding of the statistical question under consideration or pro-
. b vides additional information.
B n
Sampling distribution of – x
Graphical Approach
(describes where the sample
mean x– is likely to fall when Because many students are visual learners, we have taken extra care to make our
sampling from this population)
figures informative. We’ve annotated many of the figures with labels that clearly
identify the noteworthy aspects of the illustration. Further, many figure captions
include a question (answered in the Chapter Review) designed to challenge the
student to interpret and think about the information being communicated by the
Population distribution
(describes where
graphic. The graphics also feature a pedagogical use of color to help students
observations are likely recognize patterns and distinguish between statistics and parameters. The use of
to fall)
color is explained in the very front of the book for easy reference.
Statistics: In Practice
We realize that there is a difference between proper academic statistics and what
is actually done in practice. Data analysis in practice is an art as well as a science.
Although statistical theory has foundations based on precise assumptions and
conditions, in practice the real world is not so simple. In Practice boxes and text
references alert students to the way statisticians actually analyze data in practice.
These comments are based on our extensive consulting experience and research
and by observing what well-trained statisticians do in practice.
Survey are indicated by a GSS icon. Larger data sets used in examples and exer-
cises are referenced in the text, listed in the back endpapers, and made available
for download on the book’s website ArtofStat.com.
Technology Integration
Web Apps
Screenshots of web apps are shown throughout the text. An overview of the apps
is available on pages vii–ix. At the end of each chapter, in the Statistical Software
section, the functionality of each app used in the chapter is described in detail.
A list of apps by chapter appears on the second to last last page of this book.
R
We now provide R code in the end of chapter section on statistical software. Every
R function used is explained and illustrated with output. In addition, annotated R
scripts can be found on ArtofStat.com under “R Code.” While only base R functions
are used in the book, the scripts available online use functions from the tidyverse.
Data Sets
We use a wealth of real data sets throughout the textbook. These data sets are
available for download at ArtofStat.com. The same data set is often used in several
chapters, helping reinforce the four components of the statistical investigative
process and allowing the students to see the big picture of statistical reasoning.
Many data sets are also preloaded and accessible from within the apps.
StatCrunch®
StatCrunch® is powerful web-based statistical software that allows users to per-
form complex analyses, share data sets, and generate compelling reports of their
data. The vibrant online community offers more than 40,000 data sets for instruc-
tors to use and students to analyze.
• Collect. Users can upload their own data to StatCrunch or search a large library
of publicly shared data sets, spanning almost any topic of interest. Also, an on-
line survey tool allows users to collect data quickly through web-based surveys.
• Crunch. A full range of numerical and graphical methods allows users to analyze
and gain insights from any data set. Interactive graphics help users understand
statistical concepts and are available for export to enrich reports with visual rep-
resentations of data.
• Communicate. Reporting options help users create a wide variety of visually
appealing representations of their data.
Full access to StatCrunch is available with MyLab Statistics, and StatCrunch
is available by itself to qualified adopters. For more information, visit www
.statcrunch.com or contact your Pearson representative.
pearson.com/mylab/statistics
Success
Instructor Resources website in an online course. Fully accessible ver-
sions are also available.
Additional resources can be downloaded
from www.pearson.com. TestGen® (www.pearson.com/testgen) enables
instructors to build, edit, print, and administer
Online Resources Students and instructors have
tests using a computerized bank of questions
a full library of resources, including apps devel-
developed to cover all the objectives of the
oped for in-text activities, data sets and instructor-
text. TestGen is algorithmically based, a llowing
to-instructor videos. Available through MyLab
instructors to create multiple but equivalent
Statistics and at the Pearson Global Editions site,
versions of the same question or test with the
https://media.pearsoncmg.com/intl/ge/abp/
click of a b
utton. Instructors can also modify test
resources/index.html.
bank questions or add new questions.
Updated! Instructor to Instructor Videos provide
The Online Test Bank is a test bank derived
an opportunity for adjuncts, part-timers, TAs,
from TestGen®. It includes multiple choice and
or other instructors who are new to teaching
short answer questions for each section of the
from this text or have limited class prep time to
text, along with the answer keys.
learn about the book’s approach and coverage
from the authors. The videos focus on those
topics that have proven to be most challenging Student Resources
to students and offer suggestions, pointers, and Additional resources to help student success.
ideas about how to present these topics and Online Resources Students and instructors
concepts effectively based on many years of have a full library of resources, including apps
teaching introductory statistics. The videos also developed for in-text activities and data sets
provide insights on how to help students use (Available through MyLab Statistics and the
the textbook in the most effective way to realize Pearson Global Editions site, https://media.
success in the course. The videos are available pearsoncmg.com/intl/ge/abp/resources/index.
for download from Pearson’s online catalog at html).
www.pearson.com and through MyLab Statistics. Study Cards for Statistics Software This series
Instructor’s Solutions Manual, by James Lapp, of study cards, available for Excel®, MINITAB®,
contains fully worked solutions to every text- JMP®, SPSS®, R®, StatCrunch®, and the TI fam-
book exercise. ily of graphing calculators provides students
PowerPoint Lecture Slides are fully editable and with easy, step-by-step guides to the most
printable slides that follow the textbook. These common statistics software. These study cards
slides can be used during lectures or posted to a are available through MyLab Statistics.
Acknowledgments
We appreciate all of the thoughtful contributions and additions to the fourth edi-
tion by Michael Posner Villanova University. We are indebted to the following
individuals, who provided valuable feedback for the fifth edition:
John Bennett, Halifax Community College, NC; Yishi Wang, University of North
Carolina Wilmington, NC; Scott Crawford, University of Wyoming, WY; Amanda
Ellis, University of Kentucky, KY; Lisa Kay, Eastern Kentucky University, KY;
Inyang Inyang, Northern Virginia Community College, VA; Jason Molitierno,
Sacred Heart University, CT; Daniel Turek, Williams College, MA.
We are also indebted to the many reviewers, class testers, and students who
gave us invaluable feedback and advice on how to improve the quality of the
book.
The detailed assessment of the text fell to our accuracy checkers, Nathan
Kidwell and Joan Sanuik.
Thank you to James Lapp, who took on the task of revising the solutions man-
uals to reflect the many changes to the fifth edition.
We would like to thank the Pearson team who has given countless hours in
developing this text; without their guidance and assistance, the text would not
have come to completion. We thank Amanda Brands, Suzanna Bainbridge, Rachel
Reeve, Karen Montgomery, Bob Carroll, Jean Choe, Alicia Wilson, Demetrius
Hall, and Joe Vetere. We also thank Kim Fletcher, Senior Project Manager at
Integra-Chicago, for keeping this book on track throughout production.
Alan Agresti would like to thank those who have helped us in some way, often
by suggesting data sets or examples. These include Anna Gottard, Wolfgang Jank,
René Lee-Pack, Jacalyn Levine, Megan Lewis, Megan Mocko, Dan Nettleton,
Yongyi Min, and Euijung Ryu. Many thanks also to Tom Piazza for his help with
the General Social Survey. Finally, Alan Agresti would like to thank his wife,
Jacki Levine, for her extraordinary support throughout the writing of this book.
Besides putting up with the evenings and weekends he was working on this book,
she offered numerous helpful suggestions for examples and for improving the
writing.
Chris Franklin gives a special thank-you to her husband and sons, Dale, Corey,
and Cody Green. They have patiently sacrificed spending many hours with their
spouse and mom as she has worked on this book through five editions. A special
thank-you also to her parents, Grady and Helen Franklin, and her two brothers,
Grady and Mark, who have always been there for their daughter and sister. Chris
also appreciates the encouragement and support of her colleagues and her many
students who used the book, offering practical suggestions for improvement.
Finally, Chris thanks her coauthors, Alan and Bernhard, for the amazing journey
of writing a textbook together.
Bernhard Klingenberg wants to thank his statistics teachers in Graz, Austria;
Sheffield, UK; and Gainesville, Florida, who showed him all the fascinating as-
pects of statistics throughout his education. Thanks also to the Department of
Mathematics & Statistics at Williams College and the Graduate Program in
Data Science at New College of Florida for being such great places to work.
Finally, thanks to Chris Franklin and Alan Agresti for a wonderful and inspiring
collaboration.
Alan Agresti, Gainesville, Florida
Christine Franklin, Athens, Georgia
Bernhard Klingenberg, Williamstown, Massachusetts
Contributors
■ DELHI Vikas Arora, professional statistician; Abhishek K. Umrawal, Delhi
Reviewers
■ DELHI Amit Kumar Misra, Babasaheb Bhimrao Ambedkar University ■
TURKEY Ümit Işlak, Boǧaziçi Üniversitesi; Özlem Ⅰlk Daǧ, Middle East
Technical University
22
Gathering and
Exploring Data
1
Chapter 1
Statistics: The Art and Science
of Learning From Data
Chapter 2
Exploring Data With Graphs
and Numerical Summaries
Chapter 3
Exploring Relationships
Between Two Variables
Chapter 4
Gathering Data
23
1
1.1 Using Data to
Answer Statistical
1.2
Questions
Sample Versus
Statistics: The Art
1.3
Population
Organizing Data,
and Science of Learning
From Data
Statistical Software,
and the New Field
of Data Science
Example 1
Defining Statistics
You already have a sense of what the word statistics means. You hear statistics
quoted about sports events (number of points scored by each player on a basket-
ball team), statistics about the economy (median income, unemployment rate),
and statistics about opinions, beliefs, and behaviors (percentage of students who
indulge in binge drinking). In this sense, a statistic is merely a number calculated
from data. But statistics as a field is a way of thinking about data and quantifying
uncertainty, not a maze of numbers and messy formulas.
Statistics
Statistics is the art and science of collecting, presenting, and analyzing data to answer an
investigative question. Its ultimate goal is translating data into knowledge and an understand-
ing of the world around us. In short, statistics is the art and science of learning from data.
c c Activity 1
Using Data Available Online
to Answer Statistical Questions
Did you ever wonder about how much time people spend
watching TV? We could use the survey question “On the
average day, about how many hours do you personally watch
television?” from the General Social Survey (GSS) to investigate
this statistical question. Data from the GSS are accessible online
via http://sda.berkeley.edu. Click on Archive (the second box in
the top row) and then scroll down and click on the link labeled
“General Social Survey (GSS) Cumulative Datafile 1972–2018.”
You can also access this site directly and more easily via the link
provided on the book’s website www.ArtofStat.com. (You may
want to bookmark this link, as many activities in this book use
data from the GSS.)
As shown in the screenshot,
jj Enter TVHOURS in the field labeled Row. TVHOURS is
the name the GSS uses for identifying the responses to the Press the Run the Table button on the bottom, and a new
TV question. browser window will open that shows the frequencies (e.g., 145 and
jj Enter YEAR(2018) in the Selection Filter field. This will select 349) and, in bold, the percentages (e.g., 9.3 and 22.4) of people who
only the data for the 2018 GSS, not data from earlier (or later) made each of the possible responses. (See the screenshot on the
iterations of the GSS, when this question was also included. next page, which displays the top and bottom of the table.)
jj Change the Weight to No Weight. This will specify that we A statistical analysis question we can ask is, “What is the
see the actual frequency for each of the possible responses most common response?” From the bottom of the table we see
to the survey question and not some numbers adjusting for that a total of 1,555 persons answered the question, of which
the way the survey was taken. 376 or, as shown, 24.2% (= 100 * (376/1,555)) watched two
given in each category and the total shown in the last b. To get data for this question for each year separately,
row of the frequency table, or just read them off from type YEAR into the column variable field. Find the
the percentages given in bold.) percentage of respondents who said yes, definitely to the
c. The questions in parts a and b refer to the cumulative hell question in 2018. How does this compare to the pro-
data over all the years the GSS included this question portion of respondents in 2018 who believe in heaven?
(which was in 1991, 1998, 2008, and 2018). To get the 1.5 GSS and work hours Go to the General Social Survey
percentages for each year separately, enter HEAVEN TRY website, http://sda.berkeley.edu, press Archive, scroll
in the row variable and YEAR in the column variable. down a bit and then select the “General Social Survey
Press Run the Table (keeping Weight as No Weight). GSS (GSS) Cumulative Datafile 1972–2018”, or whichever ap-
What is the proportion of respondents that answered pears as the top (most recent) one. Alternatively, go there
yes, definitely in 2018? How does it compare to the directly via the link provided on the book’s website,
proportion 10 years earlier? www.ArtofStat.com (see Activity 1). Enter the variable
1.4 GSS and heaven and hell Refer to the previous exercise, USUALHRS as the row variable, type YEAR(2016) in
but now type HELL into the row variable box to obtain the selection filter, and choose No Weight for the Weight
GSS field. USUALHRS refers to the question, “How many
data on the question, “Do you believe in hell?” Keep all
other fields blank and select No Weight. hours per week do you usually work?”
a. What percent of respondents said yes, definitely; yes, a. Overall, how many people responded to this question?
probably; no, probably not; and no, definitely not for b. What was the most common response, and what per-
the belief in hell question? centage of people selected it?
In the 2018 GSS, the sample consisted of the 2,348 people who participated in
this survey. The population was the set of all adult Americans at that time—about
Sample 240 million people.
Example 2
Sample and Population b An Exit Poll
Picture the Scenario
Scenario 1 in the previous section discussed an exit poll. The purpose was to
predict the outcome of the 2010 gubernatorial election in California. The exit
poll sampled 3,889 of the 9.5 million people who voted.
Question to Explore
For this exit poll, what was the population and what was the sample?
Think It Through
The population was the total set of subjects of interest, namely, the 9.5 mil-
Did You Know? lion people who voted in this election. The sample was the 3,889 voters who
Examples in this book use the five parts were interviewed in the exit poll. These are the people from whom the poll
shown in this example: Picture the obtained data about their votes.
Scenario introduces the context. Question
to Explore states the question addressed. Insight
Think It Through shows the reasoning The ultimate goal of most studies is to learn about the population. For exam-
used to answer that question. Insight gives ple, the sponsors of this exit poll wanted to make an inference (prediction)
follow-up comments related to the example. about all voters, not just the 3,889 voters sampled by the poll.
Try Exercises direct you to a
TRY
similar “Practicing the Basics” cc Try Exercises 1.9 and 1.10
exercise at the end of the section. Also,
each example title is preceded by a label
highlighting the example’s concept. In this
Occasionally data are available from an entire population. For instance, every
example, the concept label is “Sample and
population.” b 10 years the U.S. Census Bureau gathers data from the entire U.S. population
(or nearly all). But the census is an exception. Usually, it is too costly and time-
consuming to obtain data from an entire population. It is more practical to get
In Practice data for a sample. The GSS and polling organizations such as the Gallup poll or
Sometimes the population is well- the Pew Research Center usually select samples of about 1,000 to 2,500 Americans
defined, but sometimes it is a to learn about opinions and beliefs of the population of all Americans. The same is
hypothetical set of subjects that is true for surveys in other parts of the world, such as the Eurobarometer in Europe.
not enumerated. For example, in a
Similarly, a manufacturing company may not be able to inspect the entire pro-
clinical trial to examine the impact of
a new drug on lowering cholesterol, duction of the day (the population), which could consist of thousands of items.
the population would be all people It has to rely on taking a small sample of the entire production to learn about
with high cholesterol who might have potential quality issues with the production process.
signed up for such a trial. This set of
people is not specifically listed. And
sometimes, the population is not
Descriptive Statistics and Inferential Statistics
even a population in the conventional Using the distinction between samples and populations, we can now tell you
sense, but refers to a theoretical
more about the use of description and inference in statistical analyses.
statistical model for our data.
Never
Rarely
Sometimes
Often
0 5 10 15 20 25 30 35 40
Percent (%)
mmFigure 1.1 Result of a survey of 743 teenagers (ages 13–17) about their handling
of screen time and distraction. (Source: Pew Research Center, http://www.pewinternet
.org/2018/08/22/how-teens-and-parents-navigate-screen-time-and-device-distractions/
© 2020 Pew Research Center)
by the 743 teenagers. From this graph, it’s readily apparent that about 40% never
lose focus and only 8% lose focus often.
Descriptive statistics are also useful when data are available for the entire pop-
ulation, such as in a census. By contrast, inferential statistics are used when data
are available for a sample only, but we want to make a decision or prediction
about the entire population.
In most surveys, we have data for a sample, not for the entire population. We
use descriptive statistics to summarize the sample data and inferential statistics to
make predictions about the population.
Example 3
Descriptive and
Inferential Statistics
b Banning Plastic Bags
Picture the Scenario
Suppose we’d like to investigate what people think about banning single-use
plastic bags from grocery stores. California was the first state to issue a law ban-
ning plastic bags, and similar measures were under consideration for the state of
New York at the beginning of 2019. Let’s consider as our population all eligible
voters in the state of New York, a population of more than 14 million people.
Because it is impossible to discuss the issue with all these people, we can
study results from a January 2019 poll of 929 New York State voters con-
ducted by Quinnipiac University.1 In that poll, 48% of the sampled subjects
said they would support a state law that bans stores from providing plastic
bags. The reported margin of error for how close this number falls to the pop-
ulation percentage is 4.1%. We’ll see (later in the book) that this means we
can predict with high confidence (about 95% certainty) that the percentage
of all New York voters supporting the ban of plastic bags falls within 4.1% of
the survey’s value of 48%, that is, between 43.9% and 52.1%.
1
Quinnipiac University. (2019). New York State voters split on plastic bag ban, Quinnipiac
University poll finds [press release]. Retrieved from https://poll.qu.edu/new-york-state/
release-detail?ReleaseID=2595
Question to Explore
In this analysis, what is the descriptive statistical analysis and what is the
inferential statistical analysis?
Think It Through
The results for the sample of 929 New York voters are summarized by the
percentage, 48%, who support banning plastic bags. This is a descriptive sta-
tistical analysis. We’re interested, however, not just in those 929 people but
in the population of the entire state of New York. The prediction that the
percentage of all New York voters who support banning plastic bags falls be-
tween 43.9% and 52.1% is an inferential statistical analysis. In summary, we
describe the sample and we make inferences about the population.
Insight
The press release about this study mentioned that New York State voters are
split on a plastic bag ban. This is because the interval from 43.9% to 52.1%
includes values on both sides of 50%, showing that supporters may be in the
minority or in the majority. If such a percentage and margin of error were ob-
served in an election poll, New York State would be called “too close to call.”
In April 2019, New York State lawmakers agreed to impose a statewide
ban, which went into effect in March 2020.
For example, the percentage of the population of all voters in New York State sup-
porting a ban on plastic bags is a parameter. We hope to learn about parameters so
that we can better understand the population, but the true parameter values are al-
most always unknown. Thus, we use sample statistics to estimate the parameter values.
Example 4
Sample Statistic and
b Gallup Poll
Population Parameter
Picture the Scenario
On April 20, 2010, one of the worst environmental disasters took place in
the Gulf of Mexico: the Deepwater Horizon oil spill. In response to the
spill, many activists called for an end to deepwater drilling off the U.S. coast.
Meanwhile, approximately nine months after the Gulf of Mexico disaster,
political turbulence gripped the Middle East, causing the price of gasoline in
the United States to approach an all-time high.
In March 2011, Gallup’s annual environmental survey showed that 60%
of Americans favored offshore drilling as a means to reduce U.S. dependence
on foreign oil.2 The poll was based on interviews conducted with a random
sample of 1,021 adults, aged 18 and older, living in the continental United
States, selected using random digit dialing.
Question to Explore
In this study, identify the sample statistic and the population parameter of
interest.
Think It Through
The 60% refers to the percentage of the 1,021 adults in the survey who
favored offshore drilling. It is the sample statistic. The population parame-
ter of interest is the percentage of the entire U.S. adult population favoring
offshore drilling in March 2011.
Insight
This and previous examples have focused on percentages as examples of sam-
ple statistics and population parameters because they are so ubiquitous. But
there are many other statistics we compute from samples to estimate a cor-
responding parameter in the population. For instance, the U.S. government
uses data from surveys to estimate population parameters such as the median
household income or the average student debt. The corresponding sample
statistics used to predict these parameters are the median income or the aver-
age student debt computed from all the people included in the survey.
cc Try Exercise 1.7
2
Saad, Lydia., U.S. Satisfaction Still Running at Improved Level. Gallup, Inc, 2018 © 2020
Gallup, Inc.
Randomness
Random is often thought to mean chaotic or haphazard, but randomness is an
extremely powerful tool for obtaining good samples and conducting experiments.
A sample tends to be a good reflection of a population when each subject in the
population has the same chance of being included in that sample. That’s the basis
of random sampling, which is designed to make the sample representative of the
population. A simple example of random sampling is when a teacher puts each
student’s name on slips of paper, places them in a hat, shuffles them, and then
draws names from the hat without looking.
jj Random sampling allows us to make powerful inferences about populations.
jj Randomness is also crucial to performing experiments well.
If, as in Scenario 2 on page 26, we want to compare aspirin to a placebo in terms
of the percentage of people who later have a heart attack, it’s best to randomly
select those in the sample who use aspirin and those who use the placebo. This
approach tends to keep the groups balanced on other factors that could affect the
results. For example, suppose we allowed people to choose whether to use aspi-
rin (instead of randomizing whether the person receives aspirin or the placebo).
Then, the people who decided to use aspirin might have tended to be healthier
than those who didn’t, which could produce misleading results.
Variability
An essential goal of Statistics is to study and describe variability: How observa-
tions vary from person to person within a sample, and how statistics computed
from samples vary from one sample to the next, i.e., between samples.
Within Sample Variability: People are different from each other, so, not sur-
prisingly, the measurements we make on them vary from person to person. For
the GSS question about TV watching in Activity 1 on page 28, different people
included in the sample reported different amounts of TV watching. In the exit
poll sample of Example 2, not all people voted the same way. If subjects did not
vary, we’d need to sample only one of them. We learn more about this variability
by sampling more people. If we want to predict the outcome of an election, we’re
better off sampling 100 voters than one voter, and our prediction will be even
more reliable if we sample 1,000 voters.
Between Sample Variability: Just as observations on people vary, so do results com-
puted from samples. Suppose you take a random sample of 1,000 voters to predict
the outcome of an election the next week. Suppose the Gallup organization also
takes a random sample of 1,000 voters. Your sample will have different people
than Gallup’s. Consequently, the predictions will differ between these two sam-
ples. Perhaps your poll of 1,000 voters has 480 voting for the Republican can-
didate, so you predict that 48% of all voters are going to vote for that person.
Perhaps Gallup’s poll of 1,000 voters has 460 voting for the Republican candidate,
so they predict that 46% of all voters are going to vote for that person.
A very interesting and useful, yet surprising, fact with random sampling is that
the amount of variability in the results from one sample to another is actually quite
predictable. (Activity 2 at the end of this section will show this.) For instance, the
two predictions from your and the Gallup poll are likely to fall within just three
percentage points of the actual population percentage voting Republican. This
gives us some idea about the variability (and, hence, precision) of our predictions.
If, however, the polls cannot be regarded a random sample from the popula-
tion, perhaps because Republicans are more likely than Democrats to refuse to
participate, then we would need to account for this, resulting in a loss of precision
in our predictions. Issues such as the difficulty of obtaining a truly random sample
or people not responding can dramatically limit the accuracy of polls in predict-
ing the actual population percentage. We will discuss such issues in Chapter 4.
Margin of Error
With statistical inference, we try to estimate the value of an unknown popula-
tion parameter, using information the data provide. Such an estimate should al-
ways be accompanied by a statement about its precision so that we can assess
how useful it is in predicting the unknown parameter. For instance, the Gallup
poll from Example 4 reported that 60% of Americans favored offshore drill-
ing. How close is such a sample estimate to the true, unknown population per-
In Practice centage? To address this, many studies publish the margin of error associated
In February 2019, Gallup’s survey on with their estimates, in statements such as “The margin of error in this study
the job the U.S. president is doing is plus or minus 3 percentage points.” The margin of error is a measure of the
found a 44% approval rating among expected variability in the results from one random sample to the next ran-
U.S. adults. As Gallup writes: “For dom sample. As we have discussed (and will illustrate in Activity 2), there is
results based on the total sample variability between samples. A second sample of the same size about offshore
of national adults, the margin of
drilling could yield a proportion of 58%, whereas a third might yield 61%. A
sampling error is ;3 percentage
points at the 95% confidence level.”
margin of error of plus or minus 3 percentage points summarizes this variability
numerically and indicates that it is very likely that the population percentage
Margin of Error
The margin of error measures how close we expect an estimate to fall to the true
population parameter.
mmFigure 1.2 Simulate Taking Polls to See How Sample Proportions Vary. Screenshot from the Sampling Distribution for the
Sample Proportion app. Using the top slider, we set the population proportion p equal to 0.5. The top graph then shows that a proportion
of 0.5 (or 50%) of the students in the entire student population prefer Diana as the next student body president and the other half prefer
William. Pressing the blue Draw Sample(s) button once simulates taking one sample of size 10, set via the slider labeled “Sample Size
(n)”, from this population. The graph in the middle shows the outcome of this simulation (here: 3 students preferred Diana, 7 preferred
William) and marks the sample proportion voting for Diana (0.3) with a blue triangle labeled pn . Repeatedly clicking Draw Sample(s) will
generate a new sample of 10 students with each click, illustrating how the sample proportion varies from one sample to another. (Watch
the blue triangle in the middle plot jump around.) The bottom plot keeps track of the frequencies of all the sample proportions generated
throughout the simulations. Play around with this app, selecting different values for p and the sample size, and observe how much the
sample proportion varies. Other features of this app will be explained in Chapter 7.
Try Exercises 1.16, part c and d, 1.37 and 1.41 b
Statistical Significance
In many applications, we want to compare two population parameters, such as
the proportion of elderly patients suffering a heart attack between those taking
a daily dose of aspirin and those just taking a placebo. Or, we may want to com-
pare voters who identify with one of two political parties, for instance Democrats
and Republicans, in terms of their amount of charitable contributions. In both
cases, we can investigate if the population parameter in one group differs from
the population parameter in the other group, and by how much. For instance, we
can ask the statistical questions, “Is the average amount given by Republicans
different from the average amount given by Democrats? If yes, by how much?”
If the information from the two samples provides strong evidence that there
is a difference, we say that the two groups differ significantly with respect to the
two population parameters. For statisticians, the word significantly implies that
the observed difference computed between the two samples is larger than what
would be expected from ordinary random sample-to-sample variability. Sample-
to-sample variability refers to variability that occurs naturally, just by chance.
For instance, based on data from the GSS in 2014, we can conclude that the
average charitable contributions differed significantly between Republicans and
Democrats. The sample averages ($3,900 for Republicans, $1,300 for Democrats)
are so different that it is unlikely that such a difference occurred just by random
sample-to-sample variability alone. In other words, we would not expect to see such
a large difference of $3,900 - $1,300 = $2,600 in the average charitable contribu-
tions if voters identifying with these two parties gave the same amounts, on average.
On the other hand, the difference in the proportion of patients suffering a
heart attack between those taking aspirin and those taking a placebo was found
to be not significantly different for elderly patients. This conclusion was reached
based on data from a five-year study in elderly (≥ 70 years) subjects.3 The sample
proportion of 5.0% that suffered a heart attack in the aspirin group is not that
different from the 5.4% that suffered a heart attack in the placebo group. The
difference of 0.4 percentage points observed between the two samples was not
large enough to conclude that the two population proportions differ significantly.
In other words, the observed difference of 0.4 can occur by random variability
alone, and even when the proportion of heart attacks are the same in both groups.
Two-group comparisons and statistical methods for analyzing data such as the
ones mentioned here are explored in Chapter 10. There, we will discuss statistical
significance, how it may differ from practical relevance, and explain that it is more
useful to estimate the size of the difference together with a margin of error.
3
McNeil, J. J. et al. (2018, October 18). Effect of aspirin on cardiovascular events and bleeding in the
healthy elderly, New England Journal of Medicine, 379, 1509–1518. Accessible via https://www.nejm.
org/doi/full/10.1056/NEJMoa1805819?query=recirc_curatedRelated_article
simple graphs (such as the bar graph in Figure 1.1) and numerical summaries
(statistics such as means or proportions).
Chapter 3: Relationships Between Two Variables How does annual income
10 years after graduation correlate with college tuition? You can find out by
studying the relationship between those characteristics.
Chapter 4: Gathering Data How can you design an experiment or conduct a
survey to get data that will help you answer questions? You’ll see why results
may be misleading if you don’t use randomization.
Chapter 5: Probability in Our Daily Lives How can you determine the
chance of some outcome, such as winning a lottery? Probability, the basic tool
for evaluating chances, is also a key foundation for inference.
Chapter 6: Probability Distributions You’ve probably heard of the nor-
mal distribution, a bell-shaped curve, that describes people’s heights or IQs
or test scores. What is the normal distribution, and how can we use it to find
probabilities?
Chapter 7: Sampling Distributions Why is the normal distribution so
important? You’ll see why, and you’ll learn its key role in statistical inference.
Chapter 8: Statistical Inference: Confidence Intervals How can an exit poll
of 3,889 voters possibly predict the results for millions of voters? You’ll find
out by applying the probability concepts of Chapters 5, 6, and 7 to find the
margin of error and make statistical inferences about population parameters.
Chapter 9: Statistical Inference: Significance Tests About Hypotheses How
can a medical study make a decision about whether a new drug is better than a
placebo? You’ll see how you can control the chance that a statistical inference
makes a correct decision about what works best.
Chapters 10–15: Applying Descriptive and Inferential Statistics and Statistical
Modeling to Many Kinds of Data After Chapters 2 to 9 introduce you to the
key concepts of statistics, the rest of the book shows you how to apply them in a
variety of situations. For instance, Chapter 10 shows how to compare two groups
of students, one using cell phones and another one just listening to the radio in a
simulated driving experiment, to make inference about the reaction time when
seeing a red light. In Chapters 12 and 13 you will learn how to formulate statisti-
cal models and use them to make predictions, such as predicting the selling price
of a house or the probability that a consumer accepts a credit card offer.
1.10 Is globalization good? The Voice of the People poll asks a c. In 2010, of those who were very old, more were female
TRY random sample of citizens from different countries around than male.
the world their opinion about global issues. Included in the d. In 2010, the largest five-year group included people
list of questions is whether they feel that globalization helps born during the era of first manned space flight.
their country. The reported results were combined by re- 1.14 Margin of error in better life index The Current Better Life
gions. In the most recent poll, 74% of Africans felt globaliza- Survey (https://www.oecdbetterlifeindex.org) is used by the
tion helps their country, whereas 38% of North Americans Organization for Economic Co-operation and Development
believed it helps their country. (a) Identify the samples and (OECD) to assign Better Life Indices, calculated through
the populations. (b) Are these percentages sample statistics average literacy score and life expectancy at birth, for the
or population parameters? Explain your answer. French population. How close is a sample estimate from this
1.11 Graduating seniors’ salaries The job placement center survey to the true, unknown population parameter?
TRY at your school surveys all graduating seniors at the school. a. Based on the latest data from this survey, the average
Their report about the survey provides numerical summa- literacy score was estimated as 494, with a margin of error
ries such as the average starting salary and the percentage of plus or minus 18. Is it plausible that the average literacy
of students earning more than $30,000 a year. score of all French adults in the age group 25–64 years is
a. Are these statistical analyses descriptive or inferential? 520? Explain.
Explain. b. The survey estimated the life expectancy at birth as
b. Are these numerical summaries better characterized as 83 years, with a margin of error of plus or minus 4
statistics or as parameters? years. Is it plausible that the actual life expectancy at
1.12 Level of inflation in Egypt An economist wants to estimate birth in all of France is less than 80 years? Explain.
TRY the average rate of consumer price inflation in Egypt in the 1.15 Graduate studies Consider the population of all under-
late 20th century. Within the National Bureau, they find re- graduate students at your university. A certain proportion
cords for the years 1970 to 2000, which they treat as a sample have an interest in graduate studies. Your friend randomly
of all inflation rates from the late 20th century. The average samples 40 undergraduates and uses the sample proportion
rate of inflation in the records is 11.6. Using the appropriate of those who have an interest in graduate studies to predict
statistical method, they estimate the average rate of inflation the population proportion at the university. You conduct
in late 20th century Egypt was between 11.25 and 11.95. an independent study of a random sample of 40 undergrad-
a. Which part of this example gives a descriptive sum- uates, and find the sample proportion of those are inter-
mary of the data? ested in graduate studies. For these two statistical studies,
b. Which part of this example draws an inference about a a. Are the populations the same?
population? b. How likely is it that the samples are the same? Explain.
c. To which population does the inference in part b refer? c. How likely is it that the sample proportions are equal?
d. The average rate of inflation was 11.6. Is 11.6 a statistic Explain.
or a parameter? 1.16 Samples vary less with more data We’ll see that the
1.13 Age pyramids as descriptive statistics The figure shown TRY amount by which statistics vary from sample to sample
is a graph published by Statistics Sweden. It compares always depends on the sample size. This important fact
Swedish society in 1750 and in 2010 on the numbers of can be illustrated by thinking about what would happen in
men and women of various ages, using age pyramids. repeated flips of a fair coin.
Explain how this indicates that a. Which case would you find more surprising—flipping
a. In 1750, few Swedish people were old. the coin five times and observing all heads or flipping
b. In 2010, Sweden had many more people than in 1750. the coin 500 times and observing all heads?
b. Imagine flipping the coin 500 times, recording the
proportion of heads observed, and repeating this ex-
Age pyramids, 1750 and 2010 periment many times to get an idea of how much the
proportion tends to vary from one sequence to an-
other. Different sequences of 500 flips tend to result in
1750 Thousands Sweden: 2010
proportions of heads observed which are less variable
80+
Men 90 Women 75–79 than the proportion of heads observed in sequences
Men 70–74 Women
80
65–69 of only five flips each. Using part a, explain why you
60–64
70
55–59 would expect this to be true.
60 50–54
45–49 c. We can simulate flipping the coin five times with the
50 40–44
35–39 Sampling Distribution for the Sample Proportion web
40
30
30–34
25–29 app mentioned in Activity 2. Access the app from the
APP book’s website, www.ArtofStat.com, and set the pop-
20–24
20 15–19
10
10–14
5–9
ulation proportion to 0.5 (so that we are simulating
0
0–4 flipping a fair coin) and the sample size n to 5 (so that
150 100 50 0 0 50 100 150 350 300 250 200 150 100 50 0 0 50 100 150 200 250 300 350
Population (in thousands) we are simulating flipping a coin five times). Then
press Draw Sample(s) once. What proportion of heads
Graphs of number of men and women of various ages, in 1750 and (= successes) did you observe in these five flips? Press
in 2010. (Source: Statistics Sweden.) the Generate Sample(s) button 24 more times. What is
the smallest and what is the largest sample proportion (b) n = 1600, and (c) n = 2500. Explain how the margin of
you observed? Which sample proportion occurred the error changes as n increases.
most often in these 25 simulations? 1.19 Statistical significance A study performed by Florida
d. Press Reset and set the sample size n to 500 instead of International University investigated how musical in-
5, to simulate flipping the coin 500 times. Then press struction in an after-school program affected overall
Draw Sample(s) once. What proportion of heads competence, confidence, caring, character, and connection
(= successes) did you observe in these five flips? Press (known as the “Five Cs”). The study compared a group
the Generate Sample(s) button 24 more times. What is of students who received music instruction to another
the smallest and what is the largest sample proportion group who didn’t. One of the findings was that those re-
you observed? (Hint: Check Show Summary Statistics ceiving music instruction showed, on average, a significant
to see the smallest and largest value generated.) increase in their character. Character was measured by
1.17 Comparing polls The following table shows the result scoring answers to questions such as “I finish whatever I
of the 2018 General Elections in Pakistan, along with begin,” with scores ranging from 1 (“not at all like me”) to
the vote share predicted by several organizations in the 5 (“very much like me”). For each part, select (i) or (ii).
days before the elections. The sample sizes were typically a. For the two groups to be called significantly different,
above 2,000 people, and the reported margin of error was the difference in the average character scores com-
around plus or minus 2 percentage points. (The percent- puted from the participants in the study had to be
ages for each poll do not sum up to 100 because of the (i) much smaller or (ii) much larger than one would
vote share of other regional parties in a democratic setup.) expect under regular sample-to-sample variability.
b. What does a significant increase imply in this context?
Predicted Vote
That the difference in the average score to questions such
Poll PML-N PTI as “I finish whatever I begin” between the group receiving
IPOR 32 29 music instruction and the group that didn’t was (i) much
SDPI 25 29 larger than would be expected if music instruction didn’t
Pulse Consultant 27 30 have an effect or (ii) nearly identical in the two groups.
Gallup Pakistan (Self) 38 25 c. Suppose the two groups did not differ significantly
Gallup Pakistan (WSJ) 36 24 in terms of the average confidence score. This indi-
Gallup Pakistan (Geo/Jang) 34 26 cates that the difference in the average confidence
score of participants (i) can be explained by regu-
Actual Vote 24.4 31.8
lar sample-to-sample variability or (ii) cannot be
explained by regular sample-to-sample variability.
a. The SDPI Poll Predicted that 29% were going to vote 1.20 Education performance A study published in The Sydney
for PTI, with a margin of error of 2 percentage points. Morning Herald in 2019 investigated the effect of financial
Explain what that means. incentives on performance scores of students. As part of
b. Do most of the predictions fall within 2% (the margin the study, 50 schools having students of fifth grade were
of error) of the actual vote percentages? Considering randomly assigned to one of the two treatment groups. One
the relative sizes of the sample, the population, and the group of students (25 schools) were given $2 per home-
other parties factor, would you say that these polls have work problem that they mastered; the other (25 schools)
good accuracy? was given a moral lecture on the importance of mastering
c. Access the Sampling Distribution for the Sample Proportion maths. The outcome of interest of the study was student
APP web app at www.ArtofStat.com (see Activity 2), performance two years after the incentives were stopped.
and set the population proportion to 32% (the actual After discontinuation of incentives, 17% of students in
vote share of PTI rounded to two significant digits) and the financial incentive group reported scores at or above
the sample size to 3,000 (a typical sample size for the the proficiency level for their grade in maths, compared to
polls shown here). Press the Generate Sample(s) but- 13.5% of the group completing their homework. Assume
ton 10 times to stimulate taking a poll of 3,000 voters that the observed difference in performance improvement
10 times. The bottom graph now marks 10 potential rates between the groups (17%-13.5% = 3.5%) is statisti-
results obtained from exit polls of 3,000 voters under cally significant. What does it mean to be statistically signif-
this scenario. You can see the smallest and the largest icant? (Choose the best option from a to d).
value of these 10 proportions by checking the “Summary a. 14.5% was calculated using statistical techniques.
Statistics for Sample Distribution” box under Options. b. We know that if financial incentive were given to all
What are the smallest and the largest proportions you students, 14.5% would perform better.
obtained in your simulation of 10 exit points?
c. Financial incentive was offered to 14.5% more students
1.18 Margin of error and n A poll was conducted by Ipsos in the study than the non school going students who
APP Public Affairs for Global News. It was conducted between were working somewhere.
May 18 and May 20, 2016, with a sample of 1005 Canadians.
d. The observed difference of 14.5% is so large that it is
It was found 73% of the respondents “agree” that the
unlikely to have occurred by chance only.
Liberals should not make any changes to the country’s
voting system without a national referendum first (http:// (Source: https://www.smh.com.au/education/staggering-
globalnews.ca/news/). Find the approximate margin of error cash-incentives-improve-kids-learning-research-shows-
if the poll had been based on a sample of size (a) n = 900, 20190130-p50ukt.html)
Data Files
Today’s researchers (and students) are also lucky when it comes to data. Without
data, of course, we couldn’t do any statistical analysis. Data is now electroni-
cally available almost instantly from many different disciplines and a variety of
sources. We will show some examples at the end of this section.
When statisticians speak of data, they typically refer to the type of data that
is organized in a spreadsheet. That’s why most of the statistical software pack-
ages mentioned above start with a blank spreadsheet to input or import data.
Figure 1.3 shows an example of such a spreadsheet created with Excel, where
various information on student characteristics have been entered:
jj Gender 1f = female, m = male, nb = nonbinary)
jj Racial–ethnic group 1b = Black, h = Hispanic, w = White, o = other)
jj Age (in years)
jj Average number of hours per week watching or streaming TV
jj Whether a vegetarian (yes, no)
jj Political party 1Dem = Democrat, Rep = Republican, Ind = independent2
jj Opinion on Affirmative Action (y = in favor, n = oppose)
Such data files created with a spreadsheet program can be saved in the .csv
(which stands for “comma separated values”) format, which every statistical
A row
refers to
data for a
particular
student
mmFigure 1.3 A spreadsheet to organize data. Saved as a .csv file, it can be imported
into many statistical software packages.
software program can import. Regardless of which program you use to create the
data file, there are certain conventions you should follow:
jj Each row contains measurements for a particular subject (for instance, person).
jj Each column contains measurements for a particular characteristic.
Some characteristics have numerical data, such as the values for hours of TV
watching. Some characteristics have data that consist of categories or labels, such as the
categories (yes, no) for vegetarians or the labels (Dem, Rep, Ind) for political affiliation.
Figure 1.3 resembles a larger data file from a questionnaire administered to a
sample of students at the University of Florida. That data file, which includes other
characteristics as well, is called “FL student survey” and is available for download
from the book’s website, www.ArtofStat.com, together with many other data files.
To construct a similar data file for your class, try the exercise “Getting to know the
class” in the Student Activities at the end of the chapter.
Example 5
Data Files b
Google Analytics
Picture the Scenario
Many websites track user traffic using a tool called Google Analytics (a small
amount of code that loads with the site) which records various characteris-
tics on who is accessing the site. Suppose you built a website and decide to
record some data on people who visit that site, using information that Google
Analytics provides: (i) the country where the visitor’s IP address is registered,
(ii) the type of browser a visitor is using, (iii) whether the visitor accesses the
site through a desktop, tablet, or smartphone, (iv) the amount of minutes the
visitor stayed on your site, and (v) the age and (vi) gender of your visitors.
For a particular day in March, you tracked 85 visitors to your site. The first
visitor was from the United States, used Safari as the browser on a mobile device,
spent six minutes on your site before moving on, and was a 28-year-old female. The
second visitor came from Brazil, used Chrome as a browser running on a desktop,
spent two minutes on your site, and was a 38-year-old female. The third visitor was
from the United States, used Chrome as a browser on a mobile device, spent eight
minutes on your site, and was a 16-year-old with nonbinary gender information.
Question to Explore
a. How do you create a data file for the first three visitors to your site?
b. How many rows and columns will your data file have once you are fin-
ished entering the six characteristics tracked on each visitor to your site?
Think It Through
a. Each row in your data file will correspond to a visitor and each column
will correspond to one of the six characteristics of tracking information
that you collected. So, the first three rows (below a row that includes
the column names) in your data file are:
Visitor Country Browser Device Minutes Age Gender
1 US Safari mobile 6 28 female
2 Brazil Chrome desktop 2 38 female
3 US Chrome mobile 8 16 nonbinary
b. With 85 visitors to the site, your data file will have 85 rows, not count-
ing the first row that holds the column names. And because you
tracked six characteristics on each visitor, it will have six columns, not
counting the first column that holds the visitor number.
Insight
From the entire data file you could now find summary statistics such as the
proportion of visitors who access it on a mobile device. Once the data file
is available in an electronic format, you can open it in a statistical software
package for further analysis.
In Words Each of the six characteristics measured in this study is called a variable.
Variables refer to the characteristics The variables measured make up the columns of the data set. We will learn
being measured, such as the type of more about variables in Chapter 2.
browser or the number of minutes
visiting the site. cc Try Exercise 1.22
Messy Data
The two examples of data files you saw in Figure 1.3 and in Example 5 were
“clean”: Every column had well defined values and there was no missing
information, i.e., empty cells. Raw data are typically a lot messier than what you
will see in this book. For instance, when you asked your fellow students for in-
formation on gender, some may have entered “m,” “f,” and “nb” and others may
have entered “male,” “female,” and “nonbinary.” Some may leave out the hy-
phen in “non-binary,” others may write “man” and “woman,” and still others
don’t identify with any of the options and left the cell blank. Perhaps some stu-
dents filled in the survey twice. Similarly, the column for married in Figure 1.3
shows values of 0 for not married and 1 for married, but this may already be
processed information from the original responses.
In practice, considerable effort initially has to be dedicated to importing,
In Practice cleaning, and preprocessing the raw data into a format that is useful for statistical
Data importing, cleaning, and analysis. Along the way, decisions have to be made, such as how to code observa-
preprocessing often are the most tions and how many categories to use, how to identify duplicates, and what to do
time-consuming steps of the entire
with missing values. Missing values are especially tricky. A single missing value
statistical analysis. However, they
are an essential part of the statistical
may lead to the entire row being discarded in the statistical analysis, which throws
investigative process. away information. What’s more, sometimes the fact that the observation is miss-
ing is informative in itself. To check if and how decisions made during preprocess-
ing may have affected the statistical analysis, the entire analysis is repeated under
some alternative scenarios.
Contagious and destructive nature of the disease. Mode of extension from the
mouth and pharynx. Causes: bacillus diphtheriæ columbarum; its characters:
pathogenesis to birds, mice, rabbits, Guinea pigs; dogs, rats, and cattle immune;
diagnosis from bacillus diphtheriæ. American disease. Incubation. Symptoms:
prostration; wheezing breathing; sneezing; difficult deglutition; false membrane on
fauces; necrotic changes in mucosa; perforations; lesions of internal organs; blood
infection; nostrils stuffed; bill gapes; lesions on eye, tongue, gullet, crop, intestine;
diarrhœa; vomiting. Skin lesions. Course, acute, chronic. Paralysis. Mortality.
Prognosis. Diagnosis from coccidiosis, from croupous angina of Rivolta, from
aspergillus disease. Treatment: isolation; destruction of carcasses; hatching;
destruction of dead wild birds and rabbits; exclusion of living; quarantine of new
birds; disinfection; locally, antiseptics by inhalation, swabbing, and internally, iron
in water.