Recipes for Science

Today, scientific literacy is an essential aspect of any undergraduate education. Recipes for Science responds to this need by providing an accessible introduction to the nature of science and scientific methods, reasoning, and concepts that is appropriate for any beginning college student. It is designed to be adaptable to a wide variety of different kinds of courses, such as introductions to scientific reasoning or critical thinking, philosophy of science, and science education. In any of these different uses, the book helps students better navigate our scientific, 21st-century world.

KEY FEATURES

• Contemporary and historical examples of science from many fields of physical, life,
and social sciences.
• Visual aids to clarify and illustrate ideas.
• Text boxes to explore related topics.
• Plenty of exercises to ensure full student engagement and mastery of the information.
• Annotated ‘Further Reading’ sections at the end of each chapter.
• Final glossary with helpful definitions of key terms.
• A companion website with author-developed and crowdsourced materials, including syllabi for courses using this textbook, a bibliography of additional resources and online materials, sharable PowerPoint presentations and lecture notes, and additional exercises and extended projects.

Angela Potochnik is Associate Professor of Philosophy and Director of the Center for Public Engagement with Science at the University of Cincinnati, USA.

Matteo Colombo is Assistant Professor in the Tilburg Center for Logic, Ethics, and
Philosophy of Science, and in the Department of Philosophy at Tilburg University, the
Netherlands.

Cory Wright is Professor of Philosophy and Director of Graduate Studies at California State University Long Beach, USA.

Recipes for Science
An Introduction to Scientific
Methods and Reasoning

Angela Potochnik
Matteo Colombo
Cory Wright
First published 2019
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2019 Taylor & Francis
The right of Angela Potochnik, Matteo Colombo, and Cory Wright to be identified
as authors of this work has been asserted by them in accordance with sections 77
and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised
in any form or by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying and recording, or in any information
storage or retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered
trademarks and are used only for identification and explanation without intent
to infringe.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book has been requested
ISBN: 978-1-138-92072-9 (hbk)
ISBN: 978-1-138-92073-6 (pbk)
ISBN: 978-1-315-68687-5 (ebk)
Typeset in Berling
by Apex CoVantage, LLC

Visit the companion website: www.routledge.com/cw/potochnik


For all the excellent teachers from whom we’ve learned our love of science
Contents

List of Figures and Tables ix
Acknowledgments xii

Introduction: Science and Your Everyday Life 1

1 What Is Science? 7
1.1 The Importance of Science 7
1.2 Defining Science 15
1.3 Recipes for Science 31

2 Experiments and Studies 46
2.1 Experiment: Connecting Hypotheses to Observations 46
2.2 The Perfectly Controlled Experiment 62
2.3 Experimental and Non-Experimental Methods 72

3 Models and Modeling 89
3.1 Models in Science 89
3.2 Varieties of Models 102
3.3 Learning From Models 115

4 Patterns of Inference 125
4.1 Deductive Reasoning 125
4.2 Deductive Reasoning in Hypothesis-Testing 141
4.3 Inductive and Abductive Reasoning 150

5 Statistics and Probability 167
5.1 The Roles of Statistics and Probability 167
5.2 Basic Probability Theory 172
5.3 Descriptive Statistics 182

6 Statistical Inference 207
6.1 Generalizing From Descriptive Statistics 207
6.2 Using Statistics to Test Hypotheses 221
6.3 A Different Approach to Statistical Inference 232

7 Causal Reasoning 242
7.1 What Is Causation? 242
7.2 Testing Causal Hypotheses 255
7.3 Causal Modeling 262

8 Explaining, Theorizing, and Values 275
8.1 Understanding the World 275
8.2 Theorizing and Theory Change 288
8.3 Science, Society, and Values 297

Glossary 310
References 322
Index 327
Figures and Tables

FIGURES

1.1 Notable early scientists studying carbon dioxide (CO2) and climate 9
1.2 Keeling curve: ongoing increase in atmospheric concentrations of CO2 10
1.3 Ice core data from Antarctica 10
1.4 Unprecedented increases in atmospheric CO2 in the past century 11
1.5 Scientists in the Persian Golden Age 18
1.6 Appearance of retrograde motion 19
1.7 (a) Schematic flowchart of simple falsificationism; (b) Karl Popper 26
1.8 Clever Hans and Wilhelm von Osten 34
1.9 Reorientation from geocentrism to heliocentrism 43
2.1 Illustrations of two crosses between pea plants 47
2.2 Western Electric’s Hawthorne factory illumination study 51
2.3 Isaac Newton’s illustration of his two-prism experiment 52
2.4 William Herschel’s experimental setup to test the relationship between
the color and temperature of light 54
2.5 Three scientists who contributed to our knowledge of light 55
2.6 Headlines reporting on Arthur Eddington’s observations during the 1919
eclipse, which confirmed Albert Einstein’s theory of general relativity 65
2.7 Mars Curiosity rover selfie taken on Mount Sharp (Aeolis Mons) on
Mars in 2015 74
2.8 Cholera epidemic, close-up of Snow’s Broad Street map 79
2.9 Phineas Gage posing with the rod that passed through his skull 81
2.10 Isaac Newton’s cannon thought experiment 86
3.1 View of the San Francisco Bay Model 90


3.2 The Reber Plan 92
3.3 (a) Drosophila melanogaster; (b) The four chromosomes of Drosophila 96
3.4 Visual representation of the Lotka-Volterra model 99
3.5 The problem of curve-fitting 105
3.6 James Watson and Francis Crick’s double helix model of DNA 107
3.7 William Phillips’s MONIAC hydro-economic model 109
3.8 Visual depiction of the sodium-potassium pump 110
3.9 Accuracy versus precision 121
4.1 Edwin Hubble at Mt. Wilson Observatory 126
4.2 Frieze at the Social Hygiene Museum in Budapest, honoring
Ignaz Semmelweis 145


4.3 (a) Flint, Michigan water crisis; (b) Lee Anne Walters, the Flint citizen-scientist who initially requested water-testing 151
4.4 The black swan of the family (Black Australian swan surrounded by
Bewick’s swans) 154
4.5 (a) The Earth’s landmasses fit together a bit like puzzle pieces;
(b) Marie Tharp and Bruce Heezen 157
4.6 The pan-African dawn of Homo sapiens 162
5.1 Visualization of the conditional probability of rolling a number less than
four given that you roll an odd number 179
5.2 (a) Pie chart of a coffeeshop’s sales; (b) Bar chart of per capita national
beer consumption 185
5.3 (a) Histogram of a unimodal grade distribution; (b) Histogram of a
bimodal grade distribution 186
5.4 Examples of (a) uniform, (b) ∪-symmetric, and (c) ∩-symmetric distributions; (d) Examples of asymmetric distributions 188
5.5 (a) Histogram of the Quiz 1 grade distribution in Table 5.2;
(b) Histogram of the Quiz 2 grade distribution in Table 5.3 193
5.6 Standard deviation in a normal distribution 195
5.7 An imagined scatterplot of the relationship between alcohol
consumption and decibel level in bars 196
5.8 A regression analysis of Galton’s data on the diameter of sweet pea seeds 198
5.9 Scatterplots depicting correlational strength and direction 199
5.10 Francis Galton 200
5.11 Visualizations for Exercise 5.17: (a) Average expenditure per dollar of
Indiana property tax, 2013; (b) Composite score GRE and academic
major; (c) Iris petal length; (d) Number of digs performed and
amphorae found 203
6.1 (a) Probability distribution of heads for 100 coin tosses; (b) Example of
normal distribution for a continuous variable 211
6.2 Diagram of the 68%–95%–99.7% rule for standard deviations 214
6.3 Four histograms of roughly normal distributions 219
6.4 Fabiola Gianotti, project leader and spokesperson for the ATLAS
experiment at CERN involved in the discovery of the Higgs boson
in July 2012 222
6.5 R. A. Fisher 225
6.6 Probability distribution of the number of guesses your friend will get
correct if she is randomly guessing 227
6.7 Thomas Bayes 235
7.1 Annual seismic activity in Oklahoma 1978–2017 243
7.2 USGS map showing locations of wells related to seismic activity
2014–2015 243
7.3 Visualization of the correlation between per capita consumption of
cheese and number of people who died from getting tangled in their
bedsheets 248
7.4 Generic causal graph with nodes representing variables of interest and
arrows representing direct causal relationships 264


7.5 Causal graph of the relationships between posting copyrighted material on your Facebook page, a friend reporting you, and your Facebook page being shut down 266
7.6 Causal graph for the dyspnoea case 269
8.1 Oklahoma Senator James Inhofe speaking before the US Congress in
2015 while brandishing a snowball 278
8.2 Ridership data for Yellow Taxis and Uber in New York City 2015–2017 280
8.3 Occurrence of the word law in PsychLit abstracts per 10,000 entries 284
8.4 Partial sketch of a bicycle 287
8.5 Scientists of the chemical revolution 293
8.6 (a) Rosalind Franklin; (b) Franklin’s x-ray diffraction image that famously
inspired Watson and Crick’s double-helix model of DNA 299

TABLES

1.1 Checklist for evaluating whether an idea or project qualifies as scientific 27


1.2 Individual and social norms that protect against bias and flaws in
reasoning 38
2.1 Elements of the perfectly controlled experiment 71
3.1 Payoff matrix for the prisoner’s dilemma with Dominik 112
3.2 Payoff matrix for a generic prisoner’s dilemma 112
4.1 Conditional statements 130
4.2 Valid inference patterns, invalid inference patterns, and informal fallacies 143
4.3 Annual births, deaths, and mortality rates for all patients at the two
clinics of the Vienna Maternity Hospital 1841–1846 144
5.1 Addition, multiplication, and subtraction rules and their conditions 176
5.2 Imagined data set and central tendencies for 17 student scores on
10-point Quiz 1 190
5.3 Imagined data set and central tendencies for 17 student scores on
10-point Quiz 2 192
5.4 Average diameter of parent/offspring sweet pea seeds 197
5.5 Data on Titanic survivors 205
6.1 (a) Frequency distribution of a bag of 35 M&Ms; (b) Relative frequency
distribution 209
6.2 Imagined questionnaire scores of 100 university students 213


6.3 Summary of statistical hypothesis-testing and its relationship to general
hypothesis-testing 224
7.1 Mill’s methods 259
7.2 Conditional probabilities for the causal graph in Figure 7.5 267
7.3 Possible values for variables in the dyspnoea case 268
7.4 Conditional probabilities of developing lung cancer given level of
pollution exposure and whether or not a person smokes 269
8.1 Thomas Kuhn’s four-stage view of scientific change 292
8.2 Five questions that arise when doing science that our values help
us answer 302

Acknowledgments

Many people have contributed to this book in a variety of ways. Thanks to Gila Sher,
who made possible Cory’s initial conversations with Senior Editor Andrew Beck at Routledge. Without Gila’s encouragement, there never would have been a book proposal.
Andy’s initial vision for the book was crucial for framing the project, and his later editorial guidance and support were matched only by his enduring patience and flexibility.
Thanks also to Routledge Development Editor Alison Daltroy and Editorial Assistants
Vera Lochtefeld and Emma Starr, along with the dozens of anonymous reviewers of both
the original proposal and the later completed manuscript draft. Their feedback left an
indelible imprint on what went into the book, as well as the final product that resulted.
Several students provided helpful research assistance. Nathan Sollenberger, Alejandro
Garcia, and Karina Laigo from the undergraduate research program at Cal State Long
Beach helped kick off the book proposal, and Christopher Laplante provided very helpful
editorial assistance at the end stages of production. Micah Freeman and Sahar Heydari
Fard at the University of Cincinnati provided valuable comments on the whole manuscript and assistance with glossary compilation. Several colleagues provided extremely
helpful feedback on parts of the manuscript, including Zvi Biener, Vanessa Carbonell, Jan
Sprenger, Naftali Weinberger, and Nellie Wieland.
Angela owes a further debt of gratitude to Zvi Biener for working with her to design
the University of Cincinnati course How Science Works, which inspired her contributions
to the book. More generally, she deeply appreciates her colleagues and friends at the
University of Cincinnati, inside and outside of philosophy. She also thanks her family for
their patience during the periods of time when she was a bit lost to this project.
Cory is grateful to his family for their patience, and is looking forward to making up
for lost time. He would also like to thank Henk, whose unfailing devotion to this project and daily emotional support and encouragement were as great as any hound’s could be.
Matteo would like to thank his colleagues and friends at the Tilburg Center for Logic,
Ethics, and Philosophy of Science (TiLPS), his family, and Chiara, for their encouragement,
inspiration, and care. During this project, he was generously supported by the Deutsche
Forschungsgemeinschaft (DFG) as part of the priority program New Frameworks of
Rationality [SPP1516], and by the Alexander von Humboldt Foundation. He would also
like to acknowledge Zio P.’s apt reminders of the quote constanter et non trepide.

Introduction
Science and Your Everyday Life

POLIO, HPV, AND OTHER ILLNESSES

What do the American president Franklin D. Roosevelt, the Mexican painter Frida Kahlo,
and the Jamaican reggae trio Israel Vibration have in common? Many people today can’t
guess the correct answer: they all suffered from polio (or poliomyelitis), which can cause
paralysis and even death. This can be hard to guess because scientists and doctors have
successfully turned polio from a global health problem to mostly just a part of history.
Many other people throughout human history have suffered from this crippling infectious disease—most of them young children. In 1952 alone, the polio epidemic ravaged
nearly 60,000 Americans. In 1955, led by the virologist Jonas Salk, a team of scientists
discovered a vaccine for polio. Thanks to the introduction of mass vaccination programs
immediately thereafter, polio cases have decreased worldwide by over 99%. Today, there
are only three countries where polio still exists: Pakistan, Afghanistan, and Nigeria. As of
2016, there were only 37 known cases remaining.
The near-eradication of polio counts among the most important human—and scientific—
achievements. Vaccination provides you with immunity, which protects you for life. Going
unvaccinated, in contrast, is a serious risk since polio is highly infectious and human
migration is rapid. Outbreaks are still possible. It’s a no-brainer that people should demand
that they and their children be vaccinated.
And yet, many people today are not vaccinated for polio. In wealthier countries, like
the US, UK, Italy, Australia, France, and Russia, the biggest challenges to vaccination come
from skeptics opposed to vaccination for ideological reasons and from mere complacency.
In other countries, like Nigeria, Pakistan, and Afghanistan, political and religious challenges
intertwine with issues of marginalization and feasibility. It is harder to deliver vaccinations to at-risk communities, which might suffer from extreme poverty and lack needed
infrastructure. In any nation, communication of the effectiveness, safety, and public health
value of vaccination benefits from a sound understanding of the science of vaccines.
Unlike polio, HPV (human papilloma virus) is extremely widespread, with roughly
40% of the world’s population infected; it’s the most common sexually transmitted disease in the world. Among other effects, HPV substantially increases the risk of various types of cancer. There’s also a vaccine for HPV. It was first available in 2006, after
thorough testing for safety and efficacy. The World Health Organization (WHO) recom-
mends HPV vaccines as part of routine vaccinations in all countries.
This discussion of vaccination is meant to illustrate that understanding what’s involved
in good science and scientific reasoning is of extreme importance. At some point in their lives, many people need to make a decision about whether to be vaccinated for some
disease or whether to have their children vaccinated. And sometimes vaccine skeptics
have louder voices than doctors and other vaccination advocates, so it can seem difficult
to get a clear account of vaccines’ safety, effectiveness, and necessity for public health.
The polio vaccine has undergone thorough testing for safety and efficacy, initially in a
study involving 1.2 million children and in many other studies since. The same is true for
other vaccinations, including the HPV vaccine. And claims from vaccine skeptics about
substantive risks of vaccination have been thoroughly debunked.
But don’t take it from us. Learn about scientific experiments so you can assess the
quality of vaccination studies. Learn sound and problematic forms of inference in order
to assess the scientific inferences supporting the use of vaccines (and the problems with
vaccine skeptics’ attempts to sow fear). Study causal reasoning so that you can critically
assess the weight of the evidence against claims about vaccination causing autism (which
has in fact been thoroughly debunked). These topics and others important for the critical
assessment of scientific findings and their public reception are the focus of this book.

WHY LEARN ABOUT SCIENCE?

As the case of vaccination suggests, scientific findings, and the public’s reactions to them,
dramatically shape our world. More than this, science also regularly and dramatically influences your life, whether or not you want it to. If this is not immediately apparent, that
may be because of the extensiveness of science’s reach. One way or another, everybody
is impacted by science.
The reach of science means that you have a lot to gain from being able to understand
and assess scientific reasoning. This enables you to make educated decisions about your own
and your family’s medical care. It also makes it possible for you to critically evaluate reports
of scientific findings and the credentials of experts in order to decide what to believe.
This ability is important, since so much of our daily life is impacted by scientific findings.
Here’s another example of unavoidable science related to health. Peanut allergies are
serious and develop early in life, and rates of this allergy are on the rise. In 2015, medical recommendations regarding when to introduce peanut products to babies changed
radically in the US from waiting until at least one year of age to introducing as early as
possible. Both waiting and introducing early were said to reduce the risk of allergic
reaction. Should you follow this new advice for your baby, if and when you have one?
If the medical researchers were (apparently) wrong about the last recommendation, why
should you follow this new recommendation?
A sophisticated user of science is also well positioned to make judgments about science more globally. Is it good for the government to fund basic scientific research? Is the
level of funding for medical research adequate? Should we worry if private corporations
fund science, given that governmental and university funding is in short supply?
Answers to these questions require a view about the status of the scientific enterprise as
a whole, how it should relate to society, and whether and how funding sources matter.
Scientists are, of course, the main practitioners of science. Other researchers have as
their primary focus understanding what science—and scientists—are up to. This latter
group is interested in understanding what science is and how it works, its pitfalls and limitations, and its relationship with society. These topics are what this book is all about.
Several disciplines investigate science in this way; primary among these are history, philosophy, and sociology. Historians have worked hard to make sense of the history of
science—how the events unfolded that contributed to making science what it is today.
Sociologists also study science, especially the social and cultural influences on how science
works and what it produces. This book draws from the history and sociology of science,
but its main approach is philosophical. There’s a simple reason for that: we, its authors,
are philosophers of science.
If you haven’t studied philosophy of science, it may sound obscure. But philosophy
of science is just a way of thinking hard about the scientific enterprise. It focuses especially on questions of what science should be like in order to be a trustworthy route to knowledge and to achieve the other ends we want it to have, such as usefulness to society. Although written from a philosophical perspective, this book does not dwell on philosophers’ debates about science. Instead, we aim to use philosophical insights about science
without getting bogged down in controversies, technical terminology, or intricate details.

RECIPES FOR SCIENCE

The title of this book, Recipes for Science, is meant to evoke two ideas about science. First,
recipes for baked and cooked items like bread, pies, and stir-fry come in lots of different
versions. Some differences are rather trivial, like whether measurements are in weight or
volume. Others are substantial, like whether a bread is leavened with yeast or with baking soda and powder. Enough substantial differences can result in very different products,
even products that go by the same name but contain entirely different ingredients. Science
is also like this. It proceeds in many different ways, and there’s no magical ingredient or
essential list of ingredients that guarantees good science.
At the same time, a recipe is a formula intended to lead to a specific outcome, with
an intentional combination of ingredients and use of methods to achieve that outcome.
Different recipes for a given type of food have certain elements in common, even if many
of their other features vary. So, for example, breads generally incorporate grain of some
kind as a major ingredient, most have a leavening agent of some kind, and they are cooked,
usually but not always by baking in the oven. There are family resemblances among different breads and the recipes used to make them, even if there’s no simple definition of
bread and no one recipe required to make bread.
Science is like this as well. Even as it proceeds in different ways, and even as there’s
no one overarching set of instructions or mechanical procedures that guarantees good
science, there are certain generalizations that can be made about how good science is
conducted. Many different activities count as science, and there are also differences in
how each of these activities is carried out. But there are also family resemblances among
instances of science, just as there are among breads.
This book aims to facilitate a clear understanding of the key elements of science and
why those elements are significant, even as it illustrates the tremendous variety of projects
that count as science.
The first three chapters address the nature of science and its key methods. Chapter 1
surveys what is distinctive and important about science while also showing how elusive the very concept of science can be. We suggest a checklist approach to distinguishing
science from non-science and fake science and suggest—in lieu of a single, one-size-fits-
all method—that there are various recipes for science. Chapter 2 outlines the role of
experimentation in science and the features of a perfectly controlled experiment. Then
the chapter catalogues a range of methods for experimental and non-experimental studies
and discusses the advantages and disadvantages of each. Chapter 3 focuses on scientific
models: how they are constructed and used, and the main varieties in which they come.
The chapter ends by discussing the relationship between modeling and experimentation
and asking the question of what features of models contribute to their scientific value.
The next four chapters focus on scientific reasoning. Chapter 4 describes the primary
patterns of inference in science: deductive, inductive, and abductive reasoning. The chapter starts with patterns of deductive inference and their use in scientific hypothesis-testing,
moves to the importance of and challenges with inductive inference, and then turns to
the scientific significance of abductive reasoning, also known as inference to the best
explanation. Chapter 5 surveys basic statistical methods, beginning with their basis in
probability theory and proceeding through descriptive statistics. Chapter 6 expands on
that discussion to outline inferential statistics, including sampling and hypothesis-testing.
The chapter ends by introducing the Bayesian approach to statistics and discussing some
of its differences from the classical approach. Chapter 7 engages with causal reasoning in
science. Topics include the nature of causation, the relationship between causal reasoning
and statistical reasoning, testing causal hypotheses, and causal modeling.
Finally, Chapter 8 examines the purpose of science and its relationship to society. We
address the nature of scientific explanation and scientific theories, how theory change and
progress in science occur, and how society and values influence science. The book closes
with a consideration of the current challenges facing science.

INTENDED AUDIENCES AND HOW TO USE THE BOOK

The intended audience for this book includes anyone who wants to have a more sophisticated understanding of the nature of science and a stronger basis for assessing scientific
reasoning.
This book is not just for students of philosophy or science majors. Indeed, the primary
audience we had in mind as we developed this book is an undergraduate student in a
general education course, who may not take any additional science courses in college. We
asked ourselves, what would that student most benefit from knowing about how science
works? What episodes from historical and current science would that student be interested
to read about and contemplate?
That said, we expect this book will also be useful for some more specialized or more
advanced courses. These include science education courses, especially those that focus on
the nature of science and scientific reasoning. These also include introductory philosophy
of science courses, especially if supplemented with more canonical readings or readings
that address some of the major philosophical controversies about science. We also expect
this book to be of use in introductory science courses, especially methods courses, when
supplemented with appropriate material specific to the particular scientific field of study.


This textbook was designed to be usable in its entirety in a standard 15-week semester.
Students spend one-third of the semester learning about the nature of science, including
the key features of science, experimentation, and scientific modeling (Chapters 1–3).
Most of the remaining semester is then spent learning about scientific reasoning, including deductive, inductive, and abductive reasoning patterns; probability and statistics; and causal reasoning (Chapters 4–7). The final unit of the course addresses the scientific successes of explanations and theories and science’s relationship with society (Chapter 8).
Given the range of course levels and disciplines for which this book is appropriate,
and given the reality that different instructors have different teaching goals, we have also
designed the textbook to be usable in a variety of ways. The textbook is modular; each
chapter can be used independently from the others. Instructors (or independent readers)
can thus choose to use only the chapters that suit their needs. Each section may rely on
information provided in earlier sections of the same chapter but does not presume facility
with information from other chapters. Instructors may choose not to assign later sections
in some chapters that seem overly specialized or too difficult given the focus of their
course. Finally, some material that is more difficult or philosophical is separated from the
main text in boxes. Here, too, instructors can choose whether to assign material in boxes.
Here are a few examples of how this might play out in different courses. A critical
reasoning course focused on science may limit its attention to Chapters 4–7—deductive,
inductive, and abductive inference patterns, probability and statistics, and causal reasoning.
A science education course on the nature of science may use Chapters 1–3 and 8, addressing the key features of science, experimentation, modeling, and theories and explanations.
An introductory philosophy of science course might make use of Chapters 1–4 and 8,
supplemented with primary philosophical texts. Other introductory courses might use
the full book except for some of the more difficult sections, like 6.3 on Bayesian statistics
and 7.3 on causal modeling.

SUPPLEMENTARY MATERIALS

Each section in this book ends with a list of exercises. We have tried to provide exercises
that will solidify understanding or challenge students to apply what they have learned.
We encourage instructors to make use of these exercises for in-class group or individual
activities, homework, and exam questions. Individuals who are working through this book
independently might also benefit from completing some of the exercises.
There is a list of suggested further reading at the end of each chapter, which provides inroads into a more in-depth investigation of individual topics covered. The further reading selections thus provide some options for instructors and individual readers who want
to focus on specific topics in more depth. At the end of the book, there is a glossary of
technical terms and other specialized vocabulary that students can consult as needed.
Terms defined in the glossary are indicated in the main text with bold and italics, as
‘philosophy of science’ was earlier in this introduction.
Finally, there is a website to accompany this textbook: www.routledge.com/
cw/potochnik. The website includes example syllabi for different kinds of courses
utilizing this text, additional exercises, and links to content available on the internet that will enrich readers’ experience with the topics covered in this book. Because introductory
scientific reasoning courses are offered at only some universities, and in even fewer philosophy departments, the website also provides information and links to information
about the value of such courses and why philosophy departments are good places to
house them.

EXERCISES

0.1 What do you expect to learn from this textbook and the course you’re reading it for?
0.2 What most concerns you about this textbook and the course you’re reading it for?
0.3 Do you think you will benefit from learning more about the nature of science and scientific reasoning? Why or why not?
0.4 What do you think is most valuable about learning about science and scientific
reasoning?
0.5 Describe your relationship to science. To help you get started, you might consider the
following questions. Have you taken many courses in science or read about science
on your own? If so, on what topics? Do you know any scientists? Do you think there
are reasons to distrust or dislike science? If so, what are the reasons?

FURTHER READING

For more on HPV and vaccines, see World Health Organization (2017, May). Human papillomavirus vaccines: WHO position paper. 92, 241–268. Retrieved from http://apps.who.int/iris/bitstream/10665/255353/1/WER9219.pdf?ua=1
For a concise explanation of myths surrounding vaccines, see PublicHealth.org (2018, May). Understanding vaccines: Vaccine myths debunked. Retrieved from https://www.publichealth.org/public-awareness/understanding-vaccines/vaccine-myths-debunked/
For a concise overview of global health and vaccination, see Greenwood, B. (2014). The
contribution of vaccination to global health: past, present and future. Philosophical
Transactions of the Royal Society B, 369(1645), 20130433.
For a thorough treatment of immunization and vaccination, see World Health Organization, Research and development. Retrieved from www.who.int/immunization/documents/research/en/

CHAPTER 1

What Is Science?

1.1 THE IMPORTANCE OF SCIENCE

After reading this section, you should be able to do the following:

• Describe how scientific research supports the finding of human-caused climate change and why public opinion lags behind the scientific research
• Discuss the nature of knowledge and the varieties of scientific knowledge
• Articulate the limits of science and describe a type of project outside those limits

A Serious Practical Concern


April 22, 2016, was a historic day. While people worldwide celebrated Earth Day and
expressed their support for protection of the environment, representatives of 177 nations
signed the Paris Agreement at the UN headquarters in New York. The Paris Agreement
followed up on the Kyoto Protocol to unite most countries on our planet to deal with
climate change. Indeed, there was near-total international unity and consensus, and only
two countries did not sign: Nicaragua, because the agreement did not go far enough, and
Syria, because of its civil war and subsequent governmental collapse.
The Paris Agreement aims to keep the average global temperature rise this century to
well below 2° Celsius. Two degrees might seem like a minor change in temperature, but
as an average global temperature increase, it would be a really big deal. Think of this temperature increase like a fever. The human body maintains a relatively constant temperature
in the range of 36.5–37.5° Celsius (97.7–99.5° Fahrenheit). Were your body temperature
to increase just 2° Celsius, you would have a raging fever, as your temperature would be
over 38.8° Celsius (102° Fahrenheit). If your body were suddenly that much warmer on
average, it would be a serious and potentially devastating medical emergency—and all
the more so without medical treatment options. An average global temperature increase
of 2° Celsius would be similarly devastating for Earth, and there would be few if any
treatment options.
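As a worked check of these figures: absolute temperatures and temperature differences convert differently between Celsius and Fahrenheit,

\[ T_F = \tfrac{9}{5}\,T_C + 32, \qquad \Delta T_F = \tfrac{9}{5}\,\Delta T_C. \]

So a body temperature of 38.8° Celsius is (9/5)(38.8) + 32 ≈ 102° Fahrenheit, while a rise of 2° Celsius is a rise of only (9/5)(2) = 3.6° Fahrenheit.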
But why, exactly, would it be so devastating? First, because it changes the Earth’s
climate. The atmospheric concentrations of greenhouse gases, such as methane (CH4),
carbon dioxide (CO2), and water vapor, are a major factor affecting the Earth’s climate.
Greenhouse gases work like a blanket. As incoming radiation from the Sun permeates
our atmosphere, some of this heat hits the Earth and is reflected back out to space. But greenhouse gases trap some of the heat in the atmosphere; this blanket of radiant heat
warms the planet’s surface, making it hospitable to life. But increasing amounts of greenhouse gases trap increasing amounts of heat. As a result, mountain glaciers are shrinking
and ice sheets are melting in the Arctic, Greenland, and Antarctica; sea levels are rising;
precipitation patterns across seasons are more unstable; more droughts and heat waves
are occurring; and the blooming times of flowers and plants are shifting. All these changes
are consequences of global warming.
Second, the changing climate has other downstream effects. The rise in global temperature and resulting climate changes threaten to push some animal and plant species to extinction, collapse ecosystems, and make extreme weather more frequent. It also threatens to destabilize social conditions. Drinking water will become scarcer and droughts more
frequent and severe; crop yields may decrease. Coastal cities and island nations are at risk
of serious floods and devastating hurricanes. In this way, climate change is also affecting
global health, poverty, hunger, and various nations’ security. Ultimately, global warming
will make the Earth less hospitable for all creatures, including humans, and probably also
a more unjust place in virtue of who will suffer and how this suffering will be managed.
Earth’s climate has never been static; it has been fluctuating for billions of years. Besides
the concentration of greenhouse gases, factors that affect it include variations in the Earth’s
orbit, the motion of tectonic plates, the impact of meteorites, and volcanism on the Earth’s
surface. So, what’s special about the current climate changes? Why is this different?
What’s special about the current changes in Earth’s climate is the role of human activities in generating them. The basic reasoning behind this conclusion is simple and clear.
We have known since the 18th century that burning carbon-based fossil fuels releases
carbon dioxide (CO2) into the atmosphere. During the last three centuries, at least since
the beginning of the Industrial Revolution, human activities have been releasing CO2
into the atmosphere at an unprecedented rate. Large-scale releases of CO2—one of the
greenhouse gases—into the atmosphere increase its heat retention, thus increasing the
Earth’s average global temperature. And scientists have in fact measured such an increase
in average global temperature. So it’s clear that human activity during the last couple of
centuries has increased the Earth’s average global temperature.
Systematic research on the relationship between CO2 emissions and climate change
began in the 19th century, when the American engineer Marsden Manson noted that
‘the rate at which a planet acquires heat from exterior sources is dependent upon the
power of its atmosphere to trap heat; very slight variations in the atmospheric constituents [produce] great variations in heat trapping power’ (Manson, 1893, p. 44). A few
years later, the Swedish physicist and chemist Svante August Arrhenius (1859–1927)
completed an extensive set of calculations, showing that the changes in CO2 function
as a ‘throttle’ on other greenhouse gases like water vapor. He also calculated that there
would be an Arctic temperature increase of approximately 8° Celsius (14.4° Fahrenheit)
from atmospheric carbon levels two to three times their known value at the time.
Arrhenius later predicted that ‘the slight percentage of carbonic acid in the atmosphere
may, by the advances of industry, be changed to a noticeable degree in the course of a
few centuries’ (1908, p. 54).
Just before the outbreak of World War II, a British steam engineer, Guy Callendar, presented a breakthrough paper to the Royal Meteorological Society entitled ‘The Artificial
Production of Carbon Dioxide and Its Influence on Temperature’. Callendar pointed out
that the atmospheric concentration of CO2 had significantly increased between 1900 and


FIGURE 1.1 Notable early scientists studying carbon dioxide (CO2) and climate

1935, which he determined with temperature measurements from 200 meteorological stations. Based on further calculations, he concluded that:

As man is now changing the composition of the atmosphere at a rate which must be
very exceptional on the geological time scale, it is natural to seek for the probable
effects of such a change. From the best laboratory observations it appears that the
principal result of increasing atmospheric carbon dioxide . . . would be a gradual
increase in the mean temperature of the colder regions of the earth.
(1939, p. 38)

Unfortunately, Callendar’s prescient recognition of the role of human activity on atmospheric temperatures had to wait several decades to become widely accepted.
In May 1958, the American scientist Charles David Keeling (1928–2005) installed
four infrared gas analyzers at the Mauna Loa Observatory in Hawaii; these recorded an
ever-increasing atmospheric CO2 concentration. These measurements have been collected
continuously since 1958, resulting in the so-called Keeling Curve (see Figure 1.2), which
is a graph plotting ongoing change in concentration of CO2 in the Earth’s atmosphere.
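To see the structure of such a curve concretely, here is a minimal Python sketch using synthetic, made-up numbers (not Keeling’s actual data): a slow upward trend plus an annual seasonal cycle, with the long-term rise then separated from the seasonal wiggle by fitting a trend.

```python
import numpy as np

# Synthetic stand-in for monthly CO2 readings (ppm); the coefficients are
# illustrative, chosen only to resemble the shape of the Keeling curve.
years = np.linspace(1958, 2018, 12 * 60)             # monthly samples, 60 years
trend = 315 + 0.8 * (years - 1958) + 0.012 * (years - 1958) ** 2
seasonal = 3.0 * np.sin(2 * np.pi * years)           # annual up-and-down cycle
co2 = trend + seasonal + np.random.normal(0.0, 0.3, years.size)

# Fit a smooth quadratic trend to separate the long-term rise from the cycle.
coefficients = np.polyfit(years, co2, deg=2)
fitted_trend = np.polyval(coefficients, years)

print(f"Fitted level in {years[0]:.0f}: {fitted_trend[0]:.1f} ppm")
print(f"Fitted level in {years[-1]:.0f}: {fitted_trend[-1]:.1f} ppm")
```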
Keeling’s measurements provided evidence of rapidly increasing CO2 levels in the
atmosphere, and a 1979 report by the National Research Council—an American non-profit, non-governmental organization devoted to scientific research—connected this evidence to a rise in average temperature. This report predicted that doubling CO2 concentration in the atmosphere from 300 to 600 parts per million would result in an average
warming of 2° Celsius to 3.5° Celsius. (Parts per million, or ppm, is a unit for measuring
small amounts of a substance in some mixture.) We haven’t yet reached the ominous
level of 600 ppm, but we’re now long past safe levels of CO2 in the atmosphere, which
had been estimated to be about 350 ppm.
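For concreteness, the arithmetic behind the ppm unit:

\[ 400\ \text{ppm} = \frac{400}{1{,}000{,}000} = 0.0004 = 0.04\%, \]

that is, at 400 ppm, about 4 of every 10,000 molecules in a sample of dry air are CO2.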
In the past several decades, climate scientists have been tracking CO2 levels in the
atmosphere with ever more precise and sophisticated techniques. For example, ice cores
taken from various locations in Antarctica have enabled scientists to extrapolate historic
CO2 levels for comparison to recent levels (see Figure 1.3). A group of 78 scientists
gathered data from ‘climate proxies’ besides ice cores—including tree rings, pollen, corals,
glacier ice, lake and marine sediments, and historical documents about the climate—to

[Figure: CO2 concentration (ppm) by year recorded at Mauna Loa Observatory, 1960–2015, rising from about 315 ppm toward 410 ppm, with an inset showing the annual cycle (Jan–Jan); May 27th 2018 CO2 recording: 411.39 ppm]
FIGURE 1.2 Keeling curve: ongoing increase in atmospheric concentrations of CO2

[Figure: CO2 concentration (ppm) reconstructed from Antarctic ice core measurements (1700–1958), with a map inset of core sites including Vostok, Dome C, Law Dome, Taylor Dome, Dome A, Siple Station, and the Ronne and Ross Ice Shelves]
FIGURE 1.3 Ice core data from Antarctica


[Figure: carbon dioxide level (parts per million) over the past 400,000 years (x-axis: years before today, 0 = 1950); annotations mark the current level, the 1950 level, and a line that atmospheric carbon dioxide had never risen above for millennia]
FIGURE 1.4 Unprecedented increases in atmospheric CO2 in the past century

demonstrate that there are multiple lines of evidence for increasing levels of CO2 in the
atmosphere (see Figure 1.4) and that the average temperature for the end of the 20th
century is higher than in the previous two millennia (Ahmed et al., 2013).
The unprecedented pace of current climate change and its connection to human activities like burning fossil fuels, cattle ranching, and clear-cutting rainforests are clear. In the previous 800,000 years, the concentration of CO2 in the atmosphere had never been over 285 ppm. Since the Industrial Revolution—only 0.025% of the last 800,000 years—the concentration has spiked to 412 ppm. The milestone of 400 ppm was reached in March 2015 (see www.co2.earth). CO2 concentration measured 409.39 ppm on May 30th, 2017, the day before Donald Trump announced that he would withdraw the US from the Paris Agreement. One year later, in May 2018, the concentration had risen further, to 412 ppm. The last time CO2 levels were this high, humans did not yet exist. The average temperature of our planet has gone up by about 0.85° Celsius (1.5° Fahrenheit) since 1880, and the last three decades are estimated to have been the hottest in the last 1,400 years.
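As a check on that percentage: taking the Industrial Revolution to span roughly the last 200 years, 200 years out of 800,000 years is 200/800,000 = 0.00025, or 0.025%.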

The Role of Science


We have already articulated the reasoning leading to the conclusion that human activities
are radically altering Earth’s climate. But how do scientists really know? The short answer
is that scientists know this in the same way that they have come to know anything else.
Scientists know that the structure of DNA is a double helix, that Neptune takes more
than 164.79 years to orbit the Sun, that HIV is a retrovirus that attacks T-cells, and so
on. These and other facts all have good science behind them. None were obvious to begin
with; scientists had to reason their way to the correct answer.
Understanding how scientists acquire new knowledge, the basis for science’s authority
as a source of knowledge, and the limits of that authority gives us greater reason to trust
scientific knowledge. This is so whether the knowledge is about DNA, Neptune’s orbit,
HIV, or climate change.
First, it’s important to consider the nature of expertise. You should trust climate scientists to do climate science in the same way you trust your mechanic with your car or your favorite restaurant with your dinner. The types of expertise required for these positions
take years, even decades, to develop, and the expertise doesn’t neatly transfer from one
domain to another. Don’t trust the average climate scientist to fix your car or make you
a delicious meal. Similarly, politicians and policy-makers know things about political and
legislative matters, but they should not be looked to as authorities on climate change. This
includes politicians who deny climate change, as well as those who grant its existence.
Reputable scientists and scientific societies, including the national science academies
of the world and the Intergovernmental Panel on Climate Change (IPCC), agree that
human-caused, or anthropogenic, climate change is occurring. This includes virtually all
climatologists. In 2004, for instance, the historian of science Naomi Oreskes analyzed
928 abstracts on climate change published in peer-reviewed scientific journals from 1993
to 2003; none expressed disagreement with the consensus position that anthropogenic
climate change is occurring (Oreskes, 2004). In 2010, a group of researchers studied the
views of the top 200 climate scientists (defined as the scientists with the most extensive
publication records) and confirmed that more than 97% actively affirm the existence
of anthropogenic climate change as described by the IPCC (Anderegg et al., 2010). So
there is striking agreement among climate scientists about the existence of anthropogenic
climate change.
Climatologists’ agreement on climate change is grounded in a rich body of independent
sources of evidence that support the same conclusion: human activities are causing Earth’s
atmosphere to heat up. Well-established theories in physics explain how heat radiation
works. Physical chemistry shows how CO2 in the atmosphere traps heat, contributing
to greenhouse effects. As we pointed out, at least since the 1890s, scientists have known
about the relationship between CO2 buildup and average global temperature. Satellites
and other technology have enabled scientists to collect many different types of information about relevant changes on our planet—including variations of sea level and of oceans’
temperatures, and the decreasing mass of polar ice sheets. Since the 1950s, scientific
models and computer simulations have been helping scientists to make testable predictions about what would happen to the global climate in response to different changes in
human activities. Evidence has confirmed these predictions.
And, yet, despite decisive scientific evidence, public awareness and concern for climate
change lag behind the research (Lee et al., 2015). As of 2016, four out of every 10 adults
worldwide hadn’t even heard of climate change. Whether or not people are sensitive to
the risks of climate change mainly depends on understanding its human causes and on
one’s level of education. In some countries, like the US, however, being better educated
doesn’t guarantee that one is more likely to believe that climate change is really happening and is caused by human activities. Instead, political views are a better predictor of
Americans’ belief in and concern about the reality of climate change.
People who don’t know much about some topic also tend to experience an illusion
of understanding, where a lack of genuine understanding of some topic is linked to a
lack of appreciation for the depth of one’s ignorance about that topic. Applied to climate
change, this means that people who have no advanced education or training in science,
or who otherwise don’t understand how the climate works, tend to have unwarranted
confidence in their ability to assess scientific findings or make pronouncements about
climate change.


The illusion of understanding has become easier to sustain in today’s society. In part,
this is because finding information through internet searches (so-called Google knowing)
has diminished genuine understanding. We also have limited opportunities for productive
public discourse and disagreement; our conversations online and in person tend to happen
with people who have beliefs similar to our own.
Improving public climate literacy is thus important for informed public engagement
with global warming. And, more generally, understanding the processes that give rise to
trustworthy scientific knowledge is vitally important to deciding what to believe, whom
to believe about what, and how to learn more.

What Science Is Good For


Let’s back up. Why is science so important?
The most obvious answer is science’s role in satisfying our practical goals. Many fun
and useful—even life-changing—innovations have come about through computer science.
The biological and pharmaceutical sciences have vastly improved medical care and our
ideas about healthy living. Skyscrapers and airplanes wouldn’t be possible without a lot of
physics. The list goes on and on. But practical benefits aren’t the only important outcome
of science. More generally, science is the best approach we humans have developed for
answering questions about the natural world.
At its heart, science aims at the production of knowledge. Philosophers have tradition-
ally thought of knowledge as requiring at least three elements: belief, justification, and
truth. First, belief is necessary for knowledge; you can’t know something without believing
it is true. But to know something, it’s not enough to fervently believe it. Knowledge—
including scientific knowledge—is an achievement; certain conditions must be met for
a belief to count as knowledge. Knowledge requires justification. To know something,
one must have good reasons to believe it is so. Finally, sufficiently justified belief isn’t
enough. One could be justified in believing something that still turns out to be false. In
2007, most American football fans had the justified belief, in some cases a fervently held
belief, that the New England Patriots would cap off their perfect season by winning the
Super Bowl. But those football fans didn’t know this, because they were wrong. Justified
beliefs must also be true to count as knowledge.
Consider the knowledge that the Earth’s atmosphere is warming up. On the traditional
conception of knowledge as justified true belief, you have this knowledge just if (i) it’s
true that the Earth’s atmosphere is warming up; (ii) you believe that it is warming up;
and (iii) you are sufficiently justified in believing that it is warming up.
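These three conditions can be stated compactly. In standard epistemic-logic notation (a restatement of the conditions just listed, not a formalism used elsewhere in this book):

$$K(S, p) \iff p \;\wedge\; B(S, p) \;\wedge\; J(S, p)$$

where $K(S, p)$ reads 'subject $S$ knows proposition $p$', the bare $p$ asserts that $p$ is true, $B(S, p)$ that $S$ believes $p$, and $J(S, p)$ that $S$ is sufficiently justified in believing $p$. Remove any one conjunct and the claim to knowledge fails, as with the football fans' justified but false belief.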
Science is important because it is our best route to knowledge about the world around
us. And scientific knowledge also often has practical benefits and can influence how we act.
If you genuinely know that Earth is warming up and understand why that’s the case, then
you may change the ways you behave—for instance, by petitioning your government, your
society, and your circle of family and friends to develop more energy-efficient practices.
Some scientific knowledge is so-called pure knowledge, or knowledge for its own sake.
For example, scientists have investigated the conditions under which rainbows form, not
because they think that learning about rainbows will generate technological inventions
or cure diseases, but simply because they are interested in optics. Investigating rainbows
yields knowledge about the nature of light and color. Knowledge of these things may
have applications, but that is not why scientists study them. Scientific research that aims
at knowledge for its own sake is sometimes called basic research.
Not all knowledge is equally valuable. For example, it wouldn’t be valuable to know
how many rainbows have ever occurred on Earth; such truths are pointless truths. When
science aims for pure knowledge, the aim is explanatory knowledge, or generating knowl-
edge of how things work and why things are the way they are. We know so much about
our world, and we understand so many things because of scientific discoveries and
theories.
A different type of scientific research is applied research. Scientific research is applied
when it exploits knowledge in order to develop some product, like software, pharma-
ceutical drugs, or new materials. Often, a central motivation for applied research is to
generate products for profit. For example, the scientists who discovered the neurotrans-
mitter dopamine in the human brain in 1957, Kathleen Montagu and Arvid Carlsson,
were doing basic research; by contrast, scientists who are employed by pharmaceutical
companies to improve upon existing dopamine-related treatments for Parkinson’s disease
are doing applied research.
As this suggests, basic and applied scientific research can operate synergistically.
Scientists aiming at the production of knowledge for its own sake often rely on the new
materials and techniques created by scientists doing applied research, while scientists
doing applied research often exploit pure scientific knowledge in order to develop new
products.

Science’s Limitations
So science is our best route to knowledge about the world around us and to developing
innovations based on that knowledge. To appreciate science’s significance, it’s also impor-
tant to recognize what it doesn’t do.
Scientists try to gain knowledge about certain kinds of phenomena, or appearances
of things occurring in the world, and they do so in a certain kind of way. The list of the
phenomena investigated in science is long; in principle, it includes everything in our uni-
verse. But there are some important limitations to the scope of science. Science doesn’t
replace or limit non-scientific intellectual pursuits, like literature or philosophy—or poli-
tics for that matter. Basing our scientific knowledge about climate change on fluctuating
political agendas would be a mistake. But when it comes to addressing climate change
with policy interventions, debating which steps are politically feasible and desirable is
fair game for politicians.
Scientific knowledge differs from theological doctrine and religious practice too. Unlike
religious practitioners, scientists attempt to explain things without appeal to supernatural
entities or influences, such as deities or miracles, or to literary allegories or culturally
significant myths. Of course, one can be religious in any number of ways, and people can
be religious and believers in scientific knowledge, or even scientists themselves. People
disagree about the role religion should play in our society, but whatever role that might
be, science is not designed for fulfilling the role of religion.
Science’s limitations will become clearer in the next section, where we examine what
distinguishes science from other human projects.
EXERCISES

1.1 How do scientists know that human activities are radically altering Earth’s climate?
Why are these changes a serious concern?
1.2 Do all scientists, by virtue of being scientists, have the expertise to make pronounce-
ments about global warming? Give reasons to support your answer.
1.3 Some people know much more than the average layperson about some topics; these
people are experts on those topics. Think of at least three people you consider to
be experts and their areas of expertise. Why exactly do you consider them to be
experts? Is your answer the same or different for the three experts you listed? Why?
1.4 Laypersons are not always in a position to recognize who is a genuine expert on a
certain topic. Many people don’t know enough about the topic to assess expertise,
and genuine experts sometimes disagree with one another about the topic of their
shared expertise. Think again of the people you listed as experts in Exercise 1.3. How
can laypeople identify whether they should trust each of these experts? Considering
your answers, describe the kind of evidence, in general, that a layperson can use to
identify genuine expertise.
1.5 Based on the text or your other knowledge, list a few reasons why public concern about
anthropogenic climate change lags behind scientific research. Given that lag, how
should climate scientists affect environmental policy in the government? Should they
merely collect evidence and produce knowledge, leaving the construction of policy to
policy-makers? Do they have any obligations to more actively engage with the public?
1.6 Define knowledge, and say how science relates to knowledge. What are the limita-
tions to the kinds of knowledge science can produce?
1.7 What’s distinctive about science, in comparison to activities like literature, music, and
art, as a source of knowledge about the world? Do you think there are any important
differences between scientific and artistic ways of gaining knowledge? Support your
answers with justification.
1.8 Define basic research, and describe why you think scientists may choose to pursue it.
Is basic research important? If so, how? Should it be funded by the government? Why
or why not? How do you think it should be decided what kind of scientific research to
fund?

1.2 DEFINING SCIENCE

After reading this section, you should be able to do the following:

• Define pseudoscience and give examples
• Describe how you might define science by its history, its subject matter, and its
methods
• List the most important features of science and characterize each feature
• Analyze whether a given claim or topic of research counts as scientific

The Tricky Work of Defining Science


In the last section, we described some of the clear and abundant scientific evidence for
anthropogenic climate change. But why should we trust climatologists here and not, say,
astrologists? After all, astrologists make predictions about human affairs and events in
their horoscopes. A tempting answer to this question is that climatology is a science, while
astrology is not—that’s why you should trust climatologists and not astrologists. But then
this raises a new question: What is science? If we are going to exclude astrology, we have
to be able to say what makes it unscientific.
This is a question about the nature of science. It can be divided into two parts. First,
there’s a question of what’s distinctive and important about science when it comes to
generating knowledge. Second, there’s a question of where we ought to draw the lines
between science and non-science. These questions are, respectively, about the definition
of science and how to draw the boundaries of science—that is, how to demarcate or set
the boundaries of science in contrast to other kinds of projects.
We have already suggested that science is unrivaled in its ability to generate explana-
tory knowledge about our world. Science has earned a kind of authority and legitimacy
from centuries of successes and improvements. The demarcation of science from non-
science is especially important because some non-scientific projects are designed to look
enough like science to deceive people into thinking that these projects too can lay claim
to the authority and legitimacy of science. These deceptive attempts to appear scientific
are sometimes called pseudoscience, which literally means false, or fake, science. (Other
non-scientific projects don’t pretend to be scientific, and these are perfectly fine.)
A classic example of pseudoscience is astrology (which shouldn’t be confused with
astronomy, the scientific field addressing celestial objects and space). Astrology is com-
monly associated with horoscopes, which use zodiac signs to make predictions about
future events, romantic relationships, health, job prospects, and the like. Tests of astrol-
ogers’ theories, however, have demonstrated their utter failure to predict or explain.
Perhaps for this reason, advocates of astrological theories, as a community, rarely engage
in systematic attempts to test those theories, and the theories have changed little since
astrology peaked in popularity centuries ago. These theories fail to incorporate accumu-
lated scientific knowledge of physical mechanisms or add any such knowledge of their
own. And yet, even though astrology is bunk, it is often promoted as a legitimate source
of prediction and explanation. Massive numbers of astrologers, clairvoyants, psychics, and
other charlatans earn billions of dollars every year for their consultations.
The seductive allure of astrology illustrates why a definition of science is needed.
Science has an ever-increasing impact on society, and it’s dangerous for pseudoscience
to be taken seriously in this way or for good science to be dismissed as no better than
pseudoscience. How we define science determines who has the authority to speak for the
scientific community, who has the legal standing to offer expert testimony, what kind
of ideas should inspire the health care we receive, who gets to apply for public funding
of science, what ideas about the world we take seriously, what should be taught in the
classroom, and many other important matters.
Pseudoscientific ideas, by masquerading as science, have done great damage to our
legitimate knowledge of the world and the laws and policies informed by science. Besides
astrology, damaging pseudosciences include creationism and intelligent design, which are
religiously inspired beliefs intended to compete as scientific alternatives to evolutionary
theory; conversion therapy, where psychological or spiritual intervention is used to try to
change a person’s sexual orientation; and homeopathy, which is a system of alternative
medicine based on the idea that a substance that causes the symptoms of a disease in healthy people can cure similar symptoms in sick people when administered in highly diluted preparations. Some
versions of climate change denial also have features of pseudoscience.
So, a good definition of science must exclude pseudoscience. This is harder than it
might at first seem. Take astrology, for example. Astrology’s mystical origins might be a
reason to deem it unscientific. However, chemistry had its origins in alchemy, mystical
ideas developed in the Middle Ages and Renaissance that aimed at discovering methods
for converting baser metals into gold and finding an elixir of life, among other things.
Some people believe astrology for irrational or illogical reasons. But beliefs in certain
principles of quantum physics can appear strange and irrational. You might point out
that, for proper sciences, it’s possible to get a degree in that discipline in academic
institutions. But there are also organized institutes issuing degrees in graphology—the
discredited, pseudoscientific study of handwriting to discern a person’s character. Perhaps
astrology is not scientific because it’s too narrowly focused on personality and the posi-
tion of celestial bodies, without taking into account other accepted scientific theories.
Then again, economics is focused specifically on the production, exchange, distribution,
and consumption of commodities and hardly attends to any scientific theories from
other fields.
As you see, it’s not simple to find the essential difference between science and non-
science, especially when pseudoscience is added into the mix. Sure, no one mistakes
English literature for science. But can you state how they’re different exactly? Even this
may be more difficult than you might think. Scientific results and discoveries are com-
municated in books, articles, and conferences, as is research in English literature. There are
scientific and clinical literature reviews; why aren’t these just another form of literature?
Studies of English literature employ textual, linguistic, and historical evidence. How is
this different from the use of evidence in science? Even when the line between science
and non-science is clear, articulating the relevant differences is challenging.
There is also tremendous variety in science. One common distinction is between the
physical sciences, like astronomy and inorganic chemistry; the life sciences, like botany and
neurobiology; and the social sciences, like anthropology, social psychology, and behavioral
economics. Different fields often study different kinds of things, make different assump-
tions, use different methods, and have different aims. This variety makes it even more
difficult to formulate a general definition of science. Any feature used to define science
may accidentally exclude not just pseudoscience and non-science but some scientific
projects as well. For example, some projects in theoretical physics seem to be very dis-
tantly related to empirical evidence; so it seems we don’t want to require all science to
be tested directly with empirical evidence.
These difficulties may make it impossible to give a neat and exceptionless definition
of science. But there are many candidates for distinctive ingredients of science. Our next
task is to survey some of those ingredients and assess how well they do in demarcating
science from other activities. Ultimately, we’ll conclude that none of these ingredients is
decisive by itself, but together, they are useful as a guide to help us puzzle through what
counts as science.

Defining Science by Its History


We have already noted that science aims at the production of knowledge. This aim traces
back to the origins of the very word science. This word derives from the Latin words sci-
entia and scīre, which pertain to knowledge. So science, from its origins, has been about
the pursuit of knowledge. But pursuing knowledge isn’t the exclusive province of science.
Looking more closely at the origins and history of science might help us diagnose how
science’s pursuit of knowledge is distinctive.
Most historians of science agree that the cultural, social, and technological changes that
unfolded in Europe between roughly 1550 and 1700 are very important to the origination
of modern science. This period, often referred to as the Scientific Revolution, began with
the work of the Polish astronomer Nicolaus Copernicus (1473–1543), who put forward
a heliocentric theory of the cosmos, and ended with the work of the English physicist
Isaac Newton (1642–1727), who proposed universal laws of physics and a mechanical
universe in his famous treatise Philosophiæ Naturalis Principia Mathematica. The Scientific
Revolution brought about fundamental transformations in our knowledge of the natural
world and in how claims to knowledge ought to be justified. Many of the methods and
ideas developed during that period remain at the heart of science.
But let’s start our consideration of science’s history even further back. Way before
the Scientific Revolution, a variety of innovations across diverse civilizations—including
ancient Egypt, Iran, India, China, Greece, and the pre-Columbian Americas—provided
fertile grounds for proto-scientific activity. For example, a variety of civilizations contrib-
uted to the refinement of systems of weights and measures, which was important for a
number of later scientific developments. Early catalogues of descriptions of constella-
tions provided a record of observations against which later astronomical predictions and
discoveries could be checked.
Arguably the most important period in the development of science prior to the
Scientific Revolution was the 500 years from the 8th through 13th centuries, known as
the Islamic or Persian Golden Age, involving the work of many scholars from Central Asia
to the Iberian Peninsula. Here is a brief account of some of the scientifically important
developments from that period.

FIGURE 1.5 Scientists in the Persian Golden Age

The Hindu-Arabic numeral system, which greatly advanced the symbolic representation of numbers and calculation, was invented between the 1st and 4th centuries in
India. The Persian polymath Muḥammad ibn Mūsā al-Khwārizmī (c. 780–c. 850) further
developed this system and brought it to Arabic mathematics, and his work later intro-
duced this numeral system to Medieval Europe. Al-Khwārizmī also made significant con-
tributions to algebra, geometry, and astronomy. Around the same time, the Persian Abū
Bakr Muhammad ibn Zakariyyā al-Rāzī (854–925) was responsible for many innovations
in medicine, including advocating for experimental methods and developing classifica-
tions of contagious diseases. And the Arab scientist Ibn al-Haytham (c. 965–c. 1040) did
revolutionary work in optics and vision, including the insight that vision occurs by eyes
detecting light deflected by objects.
Other Persian and Arabic polymaths, including especially Ibn Sina (980–1037), known
also by the Latinized name Avicenna, as well as ibn Aḥmad Al-Bīrūnī (973–1048) and Ibn
Rushd (1126–1198), or Averroes, preserved and developed theories about the natural world
from the Greek philosopher Aristotle (384–322 BCE). This was in turn the basis of ideas
about the natural world in 15th-century Europe, with ideas added from Christian, Jewish,
and Islamic cosmogony and theology. Based on Aristotle’s views, the universe was thought
to be geocentric—the Earth at the center—and with two regions: terrestrial for Earth and
celestial for the planets and stars. The celestial region was thought to contain transparent
concentric spheres that rotate around the Earth. The Greco-Egyptian astronomer Ptolemy
(c.100–168) had supplemented this with an account of the apparent motions of the stars
and planetary paths, including detailed models and tables that could be used to calculate
the positions of the stars and planets. Geocentrism in 15th-century Europe blended observa-
tions and calculations with religious ideas and ideas about humanity’s place in the universe.
A longstanding problem with the geocentric view of the cosmos was the appearance
of so-called retrograde motion. The planets sometimes seem, in observations made over a
series of nights, to stop in their orbit, reverse course back across the sky, then stop again,
and reverse course yet again to continue on their original way. An example of this is shown
in Figure 1.6. Following Ptolemy, geocentrists explained retrograde motion by positing
epicycles: each planet travels around a small circle whose center itself moves along the larger orbit around Earth. This successfully
accounted for retrograde motion, but it wasn’t as intuitive and seemingly obvious as the
other elements of geocentrism.

FIGURE 1.6 Appearance of retrograde motion

In 1543, in what is considered to be the beginning of the Scientific Revolution, Copernicus presented a radical alternative conception of the cosmos as heliocentric, or
centered around our sun. This provided an alternative explanation for retrograde motion.
According to heliocentrism, retrograde motion of planets was due to Earth changing
position relative to other planets as they all revolved around the sun. Copernicus’s pro-
posed heliocentric conception of the cosmos was met with skepticism. It violated widely
accepted beliefs and called for a fundamentally new physics of the heavens. Besides, the
mathematics of Copernicus’s system was just as complex as Ptolemy’s epicycle solution
to retrograde motion, and it did not make predictions of planetary motion any more
accurate. So, few astronomers were convinced by Copernicus’s system.
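The heliocentric explanation of retrograde motion can even be checked with a toy computation. The sketch below assumes idealized circular, coplanar orbits and rounded values for Mars's orbital radius and period (it is an illustration, not a serious ephemeris); it tracks where Mars appears against the sky from the moving Earth and reports the stretches where that apparent position drifts backward.

```python
import math

# Idealized circular, coplanar orbits with rounded values (illustration only):
# Earth at 1.00 AU with a 1.00-year period, Mars at 1.52 AU with a 1.88-year period.
def position(radius_au, period_yr, t_yr):
    angle = 2 * math.pi * t_yr / period_yr
    return radius_au * math.cos(angle), radius_au * math.sin(angle)

previous_longitude = None
for day in range(0, 800, 20):
    t = day / 365.25
    ex, ey = position(1.00, 1.00, t)  # Earth
    mx, my = position(1.52, 1.88, t)  # Mars
    # Apparent direction of Mars as seen from Earth, in degrees:
    longitude = math.degrees(math.atan2(my - ey, mx - ex)) % 360
    if previous_longitude is not None:
        drift = (longitude - previous_longitude + 180) % 360 - 180
        if drift < 0:
            print(f"day {day}: Mars appears to move backward (retrograde)")
    previous_longitude = longitude
```

No epicycles are posited anywhere in the calculation; the apparent reversals around opposition fall out of the two orbital motions alone.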
The situation changed with the research of Michael Möstlin (1550–1631), Johannes
Kepler (1571–1630), and Galileo Galilei (1564–1642), each of whom championed and
improved the Copernican heliocentric system. Möstlin and Kepler were German math-
ematicians and astronomers who also took an interest in astrology. Kepler devised a set of laws
that described the motions of planets around the Sun. Based on calculations of the orbits
of Mars, he inferred that planets do not have the circular, uniform orbits proposed by
Copernicus. Their orbits are ellipses. This simplified the Copernican theory and signifi-
cantly improved the predictive accuracy of heliocentric models.
Born in Italy, Galileo is one of the most important figures of the Scientific Revolution.
He was instrumental in establishing Copernicus’s heliocentric system and, more generally,
in replacing Aristotelian mechanics of the separate terrestrial and celestial realms with a
new single physics. Galileo built greatly improved telescopes, which he used to observe the phases of
Venus and to discover that Jupiter had moons orbiting it. This was a significant discovery
for heliocentrism: if our Earth were the center of the universe around which all things
orbited, then those moons should be orbiting Earth instead.
Recalling the main purpose of this discussion, might this early period of science give
us a way to approach defining it? In the Scientific Revolution, the rapid development
of new ideas, methods, and tools resulted in the swift accumulation of knowledge. A
similar process played out in the later development of the fields of chemistry, biology,
and psychology. Perhaps, then, science can be defined simply as those pursuits that have
descended from the Scientific Revolution. Something ‘clicked’ that facilitated the devel-
opment of knowledge about our world, and today’s scientists are still engaged in that
process of accumulating knowledge.
One problem with this suggestion is that many of the pursuits that furthered scientific
knowledge also included religious, theological, and philosophical ideas that we would not
consider scientific nowadays. In the Persian Golden Age and the Scientific Revolution,
philosophy, theology, and science were not divided as they are now, and often, the same
ideas had significance for religious belief and for beliefs about the natural world. Another
problem with defining science straightforwardly by its history is that it’s unclear whether
and how some of today’s scientific disciplines, like economics and neurolinguistics, relate
to the Scientific Revolution.
Perhaps we can instead look to the methods developed as science was established as
the defining features of science. Methods established in the Persian Golden Age and the
Scientific Revolution that may be characteristic of science include looking to sense experi-
ence and performing experiments to decide what’s true, the systematic use of mathematics
to study phenomena, and the institutionalization of investigation in formal organizations.
These will all find their way into our eventual attempt to identify the main ingredients
of science. But scientific methods have also significantly developed and changed since the
Scientific Revolution. For example, statistical and computational methods emerged in the
late 1800s. These methods are staples of present-day science, and they are essential for
understanding complex phenomena like Earth’s climate. The institutional and social struc-
tures governing scientific practice have also undergone massive changes in the last centu-
ries. One profound transformation in the social organization of scientific activity was the
professionalization of science in Europe and North America beginning in the mid-19th
century. So, although science’s methods are key to defining it, we’ll have to look beyond
the Persian Golden Age and the Scientific Revolution to fully characterize those methods.
Here’s one more idea for defining science inspired by this quick look into science’s
history: perhaps we can define science by focusing on what it is that scientists investi-
gate. The Scientific Revolution was a decisive step toward the separation of scientific
from non-scientific questions. Recall that geocentrism had implications not just for the
natural world but also for religious belief and views of humanity’s role in the universe.
Heliocentrism was more explicitly a view just about the universe around us. So maybe the
definition of science relates to its subject matter—the world we see around us—as distinct
from philosophical, religious, and theological investigations of, for example, meaning and
purpose. We’ll explore this idea next.

Defining Science by Its Subject Matter


In science, the world itself and all of its parts and properties are investigated in order to
better understand and control them. This seems different from other human projects.
So, we might look to the subject matter of science—planets, animals, disease, and so
forth—to define it.
An immediate problem with this approach is the sheer variety of topics among the vari-
ous fields of science. Subjects of these investigations range from subatomic particles like
quarks, to DNA, emotions, societies, and many other things besides. It can seem as if there
is a science of absolutely everything! Professional sports are a good example. Some scientists
devote their research to learning how to improve athletic performance. Other topics of
scientific research are even more abstract or hypothetical. An example is string theory—a
highly theoretical subject in physics that posits one-dimensional ‘strings’ as the basic building
block of our universe. Even if we could give a list of all the topics of science, this wouldn’t
be a good way to define science. We want to be able to say something about what all those
subjects have in common and why pseudosciences like astrology don’t belong on the list.
So we should look for what all these various topics of science have in common. Here’s
an idea. Recall from the end of Section 1.1 that, in contrast to religious belief and litera-
ture, science attempts to explain things without appeal to the supernatural, to allegories,
or to myths. We can thus describe the aim of science as providing natural explanations
of natural phenomena.
Natural phenomena are objects, events, regularities, or processes that are sufficiently
uniform to make them susceptible to systematic study. Disease epidemics, lunar eclipses,
and droughts are all natural phenomena. Inflation, poverty, and unemployment are all
phenomena in human societies, but they also count as natural phenomena under this
definition. We’ve already defined phenomenon as that which appears, is seen or otherwise
experienced. Phenomena include all observable occurrences, where observable means
detectable with the use of our senses, including the use of our senses aided by technologi-
cal devices like telescopes that extend their reach. The requirement that natural phenom-
ena be uniform, or occur according to some pattern, makes it so that different scientists in
different times and places can observe the same natural phenomena. Observability across
people, times, and places is essential to scientific study.
Natural explanations invoke observable features of the world to account for natural
phenomena. If there’s an epidemic in Florida or increased employment in Colombia, you
might wonder how that came to be. A natural explanation of the epidemic might specify
a contagion and a mechanism of transmission, or other such factors. A natural explanation
for the increase in employment might specify private investments in industry and legisla-
tive choices made by labor unions and political parties. These are natural explanations
of natural phenomena.

Box 1.1 Naturalism and the Meaning of ‘Natural’


Two forms of naturalism often come up in discussions about the character of science.
Methodological naturalism is the idea that scientific theories shouldn’t postulate
supernatural or other spooky kinds of entities. Ontological naturalism is the idea
that no such entities exist (ontology is the branch of philosophy concerned with the
study of what exists). One can believe methodological naturalism is true without
subscribing to ontological naturalism.
The idea that something is or isn’t natural also comes up in public debates about
applications of scientific knowledge, for example in debates about genetically modi-
fied organisms (GMOs). In these debates, calling something natural loosely means
that it was produced by nature without human intervention. Most people believe that
natural things are healthier, morally better, or kinder to the environment than unnatural
things. This wasn’t always so. Until the end of the 19th century, natural product
meant perishable or toxic (Stanziani, 2008). Our current understanding of natural
as an indicator of healthfulness and safety is influenced by social concerns about
technological innovation. There are certainly cases in which lack of human interven-
tion is better, but the general association many people make between natural and
good is not based on scientific evidence (Rozin et al., 2012). In any case, this sense
of natural is wholly different from the idea of naturalism we discuss here.

The meaning of natural in this context can be better understood by contrasting it
with the term supernatural. Supernatural entities and occurrences, if they exist, are not
governed by natural laws and may not be observable. Any supernatural entities or occur-
rences that might exist are not natural phenomena, and so they are not relevant to science.
Were there to be any supernatural entities or occurrences, like miracles or ghosts, science
by definition won’t be able to deliver knowledge about them. Nor does science appeal
to supernatural entities or occurrences in order to explain other things. For instance, ‘A
miracle caused her to recover from disease’ couldn’t possibly be a scientific explanation,
even though recovering from a disease is a natural phenomenon.
Science is always naturalistic in what it investigates and how it explains. Notice that
this does not mean science has demonstrated that there are no supernatural entities
or occurrences. Science simply can’t tell us anything about miracles, ghosts, or other
supernatural subjects, not even whether or not they exist. These are simply outside the
realm of what science can investigate because scientific investigation is limited to natu-
ralistic inquiry. This suggests that science need not interfere with most forms of religious
belief. The exception is when religious belief is used to provide competing explanations
for natural phenomena.
However, pursuit of natural explanations for natural phenomena doesn’t by itself
adequately demarcate science from non-science. Some naturalistic approaches to natural
phenomena aren’t things we consider to be scientific. Take football coaching, for example.
Its subject matter ranges from physical training and development of individual technical
skills to psychological motivation and knowledge of tactics and strategy, and coaching
employs what we know of the world to engage with this subject matter. But football
coaching is not a science. Naturalism might be an ingredient of science, but it isn’t defini-
tive of science all by itself.

Defining Science by Its Methods


In our attempts to define science by its history and subject matter, we’ve touched upon
one distinctive ingredient of science’s methods: science involves empirical investigation
using one’s senses. These methods facilitated the breakthroughs of the Persian Golden
Age and the Scientific Revolution and are linked to the importance of naturalism for
science. What scientists see, hear, smell, touch, and so forth can all be used as empirical
evidence for or against some attempted natural explanation.
But the use of empirical investigation is, by itself, not enough to define science. We all
use our senses in everyday life to learn about the world around us, beginning when we
are infants. You know when it is time to wake up because you hear your alarm go off.
You know it’s a clear day because you can see and feel the sun shining through the win-
dow. This approach to gaining knowledge has been fine-tuned and perfected in scientific
reasoning, but empirical investigation using one’s senses is part of the human condition,
not distinctive to science.
So, let’s look for other methods that, when coupled with empirical investigation, might
be used to define science. A hint of where to start comes from the constant revision of
scientific ideas. Even scientific theories that are widely held are continually subject to
investigation. Occasionally, widely believed theories are rejected as a result of this continu-
ing investigation. Much more often, continual, critical, and self-corrective investigation
results in theories being fine-tuned and expanded in the light of new evidence.
This continual investigation results from science’s commitment to evidentialism, the
idea that a belief’s justification is determined by how well the belief is supported by
evidence. Coupled with science’s commitment to empirical investigation, evidentialism
suggests that scientific beliefs should be supported by empirical evidence. Any scientific
claim about the world comes with the burden of showing why that claim should be
believed.
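The idea that justification is proportional to evidential support can also be made quantitative. One standard formalization, offered here only as a hedged illustration (all the numbers are invented, and nothing in this chapter depends on this particular formalism), is Bayes' theorem, which dictates how a degree of belief should change as evidence arrives:

```python
# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E).
# All probabilities below are invented for illustration.
def update(prior, prob_evidence_if_true, prob_evidence_if_false):
    """Return the revised credence in a hypothesis after observing evidence."""
    prob_evidence = (prob_evidence_if_true * prior
                     + prob_evidence_if_false * (1 - prior))
    return prob_evidence_if_true * prior / prob_evidence

credence = 0.50                        # initial credence in some hypothesis
credence = update(credence, 0.9, 0.2)  # evidence far likelier if hypothesis is true
print(round(credence, 3))              # 0.818: support raises justified credence
credence = update(credence, 0.1, 0.7)  # evidence that tells against the hypothesis
print(round(credence, 3))              # 0.391: credence drops accordingly
```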
We should note, though, that much of the empirical evidence supporting your beliefs
doesn’t come directly from your own sensory experience. Empirical evidence sometimes
comes from other people reporting on their experiences. Only the physicists carrying
out an experiment on subatomic particles will have sense experiences that provide evi-
dence in support of the belief that certain subatomic particles—say, quarks—exist. Other
people have only indirect access to that empirical evidence in the form of the physicists’
reports about their sense experiences. Furthermore, sense experience sometimes confirms

15031-1864-Fullbook.indb 23 6/24/2018 7:38:26 AM


24 What Is Science?

a scientific claim only indirectly. Even the physicists who study quarks haven’t directly
observed quarks. Instead, they have made predictions based on the idea that quarks exist,
and those predictions have been supported by empirical evidence collected in carefully
controlled experimental conditions.

Box 1.2 Empiricism and Rationalism in Philosophy


Historically, there has been philosophical disagreement about the extent to which
knowledge about the world is dependent on sense experience. Rationalists like René
Descartes (1596–1650) and Gottfried Wilhelm Leibniz (1646–1716) believed that
some genuine knowledge about the world is independent of sense experience and
can be gained via pure reasoning. Mathematical knowledge is sometimes used as
an example. In contrast, empiricists like John Locke (1632–1704) and David Hume
(1711–1776) believed that experience is our only way to gain knowledge about
the world. In more recent discussions in science and philosophy, the terms empiri-
cism and rationalism have been used to refer to the generic views, respectively, that
experience is fundamentally important to knowledge and justification and that human
reasoning is the basis of knowledge and should be the basis of beliefs.

In addition to science’s commitments to empirical investigation and to evidential-


ism, we should also note that scientists leave open the possibility that their ideas are
mistaken—even their most cherished or most certain beliefs. No scientific claim, no
natural explanation of a natural phenomenon, is ever taken to be beyond all doubt.
Karl Popper was a philosopher who studied science in the early 20th century. He was
troubled by the problem of separating science from pseudoscience. It occurred to him
that both scientific ideas and pseudoscientific ideas can be supported with evidence, but
only scientific ideas are tested against evidence that might refute them (Popper, 1963).
Based on this insight, he developed the philosophical theory of falsificationism, which
states that scientific reasoning proceeds by attempting to disprove ideas rather than to
prove them right—that is, by advancing ‘bold and risky conjectures’ and then trying to
falsify or refute them.
The idea of falsificationism has been very influential among scientists, but it remains
controversial for a number of reasons. We’ll consider this debate more in Chapter 4. For
now, we’ll sound a few quick cautionary notes about falsificationism as a view of science
before focusing on what seems accurate about it. First, the relationship between empirical
evidence and a scientific theory can be complicated, so it is sometimes hard to say when
the evidence disproves an idea. Second, trying to prove central ideas false again and again
would limit scientific progress. Sometimes scientists accept a theory or a finding and move
on to developing it or exploring its consequences. It seems like a stretch to claim that
scientists are really always aiming to prove their theories false!
Box 1.3 Evidence, Evidentialism, and String Theory

Evidence is information that plays the role of making a difference to what one is justified in believing. Evidentialism implies that one's beliefs should be backed by evidence and that a belief's justification is proportional to its evidential support. Scientists use empirical evidence to test their theories. But it can be very hard to obtain empirical evidence regarding some scientific theories. String theory, for example, is a theory in physics that is currently detached from empirical evidence. This theory says that the fundamental objects in the world are strings, which are very tiny, extended, one-dimensional objects that cannot be empirically detected with present-day instruments. Despite the lack of empirical evidence, string theorists hold strong belief in this theory.
Are they being unreasonable? Maybe not—especially if not all evidence is empirical evidence. And, in fact, string theorists justify their belief by appealing to non-empirical evidence. They emphasize the unifying and explanatory power of their theory. String theory would unify quantum mechanics with general relativity theory, providing an integrated explanation of phenomena at a microscopic scale and at a cosmic scale. String theorists also claim that there simply are no alternatives to string theory; it is the only candidate for a 'final theory of everything'. While these and other non-empirical considerations are routinely used to evaluate scientific theories, most scientists agree that the degree to which one is justified to believe any theory, including string theory, ultimately depends on how well the theory is supported by empirical evidence.

But two key elements of falsificationism do seem to accurately describe scientific reasoning. First, any scientific claim should be falsifiable. This means that one should be able to describe what kind of empirical evidence would, if found, show that the claim is wrong. This is required for scientific claims to be subject to empirical evidence. Notice
that true claims can still be falsifiable—you can describe what kind of evidence would
prove them wrong; it’s just that, because they are true, you will never actually find such
evidence. Even for false claims, scientists may never be in the right circumstances to obtain
falsifying evidence. But for any scientific claims—any bold and risky conjectures—it should
at least be possible to say what falsifying evidence would look like, even if we aren’t in
the position to get such evidence or even if the evidence does not exist (because the
claim is true). Falsifiable claims enable science to be based on empirical evidence and to
reject ideas when the evidence warrants doing so.
Second, science requires honesty when evidence seems to go against a claim or theory.
When scientists discover apparently falsifying evidence, they should begin to doubt the
idea under investigation. In general, we humans try really hard to hold on to our existing
beliefs, even when they are challenged. Scientists are no different. But the norms of good
science obligate them to doubt any scientific claims—even ones they really like or thought
were really promising—in the face of evidence that challenges those claims. It is part of
the very idea of science that any claim or theory should be abandoned when the pre-
ponderance of evidence suggests it’s wrong. We might call this openness to falsification.
To summarize, falsificationism implies that scientists are always earnestly trying to falsify
their scientific theories, even and especially the ones they are the most certain about. This
is up for debate. But it does seem like all scientific claims should be falsifiable, at least in
principle, and that scientists should be open to the possibility, at least in principle, that any claim or theory will need to be given up if sufficient evidence is found that goes against it. This is depicted in Figure 1.7 as a process of conjecture and attempted refutation.

FIGURE 1.7 (a) Schematic flowchart of simple falsificationism; (b) Karl Popper
Let us briefly mention two other candidates for hallmark methods of science. Much
of science makes use of mathematical techniques ranging from statistics to linear algebra
and geometry. This is another distinctive characteristic of science. Quantitative analysis,
or the use of mathematical techniques to measure or investigate phenomena, is found
in most science. Not all science employs numbers, however. So to say that quantitative
analysis is a hallmark of science is not to say that qualitative analysis, or the investiga-
tion of phenomena without using mathematical techniques, is not. For example, social
scientists often rely on in-depth interviews, focus groups, and other probative techniques
that don’t involve any mathematics.
Finally, another method distinctive of science is found in its social and institutional
structure. Science relies on communities of many people working together on related
projects but also with different ideas, techniques, aims, and values. Scientists are in some
ways always collaborating; teams of scientists work together on research projects, and all
research is based on the findings of other scientists’ work. In other ways, scientists are
always competing with one another, for example, to make a discovery first, to get their
research projects funded, and to show that one’s idea is better supported by the evidence
than an opposing idea. This social aspect of science is one of its most salient characteris-
tics. This social and institutional structure also relates to science’s role in society, which
we’ll explore in Chapter 8.

The Nature of Science


We have discussed many distinctive features of science. These include aiming to generate
knowledge, a basis in the Scientific Revolution, naturalism, empirical investigation, evi-
dentialism, falsifiability and openness to falsification, the use of mathematics, and social
structure. Some people have advocated one or another of these features as the best way
to define science. Others have suggested these different features together comprise a list
of hallmark ingredients of science. We think such a list is the most promising approach.
So we shall characterize science as an inclusive social project of developing natural expla-
nations of natural phenomena. These explanations are evaluated in the light of empirical
evidence, and should be subject to additional open criticism, testing, refinement, or even
rejection. Science regularly, but not always, employs mathematics in both the formulation
and evaluation of its explanations.
Consider how this characterization of science relates to our earlier example of climate
change. Because science is naturalistic, it is limited to natural explanations of natural
phenomena in the way we described earlier. The warming of the Earth’s climate is a
natural phenomenon, subject to empirical investigation. All scientific claims must be
testable, and potentially falsifiable, with the use of empirical evidence. Claims such as
that the concentration of atmospheric greenhouse gases has increased since the Industrial
Revolution and that the global sea level rose about 17 centimeters (6.7 inches) in the
last century are testable and falsifiable. These claims have been thoroughly tested and
not falsified; they are accepted by the scientific community only because there is strong
evidence in their favor. Scientists gather evidence with a wide variety of tools and often
by using quantitative methods. As new evidence becomes available, scientific hypotheses
are corroborated, revised, corrected, or rejected. All of this is true of climate change
research, which involves a number of different fields of science and techniques, and our
understanding of climate change and predictions of its effects are always being adjusted
and fine-tuned. Scientific hypotheses are open to criticism and correction by a network
of researchers embedded in the social and institutional structures that regulate scientific
practice. Climate change research involves numerous scientists and institutions working
in tandem and also regularly with different hypotheses in competition with one another.
The basic idea of anthropogenic climate change has persisted because no challenges to
the idea or to the research supporting it have been successful.
Let’s return now to the demarcation question and, in particular, the problem of dis-
tinguishing science from pseudoscience. The characterization of science developed here
can provide a kind of checklist for assessing whether some activity qualifies as scientific,
as pictured in Table 1.1.

TABLE 1.1 Checklist for evaluating whether an idea or project qualifies as scientific

A scientific activity or project:
✓ Aims to provide natural explanations of natural phenomena (naturalism)
✓ Puts forward ideas that can be tested with empirical evidence (empirical investigation, falsifiability)
✓ Updates ideas based on available evidence (evidentialism)
✓ Would abandon any idea that was thoroughly refuted (openness to falsification)
✓ Employs mathematical tools appropriately when they are useful (mathematical techniques)
✓ Involves the broader scientific community (social and institutional structure)

Here’s an obvious contrast with science as we have defined it: researchers in literature
do not collect measurements or other similar forms of evidence to test hypotheses about
the literary value of a piece of written work. Disagreements about the literary value
of, say, Dante’s most famous work, La Divina Commedia, cannot be settled by running
experiments. By reading this work, you can learn about 13th- and 14th-century social
life in Italy and about moral and theological views in Europe. But the literary work itself
is a work of fiction, not intended to directly provide natural explanations of features of
the natural world.
Now consider astrology, a canonical example of pseudoscience introduced earlier. The
primary claims made in astrology, such as horoscope predictions, are not designed to be
falsifiable, and many are even designed to be unfalsifiable. They are vague in ways that
allow many different interpretations; so for any interpretation that is wrong, another
can be offered in its place. Further, the systems of horoscopes used by astrologists are
inconsistent with well-understood basic theories of biology, physics, and psychology. This
violates the expectation of the collaborative exchange of ideas among scientists.
Astrology may be a harmless fad, with negative consequences largely confined to mis-
spent leisure time and money. Other pseudoscientific projects are much more dangerous.
Denials of anthropogenic climate change, for example, can be no less pseudoscientific than
astrology, and they have contributed to a lack of political will to address climate change—a
failure that may well have catastrophic consequences. Generally, the prominent climate-
change deniers have no genuine interest in engaging with the science. Their project is
not the earnest and disinterested search for truth, wherever it leads, but instead one of
shielding their cultural or political values by introducing doubt, distraction, and bluster
and lobbing personal attacks (Oreskes & Conway, 2010). Their denial of climate change
is not designed to be falsifiable; no amount or kind of evidence will change their minds.
Indeed, some climate-change deniers have even rejected the idea that science is a trust-
worthy source of knowledge in order to hold fast to their commitment against the idea
of climate change. Climate-change deniers also violate the expectation of collaborative
and competitive exchange among scientists, insofar as they neither produce hypotheses
and evidence for other scientists to evaluate nor acknowledge the vast empirical evidence
that supports the theory of anthropogenic climate change.
Anti-vaccination advocacy is another example of pseudoscience with pernicious effects.
One popular anti-vaccination argument is that vaccines increase the risk of autism. But,
as we will discuss in Chapter 7, all vaccines have been subject to incredibly extensive
testing for safety, and those tests have demonstrated conclusively that there is no causal
connection between vaccination regimes and the incidence of disorders like autism. This
conclusion of safety is scientific; it is based on evidence, is open to falsification, and
would be rejected if sufficient evidence against it were found. But existing research is
so extensive and compelling that the possibility of newfound disconfirming evidence is
virtually nonexistent. Nonetheless, propaganda outlets and anti-vaccination groups peddle
misinformation, trying to induce doubt with hearsay and uncritical stories of children
who were diagnosed with autism after vaccination. (This does regularly happen, for the
simple reason that vaccination regimes and many symptoms of autism tend to emerge in
the same stage of early childhood.)
Yet another example of pseudoscience with pernicious effects is creationism and intel-
ligent design. In the United States, for more than 50 years, creationism has posed under
the guise of 'creation science' as an alternative to evolution as a theory of the origins of
life. In 1987, the US Supreme Court ruled in Edwards v. Aguillard that ‘creation science’
was not actually scientific but a particular religious belief. The verdict showed that science
has certain features (guided by natural law, explanatory, testable, falsifiable, tentative, and
so on) and that creation science failed on all counts. In response, creationists coined the
term intelligent design to describe the idea that an intelligent creator is responsible for
life on Earth. This was an attempt to avoid the religious connection of creationism by
avoiding explicit reference to gods. In 2005, a US federal district court ruled in Kitzmiller v. Dover Area School District that intelligent design also fails to qualify as scientific.
The basic idea of both creationism and intelligent design is that life forms are so complex
that they couldn’t possibly have come about without the help of an intelligent designer (such
as the Judeo-Christian God). It’s difficult to see how this claim of the necessity of an intel-
ligent designer could be tested (or falsified). But this idea has inspired some claims against
evolutionary explanations of various features of organisms that are testable. These claims have
been tested with evidence from biology and related fields of science, and they have been
thoroughly refuted. Notice that this does not mean evolutionary theory has proven there is
no god. Rather, what has been shown is that science can provide—and has provided—natural
explanations of the complex life forms that exist. Despite this scientific success, intelligent
design advocates persist in promoting the idea of shortcomings in evolutionary theory, with-
out engaging with the existing evidence against their view or even indicating what evidence
would weigh in favor of evolutionary theory and against intelligent design.
Contrast these examples of pseudoscience—astrology, climate-change denial, anti-vac-
cination advocacy, creationism, and intelligent design—with climate science, our main
example of science in this chapter. As we have seen, evidence supporting the theory of
anthropogenic climate change and informing claims about the effects of climate change
comes from many different sources. As the Earth’s climate includes the oceans, the wind,
the biosphere, the atmosphere, glaciers, and clouds, researchers can tap into sources of
evidence like the rise of sea levels, the warming of the oceans, the shrinking of ice sheets,
the glacial retreat, and the increased frequency of extreme weather conditions. Gathering
this kind of evidence and assessing the magnitude of climate change involve sophisticated instruments, complex mathematical models, and the expertise of scientists from a variety of fields.
Multiple studies published in peer-reviewed scientific journals independently confirm
that human activities have contributed to glacier retreat and climate-warming trends
over the past century. In addition, most of the leading scientific organizations worldwide
endorse this conclusion. The Intergovernmental Panel on Climate Change (IPCC), for
example, issued a public statement in 2014 explaining that the evidence for the warming
of the Earth’s climate systems is ‘unequivocal’. The panel wrote:

[S]ince the 1950s, many of the observed changes are unprecedented over decades
to millennia. The atmosphere and ocean have warmed, the amounts of snow and
ice have diminished, and sea level has risen … Human influence on the climate
system is clear, and recent anthropogenic emissions of greenhouse gases are the
highest in history. Recent climate changes have had widespread impacts on human
and natural systems.
(IPCC, 2014)

EXERCISES

1.9 Choose one scientific development from the Persian Golden Age or the Scientific
Revolution. Describe how that development constituted progress in the subject mat-
ter of science and in the methods of science.
1.10 Order the following disciplines from most scientific to least scientific, consulting
the discussion of defining science and the checklist for science in Table 1.1: astrol-
ogy, economics, cinematic theory, cultural anthropology, social work, paleontology,
criminology. (You might need to first investigate what some of these disciplines are.)
For each, briefly explain why you ranked it as you did, making reference to the
hallmark features of science.
1.11 Describe how the history, subject matter, and methods of science are each relevant
to the nature of science.
1.12 Outline the specific elements of science’s history, subject matter, and methods that
relate to hallmark features of science. Rate each of these on a scale of 1 to 5, where 1
is the least important to the nature of science and 5 is the most important. Choose one
feature you rated ‘1’ and one you rated ‘5’, and say why you gave each this rating.
1.13 Define pseudoscience in your own words. Then, choose one of the examples of pseudo-
science from this section and evaluate it using the checklist of science. Describe how it is
similar to science and how it is different. Can you identify any features of the example
you’ve chosen that seem to be intended to appear more like science than they are?
1.14 Based on the information we have provided in this section, evaluate intelligent
design against the checklist for science. Assign it a letter grade, where A+ is fully
scientific and F bears no resemblance to science. Defend the grade you’ve assigned
with reference to the checklist.
1.15 Enter the phrase intelligent design into an internet search engine. Find and consult at
least one site that endorses intelligent design and at least one site that is critical of the
idea that intelligent design is scientific. (a) Evaluate the case presented by each side,
taking into account the checklist for science when it’s relevant. Describe your findings
in writing. (b) Say what, if any, differences you identify between the sources—that is,
between the two websites—and whether and how those differences matter to the author-
ity of these sources on the question of whether intelligent design is a scientific theory.
1.16 Why must science be limited to the study of natural phenomena? Why must it give
only natural explanations? Can you think of any scientific projects that don’t seem
to satisfy these requirements? If so, describe one or more such projects and say why
they might not be naturalistic. If not, describe a non-scientific idea that seems like it
is not naturalistic and say why.
1.17 Mythology and science are generally understood to be very different from one an-
other. And yet early science had its origins in, and then grew out of, mythology, and
both myths and scientific theories provide explanations of the natural and social
phenomena observed in the world around us.
a. Look up three creation myths from different cultures and historical periods—that is,
look up myths of how the world began and how people first came to inhabit it.
b. Identify similarities and differences across the three myths.
c. Describe similarities between the creation myths and scientific theories of
human origin. In particular, identify potential similarities between the kind
of methods and evidence involved in devising creation myths and in building
scientific theories of human origin.
d. Describe differences between the creation myths and scientific theories of
human origin. In particular, identify dissimilarities between the kind of methods
and evidence involved in devising creation myths and in building scientific
theories of human origin. How do these dissimilarities make a difference to
what you’re justified to believe about human origin?
1.18 It was discovered in the 19th century that the planet Mercury was not following
the orbit predicted by Newton’s theory of gravity. When this happened, Newton’s
theory was not considered falsified. Instead, it was hypothesized that this anomaly
was the result of another planet, named Vulcan, orbiting between Mercury and the
Sun. Despite a systematic search, Vulcan was never found. The anomalies exhibited
by Mercury’s orbit were explained only decades later, in 1915, by Albert Einstein’s
theory of general relativity.
a. Why do you think scientists initially refused to consider Newton’s theory falsified?
b. Was this a failure of science? Should the scientists have given up Newton’s
theory sooner? Why or why not?
c. Does this mean Newton’s theory of gravity was not falsifiable? Why or why not?

1.3 RECIPES FOR SCIENCE

After reading this section, you should be able to do the following:

• Explain why there is not a single thing we can call ‘the Scientific Method’
• Name two general flaws in human reasoning that science is designed to counteract,
and give examples of their influence
• Describe five features of scientists and of the scientific community that are important
to the trustworthiness of science
• Describe each of the three steps found in most recipes for science and why each is
a challenge

The Scientific Method Is a Myth


We now have a working definition of science and plenty of examples of what is, and is
not, properly scientific. Our characterization of science ended up relying heavily on meth-
ods common in science, such as evidence-gathering, mathematics, openness to criticism,
and collaboration. In this section, we focus further on the methods of science. This is
an important part of figuring out what makes science so good at producing trustworthy
knowledge about our world. This discussion of methods—or recipes for science—will set
the stage for all the topics we will cover in the rest of the book.
Spoiler alert: as with our characterization of science, we’re going to start by telling
you the answer isn’t a simple one. In some science class along the way, perhaps in high
school, you probably learned about the scientific method. But interpreted literally, the
idea that science always uses the scientific method is a myth. The American Association for
the Advancement of Science (AAAS) put the point this way: ‘[The scientific method] is
often misrepresented as a fixed sequence of steps’, when instead, it is ‘a highly variable
and creative process’ (AAAS, 2001, p. 18).

Some of the most important scientific breakthroughs had decidedly unscientific-seeming
origins. For example, there was no real method by which German chemist Friedrich August
Kekulé (1829–1896) discovered that the benzene molecule was structured like a ring; he
just had a daydream of a snake biting its tail. (This daydream, though, came after Kekulé
had been studying chemistry and the nature of carbon-carbon bonds for years.) Similarly,
the idea that natural selection is the mechanism of evolutionary change occurred to the
British naturalist Alfred Russel Wallace (1823–1913) during a feverish attack of malaria
while traveling in Indonesia in 1858—or so he wrote in his autobiography. Not only is
there no real method for at least some crucial scientific discoveries; there are also many
differences in how, and the degree to which, scientific claims are tested by empirical evidence.
This is why, in this book, we talk of recipes for science. Think of culinary recipes. They
have several components like the name and origin of the dish, the ingredients and their
quantities and proportions, cooking times, and the necessary equipment to make the
dish. Following the preparation steps and techniques involved in a culinary recipe doesn’t
guarantee a delicious dish. Often, knowledge of ingredients, adapting the recipe to your
circumstances, and even collaboration with another cook are also required. Similar to
culinary recipes, recipes for good science have several components, involve a wide array
of techniques and instruments, and rely on others’ expertise via collaboration or knowl-
edge sharing. Similar to culinary recipes, there is no single set of mechanical instructions
and step-by-step procedures that guarantees good science. Just like great cooking, good
science is a highly variable and creative process. It’s also often messy.

Box 1.4 Normative Versus Descriptive Claims in Science


Some English-speakers say ‘I don’t know nothing’, even though their teachers have
lectured them that double negation isn’t good English. In this case, the teachers are
making a normative claim—that is, they are expressing a value judgment. Noting
that English-speakers sometimes use double negation is, in contrast, a descriptive
claim; it’s simply describing what occurs.
A normative claim says how things ought to be. In contrast, a descriptive claim
attempts to describe how things in fact are, without making any value judgments.
Descriptive and normative considerations are both part of science. For example,
Nobel Prize winner Daniel Kahneman and his collaborator Amos Tversky used, in their
studies of the psychology of reasoning and decision-making, a normative standard
that specifies what choices rational agents ought to make in order to satisfy
their own desires. Relying on this normative theory, Tversky and Kahneman then
studied how people actually make decisions, constructing a descriptive theory of
decision making under uncertainty.
Just as science involves both normative and descriptive claims, both kinds of
claims can be made about science. One can simply attempt to characterize what
science is—that is, how scientists in fact develop theories and test claims—or one
can attempt to say how science should work, that is, what features science should
have for it to succeed at generating knowledge. We will be doing a little bit of
both in this book.

The Flaws in Human Reasoning


Let’s talk a little about the purpose of science before we get to the methods, or, in keeping
with our culinary metaphor, the recipes. Why is science needed to give us knowledge about
the world in the first place, beyond just our ordinary human powers of observation? We
humans are predisposed to investigate our world using our senses from our first days of
infancy, which is also central to how science works. But we humans are also predisposed
to some pretty serious flaws in how we gather evidence and how we reason. Science is
a valuable route to knowledge about the world because it incorporates ways to protect
against those flaws. We’ll first introduce some significant flaws in human reasoning; then
we’ll survey how science counteracts them.
It is only natural for people to initially favor some ideas over others. We can then use
our experiences in the world, investigation of existing knowledge, and hard thinking to
make sure the ideas we favor are, in fact, good ideas. The problem is we also seek out and
interpret information in ways that fit with our favored ideas, and we avoid information
that challenges those ideas. This is a well-established feature of human reasoning called
confirmation bias, the tendency we all have to look for, interpret, and recall evidence in
ways that confirm and do not challenge our existing beliefs.
For example, when someone asks her friends if they like the restaurant she’s chosen, she
may say, ‘It’s good, isn’t it?’ Framing the question in this way promotes agreement with
the view she has of the restaurant—it’s a way of looking for confirming evidence. Similarly,
someone who’s skeptical about climate change may google ‘climate change doubt’ to learn
more, or may focus on what critics say and ignore what climate scientists say. Both of
these strategies tend to generate information only on the side of climate-change denial.
In one study, proponents and opponents of the death penalty both read the same dis-
cussion of the death penalty. The two sides interpreted the discussion totally differently;
each side thought it weighed more in favor of their own position (Lord, Ross, & Lepper,
1979). Likewise, a smoker may vividly remember the distant relative who lived to 100
and smoked a pack of cigarettes every day, while pushing from her mind other smokers
who fared less well. The tendency toward confirmation bias is general to all people, but it
tends to be stronger for politically and emotionally charged issues, such as climate change,
vaccination of children, and health issues.
Confirmation bias can involve looking only for ideas and evidence that support your
existing beliefs, cherry-picking which research to believe and which to ignore, holding
evidence against one’s views to a higher standard than evidence in favor of one’s views,
and more easily remembering supporting evidence than contrary evidence. Confirmation
bias is exhibited by everyone, and scientists are no exception.
Sometimes, scientists’ expectations or desires about the results of scientific research
end up leading to incorrect findings. One way in which this can happen is through the
observer-expectancy effect, when a scientist’s expectations lead her to unconsciously influ-
ence the behavior of experimental subjects. A famous example of this involved Clever
Hans, a horse who was thought to have sophisticated abilities, including performing
arithmetic calculations. Hans’s owner, Wilhelm von Osten, was a mathematics teacher,
horse trainer, and phrenologist. (Phrenology involved studying the shape of the skull as
an indicator of personality and mental abilities, and its status as a science has since been
discredited.) Hans was trained to recognize numerals from 1 to 9 and to tap his hooves
to indicate which numbers he recognized. Eventually von Osten had Hans tapping out
correct answers to questions like ‘What’s the number of 4s in 16?’


FIGURE 1.8 Clever Hans and Wilhelm von Osten

In 1891, von Osten traveled around Germany to exhibit his amazing horse. There was
such fanfare that the famous psychologist Carl Stumpf appointed a special commission
to provide critical scrutiny. In 1904, the commission concluded that Hans’s abilities were
legitimate. The horse was able to answer a great variety of questions on topics from simple
arithmetic to square roots, fractions, and decimals; units of time; musical scales; and the
value of coins. Hans could respond accurately even when von Osten wasn’t present.
The commission was wrong. Stumpf’s pupil Oskar Pfungst demonstrated that Clever
Hans was not actually performing the sophisticated mental calculations attributed to him
(Pfungst, 1911). Pfungst used blinders to vary whether Hans could see the questioner, and
he varied who played the role. Hans produced the correct answer even when von Osten
himself did not ask the questions, but Hans’s performance fell apart when either the
questioner did not know the answer or could not be seen by Clever Hans. In particular,
when Hans could not see the spectators and questioners, his ability to produce correct
answers fell dramatically from 89% to 6%. Further observations confirmed that Hans was
being unwittingly cued by his human audience. Questioners’ body language and facial
expressions became tauter as his tapping approached the correct answer, and then more
relaxed upon the final tap; this change prompted Hans to stop tapping.
Like von Osten and all the other people who posed questions to Clever Hans, our
expectations can affect how things play out, even when we don’t intend for this to hap-
pen. More generally, because of confirmation bias, our expectations can influence what
sources of evidence we seek, how we view the evidence we encounter, and how well
we remember evidence later. These problems increase with emotionally and politically
sensitive topics. All of this makes it hard for people—including scientists—to reason their
way to the right answers.


Norms of Investigators
An element of science’s great success in generating knowledge about our world is its fea-
tures that protect against or counteract the basic flaws in human reasoning. We’ll discuss
those features in two categories: first, norms of science that apply to individual scientists
and second, norms of science that apply to the scientific community. You can think of
these norms as rules or guidelines against which scientists’ actions can be deemed good
or bad, desirable or undesirable.
Because of science’s aim of producing knowledge, scientists are obligated and trained
to have a certain kind of integrity. Scientific integrity requires scientists to be sincere and
honest, and to avoid improper influence by others. Violations of these norms can under-
mine science’s ability to produce trustworthy knowledge.
Plagiarism is an obvious example of dishonesty. Plagiarism consists of presenting
somebody else’s ideas, scientific results, or words as one’s own work, intentionally or
unintentionally, by not giving proper credit. When plagiarism is discovered in science, it
is severely penalized, perhaps including a ban from publishing in peer-reviewed journals
or suspension or expulsion from one’s institution.
Faking data to support a desired conclusion is another egregious violation of scientific
integrity. In 2011, Diederik Stapel, a Dutch social psychologist, published a widely read
study in Science—one of the most prestigious scientific journals. The study presented evi-
dence supporting the dramatic conclusion that trash-filled environments lead people to
be more racist. But rather than earnestly collecting actual data, Stapel just made up the
data. When this was discovered, Stapel’s reputation shifted from that of a respected aca-
demic to a prominent example of fraud. All of his other publications were scrutinized, and
approximately 60 other papers were retracted for data fabrication. Other scientists have
also been forced out of science after their ethics violations were discovered, such as the
stem-cell researcher Hwang Woo-suk from Seoul National University and the Harvard evo-
lutionary biologist Marc Hauser. Some science journalists have helped increase awareness
of retraction due to issues like data fabrication by running blogs such as Retraction Watch.
But scientific integrity requires more than just not misattributing or misrepresenting
ideas and data. Scientists also ought to avoid conflicts of interest—that is, financial or
personal gains that may inappropriately influence scientific research, results, or publica-
tion. Scientists are obligated to disclose any potential conflicts of interests. The existence
of a potential conflict of interest does not necessarily lead to researcher bias or misinter-
pretation of data, but transparency about any potential conflicts of interest allows others
to better evaluate the possibility of improper influence. Conflicts of interest, especially
when research is funded by organizations with a financial stake in the findings, can result
in researchers intentionally or unintentionally altering the research they conduct, their
findings, or what they report in publications.
Clair Patterson, a geochemist at Caltech in California, became famous for definitively
calculating the age of the Earth (≈4.54 billion years) in the 1950s. He also led the cam-
paign to remove lead from gasoline in the 1960s and 1970s. Leaded gasoline contained
tetraethyl lead, which is extremely toxic—a single drop of pure tetraethyl lead can be
fatal—and had to be handled with utmost caution by its manufacturers. Because the
campaign against leaded gasoline threatened their profits, the fossil fuel industry—led
by the Ethyl Corporation—fought bitterly against Patterson. Among their tactics was to
procure a shill, Robert A. Kehoe, who was paid handsomely by the fossil fuel industry to
attest to the safety of leaded gasoline. Led by Patterson, honest science eventually carried
the day. But serious damage had already been done: generations of Americans suffered
from elevated lead levels in their blood through the end of the 20th century.
Kehoe later used his scientific credibility to endorse pollutants like Freon, undermining
scientific evidence showing their damage to the Earth’s ozone layer. He was later com-
missioned by DuPont, General Motors, and other companies to produce studies assert-
ing that dangerous carcinogens were safe. Ultimately, Kehoe’s efforts have been a model
for executives in a range of industries (tobacco, asbestos, pesticides, fracking, and so on)
for how to obstruct scientific efforts with misinformation. The ‘Kehoe approach’ is still
being deployed by the fossil fuel industry to evade evidence and undermine scientific
research on anthropogenic climate change.
One way to promote scientific integrity is holding scientists accountable for their work.
Scientists should be prepared to engage with others about the ideas, methods, and data
they use to support their scientific findings. In climate science, for example, scientists
should be able to answer questions about what kinds of uncertainties their models involve
and what kinds of evidence they have for the reliability of their findings. This is related
to a second norm: scientists should be open to criticisms of their work and to new ideas.
Remember that science is always in revision. We have said that scientific ideas should
be in principle falsifiable and that scientists should be open to the possibility that any
idea will turn out to be false. Similarly, scientists should at least sometimes be willing to
entertain ideas that might initially seem unlikely, and they should be open to criticism
of their ideas and methods.
A third norm governing scientists as individuals is ingenuity. This is a natural partner
to the norm of openness to new ideas. Science benefits from the development of lots of
interesting ideas, even ones that violate our preconceptions. It’s often impossible to tell at the outset
which ideas will prove to be promising. Many, even most, new ideas in science will turn out
to be false. But time and time again, science has gone in unexpected directions, and that
can’t happen without new, creative ideas about the world and about how to pursue science.
The Blackawton Bees project is one striking example of how ingenuity can guide scien-
tific research. In this project, 28 children between 8 and 10 years of age from Blackawton
in Devon, UK, conducted a collaborative scientific study on bumblebees’ visuospatial
abilities, supervised by an ophthalmologist and an educator. The children wondered how
bumblebees decide which flower to go to for food and whether bumblebees could learn
to recognize different flower shapes and colors. The children ingeniously brainstormed
about possible answers and creatively designed experiments to test their ideas. Their co-
authored findings were published as an original article in the scientific journal Biology
Letters. The article summarizes its discoveries as follows:

We discovered that bumble-bees can use a combination of color and spatial rela-
tionships in deciding which color of flower to forage from. We also discovered that
science is cool and fun because you get to do stuff that no one has ever done before.
(Blackawton et al., 2011, p. 168)

As the children involved in the bumblebee project can attest, scientific reasoning can and
should be ingenious, challenging, and creative.


Social Norms
No matter how many requirements are placed on scientists as individuals and no mat-
ter how good scientists get at satisfying those norms, those norms alone cannot fully protect
science against the flaws inherent in human reasoning. Individuals—even very bright
scientists—often aren’t aware of the flaws in their own thinking, and often aren’t in a
good position to fix them by themselves. Thus, another important form of protection against
flaws in reasoning involves requirements placed on the scientific community as a whole.
We’ve already mentioned the importance of the social organization of science. This
organization is especially salient when research involves collaborations among lots of
researchers, sometimes with different disciplinary backgrounds, working at different times
and in different physical locations, often using different kinds of complex instruments.
Climate science regularly involves radically collaborative endeavors of this sort. After all,
climates are extremely complex, interconnected systems; no single person and no single
field of expertise can alone produce knowledge of how the climate works or how it will
change. Instead, many different people perform specific tasks based on their expertise
and their available instruments. When these different tasks are added together, they lead
to knowledge that no one could have produced alone.
But even scientific research that is not so visibly collaborative relies on the communities
of science. Research is informed by previous results found by other scientists, papers are
reviewed for publication by other scientists, and other scientists decide whether and how
to react to any given published finding. In massively collaborative research and solitary
research alike, it is really scientific communities, instead of scientists as individuals, that
produce knowledge.
The good functioning of scientific communities depends on various social norms and
incentive structures. One primary social norm in science is trust. Scientists’ trust in one
another is the glue of scientific communities. For example, collaborative projects on
climate change involve scientists with a range of different expertise, including climatolo-
gists, ecologists, physicists, statisticians, and economists. None of these scientists alone
possesses comprehensive expertise to understand the full range of evidence that bears
on our understanding of anthropogenic climate change. So these scientists must rely on
each other and must trust one another’s scientific work. Individual members of the public
often don’t have the expertise to evaluate most scientific findings, and so they too must
trust the scientists who are experts on those topics.
But it’s also true that the work of scientists must be critically evaluated, by other
scientists and the public alike. For the public, the most straightforward way to evaluate
scientific work is to assess the quality of the arguments presented. This is not always
possible, however, as scientific information can be technical and difficult for non-experts
to understand. For this reason, it is also important to pay attention to the reputation of
the alleged expert. Scientists’ reputations are based on their track record of accomplish-
ments in their field, as judged by other scientists with similar research expertise. Scientists
critically evaluate one another’s work by deciding whether particular results warrant
publication, evaluating the strengths and weaknesses of scientific studies, and choosing
whether and how to respond to published findings.
One form that scientists’ healthy skepticism toward the work of others and their
openness to criticism take is the attempt to replicate others’ findings.


TABLE 1.2 Individual and social norms that protect against bias and flaws in reasoning

Individual norms      Social norms

Integrity             Trust
Openness              Skeptical evaluation
Ingenuity             Diversity

In replication, an experiment or study is carried out again in order to see whether the same results
are achieved. If successful, the replicated results further confirm the idea under inves-
tigation. If the results are not replicated, this raises doubts about the results of the
original work.
This balance of trust and skepticism among scientific communities also helps protect
against individual biases. Imagine that a scientist with legitimate expertise is paid by an
oil company as a consultant. The same scientist pursues research aiming to show that the
evidence for anthropogenic climate change is inconclusive. Such a scientist would have an
obvious conflict of interest; this should lead other scientists and members of the public to
be very cautious about trusting those findings. It doesn’t guarantee the scientist is wrong,
but it does raise questions about whether her judgment is clouded.
Communities of experts also protect against the undue influence of individual eco-
nomic, religious, and political values, and thus confirmation bias and expectancy bias, in
another way. Scientists are each susceptible to biases that make them see the world more
like what they expect it to be and how they hope it will be. But their expectations and
hopes are different from one another’s. So, in a scientific community, these biases tend to
balance each other out. The conclusions that different scientists all agree to are thus less
likely to have resulted from bias. Hence, if a large and diverse group of scientists agree
about some result, we should be more confident that the result is accurate.
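To see in miniature how opposing biases can wash out, consider the following toy simulation—a sketch of ours in Python, not a model from the text, with hypothetical names and numbers. Each simulated researcher estimates the same true quantity with a personal bias, and the community’s average lands near the truth:

    import random

    TRUE_VALUE = 10.0

    def estimate(bias, noise=1.0):
        # One researcher's estimate: truth + personal bias + random error.
        return TRUE_VALUE + bias + random.gauss(0, noise)

    # A diverse community: individual biases lean in different directions.
    biases = [random.uniform(-2, 2) for _ in range(100)]
    estimates = [estimate(b) for b in biases]
    consensus = sum(estimates) / len(estimates)
    print(round(consensus, 2))  # typically close to 10.0: the biases offset

If every researcher instead shared the same bias, no amount of averaging would recover the true value.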
This brings us to one last point about the social norms of science, on which we will
elaborate in Chapter 8. To adequately protect against individual bias and flaws in reason-
ing, scientific communities need to be diverse in any way potentially relevant to scien-
tists’ values and, thus, to flaws in their reasoning. The best science is done by scientists
of different genders, cultural backgrounds, ethnic and racial identities, and nationalities.
This kind of diversity benefits science by guarding against
any individual biases. The individual and social norms of science we have identified are
summarized in Table 1.2.

Methods in Science
The third and final topic about how science protects against flaws in reasoning is also the
most important. The definition of science developed here stressed the methods used in
science. And it’s these scientific methods that bear most of the responsibility for helping
science overcome the flaws and limits of individual scientists’ reasoning.
We have suggested that, despite what you might have heard in high school science
classes, there’s no set scientific method: science proceeds in myriad ways. But this doesn’t
mean there are no methods to science. In fact, as the title of this book suggests, the oppo-
site is true: science may have no set method, but it does proceed according to familiar
recipes. It is the purpose of this book to outline some of the main recipes for science.
At the heart of many of those recipes, there is a pattern that includes something like
these three steps: the formulation of hypotheses, the development of expectations based
on hypotheses, and testing expectations against observations. One common way this plays
out is for scientists to formulate hypotheses about the world—what was described earlier
as bold and risky conjectures—and then use those hypotheses to generate specific expec-
tations regarding their experiences. If their observations conform to those expectations,
their hypotheses are confirmed. If not, they return to the drawing board.
These steps can occur in different orders, and they happen again and again in vari-
ous combinations. Different methods may also be involved in each of the three steps.
For example, sometimes, scientists have a specific hypothesis to investigate; other times,
research is more exploratory and open ended. Sometimes, hypotheses have obvious empir-
ical implications; other times, scientists need to use statistics to develop their expectations.
Sometimes, scientists design experiments to test their expectations; other times, they
develop models. Other scientific work simply isn’t described well by this trio of hypoth-
esis, expectations, and observations, such as the theoretical work behind string theory.
But these three steps are integral to the production of scientific knowledge. They are
the basic ingredients that, with tremendous variation, occur in many of the recipes for
science. With the help of these steps, scientific theories, laws, models, and other advances
are developed and refined. And these advances, when successful, are the vehicles of our
scientific knowledge.
We’ll conclude this chapter by talking about each of these three steps in greater detail.
Throughout the rest of the book, we’ll regularly refer back to these three ingredients and
the different ways they factor into the recipes for science.

Hypotheses
As we have seen, empirical investigation is how we learn about our world. Scientists
make observations to try to figure out what’s out there, why things are the way they
are, how things change, and so forth. But simple observations can’t accomplish that task
by themselves. A second ingredient is needed: theoretical claims. Theoretical claims are
claims made about entities, properties, or occurrences that are not directly observable.
These things might be excessively large or small, like the whole universe or quarks; they
might be too distant in time or space to observe, like the first forms of life on Earth
or black holes; or they might not be directly observable at all, like what physicists call
‘dark matter’.
As an example of a theoretical claim, consider a claim about all of something of some
kind, like the claim that all salt dissolves in water. You might have seen plenty of salt
dissolving in water, but you will never be able to witness all of the salt that exists dis-
solving in water. So direct observation can’t guarantee that all salt dissolves in water. We
have plenty of evidence that this is so, but the claim is theoretical because it goes beyond
what we can directly observe.
Science is centrally concerned with a special kind of claim called a hypothesis. A
hypothesis is a conjectural statement based on limited data—a guess about what the
world is like, which is not (yet) backed by sufficient, or perhaps any, evidence. Scientists
do not yet know whether any given hypothesis is true or false; when there is sufficient
evidence in favor of some hypothesis, it graduates from that category. Formulating a
hypothesis often requires some imagination; if you could observe something we can’t—if
you could witness the beginning of life on Earth, fly into a black hole, or see all the salt
in the world—what would you find?
Scientists might formulate a hypothesis before any observations have been made, just
with the use of their imagination. But often initial observations, other hypotheses, or
background knowledge about related phenomena help inspire new hypotheses. Before
scientists knew about the properties of potassium chloride, they’d seen that table salt—sodium
chloride—dissolves in water. This informed their expectations for potassium chloride, a
similar compound. Scientists’ hypotheses about the first life forms are shaped by what
they know about organisms, existing and extinct, and how the Earth has changed over
geologic time.
Scientists can have different levels of confidence in different hypotheses. If a hypothesis
is informed by lots of experience with similar objects or significant background knowl-
edge of related phenomena, scientists might be much more confident in it than if it is a
random guess. But, by their very nature, hypotheses are guesses. This is why hypotheses
must be tested.

Expectations
Learning whether a hypothesis is true is often more circuitous than just making direct
observations. A second ingredient is usually needed to test hypotheses; this is developing
expectations based on hypotheses. Expectations are conjectural claims about observable
phenomena based on some hypothesis. These claims are conjectures since they go beyond
what scientists have observed so far, but, unlike hypotheses, their truth or falsity can be
discerned directly from the right observations. Indeed, expectations are claims about what
scientists expect to observe if a given hypothesis is true.
Expectations do not concern just any potential observations, but observations that sci-
entists anticipate being able to make. We could say what we would expect to experience
if we were in the middle of a black hole (given some hypothesis about black holes), but
since we don’t expect to ever be making observations from inside a black hole, those
expectations are useless. Instead, expectations based on a hypothesis regarding black holes
should be about what scientists would expect to see through a telescope, patterns of x-rays
and gamma rays detected from Earth or from a satellite, and so on.
Depending on the nature of a hypothesis, developing expectations in light of the
hypothesis can be relatively straightforward or incredibly complicated. On one extreme,
the hypothesis that all salt dissolves in water leads directly to an expectation: any given
sample of salt will dissolve when placed in water. But even then, there are some other
conditions to stipulate. Should salt dissolve when placed on a chunk of frozen water (that
is, ice)? What if some salt is already dissolved in the water? Should we still expect the
sample of salt to dissolve?
Expectations regarding black holes are much more complicated to develop. Black
holes are objects so massive, so dense, that not even light can escape them. No one will ever
take measurements on the edge of a black hole. Even if someone could get there and sur-
vive to take measurements, the measurements couldn’t be recorded in a way that escaped
the black hole. Nor does anyone expect to see a black hole through a telescope, since that
would require light to leave the black hole and travel to our telescope. Hypotheses about
black holes must thus be investigated entirely by formulating expectations regarding their
effects on other objects, objects that can emit electromagnetic radiation (and thus be seen
through telescopes) and give off measurable x-rays or gamma rays.
No matter whether deriving expectations is relatively straightforward or incredibly
complicated, this is an important and nontrivial step of scientific work. Expectations set
scientists up to make observations that can provide evidence for or against the truth of
a hypothesis. Deriving expectations thus serves as a bridge between conjectural claims
(hypotheses) and immediate observational claims (data).

Observations
Nearly all science fundamentally depends on observations. This is because, as we have
discussed, scientific inquiry is ultimately an empirical inquiry. It’s not enough to think up
interesting ideas about how the world might work; those interesting ideas must also be
evaluated by how well they fit with our observations of the world. Observations include
any information gained from your senses—not only what you see, but also what you hear,
smell, and touch and sense in any other way you can experience the world.
Your observations belong only to you. If we are on a hike together, we might both
hear a rattling sound coming from behind a boulder. But each of us only has access to
our own experience of the sound. You can’t compare my observation to yours. Data are
different. Data are public records produced by observation or by some measuring device.
(The singular form in Latin is datum.) Observations are important because they are your
only way to directly access the world. Data based on observations are important because
they allow us to record and compare our observations.
Our powers of observation are ultimately capacities to detect physical information
and then—literally—to incorporate it into our bodies. For example, when one hears a
serpentine rattle from behind a boulder, an acoustic waveform with a number of distinct
physical features enters into the ear canal and causes the tympanic membrane to vibrate.
The vibrations are ‘forwarded’ to the cochlea via the bones of the middle ear, where
the shearing force of the tectorial membrane mechanically moves the hair cells of the basilar
membrane. Hair cell movements are then transduced into electrical signals, which leave
the organ of Corti and travel via the main auditory nerve to the brain. The embodied
brain then has to interpret that signal (as a serpentine rattle rather than a baby rattle,
for instance). This is what it takes for humans to hear a serpentine rattle from behind a
boulder. A similar transformation occurs when you see that a test strip of litmus paper
has turned blue, feel the heat produced by a chemical reaction, and so forth.
Observation isn’t passive. We can move our heads to see different things and relo-
cate our bodies to different places where we can hear different things. We can also use
observations from multiple senses together. If you’re wondering about that rattling sound
from behind the boulder, you can walk around to the other side to see whether there’s a
rattlesnake there. Besides changing our position and using multiple senses to enhance our
observations, we can also change the world around us to create opportunities for different
observations. Crushing a leaf lets you better smell whether it’s sage or mint. When we do
such things, we have begun to do simple experiments on the world around us. We will
talk about experiments in Chapter 2.
Humans have also found many ways to use tools to enhance our powers of observation.
Light can be refracted with mirrors, prisms, and lenses to extend the reach of vision. We
now can see not just through our eyes alone but also through our eyes aided by telescopes,
microscopes, and other devices. To help us hear beyond our ears’ capabilities, we have
developed microphones, stethoscopes, and so on. These technological enhancements range
from observational correctives like eyeglasses and simple sensory aids like microscopes to
much more complex technology with highly specific purposes, like an fMRI machine to
show brain activity and the Large Hadron Collider, which uses superconducting magnets
to cause streams of high-energy particles to collide in a detectable way. Such enhance-
ments have allowed humans to generate what we might call super-observational access
to what would otherwise be undetectable to us, given our sensory modalities.
Making observations, collecting data, is at the heart of science’s ability to generate
knowledge of our world. But observations aren’t always independent from the ideas about
the world we already have. Changes in what we believe to be true can have a significant
impact on what we observe. For instance, when we observe the Sun at the horizon, what
we seem to see is the Sun at one point on its path across the sky. Geocentrism organizes
this and similar observations into an easily understood pattern, and those observations con-
firm the geocentric conception. But from the perspective of the heliocentric conception
of the solar system—with your head slightly turned to the side, as it were—the Sun at the
horizon and the other planetary bodies that appear constitute a different observation. See
Figure 1.9.
Heliocentrism is a different perspective, and it may also create a different perceptual
experience, or observation—the Sun setting not because it moves below the horizon but
because your position on Earth rotates away from it. New ideas can sometimes have a
strong effect on what we think we see, on our very observations. Observations are crucial
to science, but they aren’t always the starting point, and they aren’t always decisive.

EXERCISES

1.19 Why do the authors suggest there’s no unitary scientific method? Evaluate that idea,
raising considerations in favor of it as well as considerations opposed to it.
1.20 Describe three types of influence confirmation bias has, and define the observer-
expectancy effect. Think of a novel example for each of the four (three types of influ-
ence of confirmation bias and observer-expectancy effect). Make sure it’s clear how
each example illustrates each idea.
1.21 Describe three kinds of scientific fraud or scientific misconduct, giving an example
of each. Explain how each example undermined science’s ability to produce trust-
worthy knowledge.
1.22 How should trust and skepticism be balanced in scientific communities, and why is this
important to science? How should trust and skepticism of the public toward scientific
findings be balanced, and why is this important for the public’s relationship to science?

FIGURE 1.9 Reorientation from geocentrism to heliocentrism


1.23 Choose one of the following, and invent a pseudoscientific theory about it. Feel free
to be creative!
a. The origin of the universe
b. The healing power of music
c. People’s handwriting
d. The change of organisms over time
Then, describe how the norms for scientists, the scientific community, and the meth-
ods of science could help guard against your made-up theory. Try to make your
answer comprehensive, involving all the main topics from this section.
1.24 Search the internet (news websites, magazines, and so forth) for a story or advertise-
ment about a scientific finding or a medical or health treatment that purports to be based
on science, and answer the following questions about it. Make sure you include the story
or advertisement when you submit your answers, as well as a link to it on the internet.
a. What is the source? Is the person or entity making the claims someone with
genuine expertise in what he or she is claiming?
b. Does it seem like there’s any conflict of interest? Why or why not?
c. Does the claim involve vague or ambiguous language?
d. Do the claims fit with other well-confirmed scientific theories?
e. What is the evidence cited in support of the claim?
f. Does this describe good science? Why or why not?
1.25 What is the difference between observations and data? What is important about obser-
vations in particular and why? What is important about data in particular and why?
1.26 Hypotheses, expectations, and observations are all important ingredients for most
science. Describe the importance of each, a typical way that the three ingredients
work together, and what they accomplish together.
1.27 Hypotheses, expectations, and observations are all important ingredients for most
science. Describe a difficulty with each, or circumstances in which it can be difficult.
1.28 Imagine you are a doctor in a large medical practice. The other doctors are consider-
ing introducing a homeopathic service for their patients. They ask you to prepare a
report summarizing the pros and cons of doing so. One of the other doctors, Dr. A,
is entirely dismissive of homeopathy on the grounds of the weakness of its scientific
basis; another doctor, Dr. B, has read a report of a study that she says shows that
homeopathy can outperform placebo and is inclined to be sympathetic. Yet another
doctor, Dr. C, has said that he doesn’t care about the evidence, so long as homeopa-
thy works and is not toxic. Write a 500- to 800-word report describing homeopathy
(you’ll probably have to do a bit of research), addressing each of the other doctors’
points of view, and recommending whether to introduce homeopathic service. You
should employ any of the concepts from this chapter that you find useful.

FURTHER READING

For more on the science of climate change, see the Intergovernmental Panel on Climate
Change. (2014). Fifth Assessment Report (AR5). Geneva: IPCC. Retrieved from www.
ipcc.ch/report/ar5/


For the latest data and information for stabilizing Earth’s atmosphere, climate, and living
environments, see CO2.earth: Retrieved from www.co2.earth/
For more on political influence used to cast doubt on climate change research and other
scientific findings, see Oreskes, N., & Conway, E. (2010). Merchants of doubt. New
York: Bloomsbury.
For more on the demarcation between science and pseudoscience, see Pigliucci, M., &
Boudry, M. (eds.) (2013). Philosophy of pseudoscience: Reconsidering the demarcation
problem. Chicago: University of Chicago Press.
For more on the Scientific Revolution, see Kuhn, T. S. (1957). The Copernican revolution:
Planetary astronomy in the development of Western thought. Cambridge: Harvard Uni-
versity Press. See also Shapin, S. (1996). The scientific revolution. Chicago: University
of Chicago Press.
For more on science in the Persian Golden Age and other periods around the world, see
the History of Science Society: Introduction to the history of science in non-Western tradi-
tions. Retrieved from https://hssonline.org/resources/teaching/teaching_nonwestern/
For a concise treatment of the illusion of explanatory depth, see Keil, F. C. (2003). Folk-
science: Coarse interpretations of a complex reality. Trends in Cognitive Sciences, 7(8),
368–373.
For the psychology of confirmation bias and bias more generally, see Nickerson, R. S.
(1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of Gen-
eral Psychology, 2(2), 175–220. See also Hahn, U. & Harris, A. J. (2014). What does
it mean to be biased: Motivated reasoning and rationality. The Psychology of Learning
and Motivation, 61, 41–102.
For more on how social norms and social structures influence scientific inquiry, see Mer-
ton, R. K. (1942). Science and technology in a democratic order. Journal of Legal and
Political Sociology, 1, 115–126. See also Boyer-Kassem, T., Mayo-Wilson, C., & Weis-
berg, M. (eds.) (2018). Scientific collaboration and collective knowledge. Oxford: Oxford
University Press.



CHAPTER 2

Experiments and Studies

2.1 EXPERIMENT: CONNECTING HYPOTHESES TO OBSERVATIONS

After reading this section, you should be able to do the following:

• Describe the role of experiments in testing hypotheses
• Identify the main features of an experiment in an example of scientific research
• Define extraneous variables and articulate why these must be controlled in an
experiment
• Describe the problem of underdetermination and how scientists deal with it
• Identify three other uses of experiments in science besides hypothesis-testing

How Experiments Contribute to Science


A central aim of science—perhaps the central aim—is to produce knowledge about
the world, which involves formulating natural explanations of natural phenomena.
Experimentation is one primary strategy used to achieve this aim. Many scientists per-
form experiments in order to test new hypotheses and to extend existing knowledge. So
let’s ask: what are experiments?
Recall from Chapter 1 the three common ingredients in recipes for science and the
relationships among them:

1. Hypotheses are used to generate expectations.
2. Expectations are compared with observations.
3. That comparison is used to develop, confirm, reject, or refine a hypothesis.

When it comes to testing hypotheses, experiments contribute to ingredient 2. Experiments
provide a structured way to make observations—that is, to collect data—and to compare
those observations with what we would expect to observe if the hypothesis under inves-
tigation were true. We will see later in this chapter and subsequent chapters that there
are different ways to collect data besides experiments and that experiments are not only
used to test hypotheses. But this is a good starting point.
Suppose you want to find out how the physical characteristics of plants and animals
influence the characteristics of their offspring—for example, how your height depends on
the heights of your parents or how the shape of a pea plant seed depends on the shapes
of the seeds of the parents of that plant. How could you investigate this? The scientist and
friar Gregor Mendel (1822–1884)—born in the Austrian Empire in what is now the Czech
Republic—investigated such questions by breeding pea plants. He fertilized some pea plants
with pollen from their own flowers and others with pollen from the flowers of plants with
different physical characteristics. In this way, Mendel controlled the physical characteris-
tics of the parent plants, such as their seed shape (smooth or wrinkled) and flower color
(purple or white). He could then observe what characteristics resulted for their offspring.
For example, if a pea plant with purple flowers (whose parents all had purple flowers)
is crossed with a pea plant with white flowers, the offspring will all have purple flowers.
Mendel’s selective fertilization of pea plants illustrates a key feature of experiments.
In an experiment, a researcher introduces specific changes to a system and observes the
effects of these changes. The patterns in characteristics resulting from his selective breed-
ing of pea plants led Mendel to posit units of heredity (now known to be genes) that
determine variation in inherited characteristics according to set patterns across biological
organisms from pea plants to humans. In part, Mendel conjectured that some heredity
units are dominant and others recessive; this accounts for why purple-flowered plants and
white-flowered plants have purple-flowered offspring (Mendel, 1865/1996).
Figure 2.1 illustrates two crosses between pea plants, showing flower color (purple or
white) and dominant or recessive heredity units, or genes (A and a, respectively; each plant
has two). Flower color was observed from experiments; from these, Mendel postulated the
genes shown here. The grid on the left shows that a purple-flowered pea plant with two
dominant genes and a white-flowered pea plant with two recessive genes have offspring
with entirely purple flowers. But, as the grid on the right shows, purple-flowered pea
plants bred in this way each have one dominant and one recessive gene. Despite having
entirely purple flowers themselves, two such plants have 25% white-flowered offspring.
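To make the arithmetic of these crosses concrete, here is a minimal simulation sketch of ours (in Python; the function name and offspring counts are illustrative, not from Mendel). Each offspring inherits one randomly chosen heredity unit from each parent, and ‘A’ dominates:

    import random

    def cross(parent1, parent2, n_offspring=10000):
        # Each offspring inherits one randomly chosen heredity unit
        # (gene) from each parent, e.g. 'A' from 'Aa' and 'a' from 'aa'.
        counts = {'purple': 0, 'white': 0}
        for _ in range(n_offspring):
            genotype = random.choice(parent1) + random.choice(parent2)
            # 'A' is dominant: one copy suffices for purple flowers.
            counts['purple' if 'A' in genotype else 'white'] += 1
        return counts

    print(cross('AA', 'aa'))  # F1 cross: all offspring purple (all Aa)
    print(cross('Aa', 'Aa'))  # F2 cross: roughly 75% purple, 25% white

Running the second cross repeatedly yields white flowers in about a quarter of offspring, matching the 25% figure above.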
In experiments, as Mendel’s work illustrates, scientists introduce specific changes to
a system in order to make observations about how the system responds. In Chapter 1,
we learned that data are public records produced by observation. Experiments are used
to produce data of one kind or another. Experimental data can include various kinds of
measurements, artifacts, signs, the location of some object, or even an object’s absence.
Mendel’s data consisted of systematic records of the fertilization history and physical
properties of each pea plant. For physicians, the results of blood tests and testimony
about one’s medical history can both count as data. Fossils, tracks, and recordings of the
geochemical features of rocks all can count as data for a paleontologist; and for anthro-
pologists, data may include monuments, pottery, and written documents.

              a    a                     A    a
         A   Aa   Aa                A   AA   Aa
         A   Aa   Aa                a   Aa   aa

         F1 generation              F2 generation

FIGURE 2.1 Illustrations of two crosses between pea plants, representing dominant and
recessive genes for flower color
Another concept from Chapter 1 that we need in order to structure this discussion is scientists’
use of empirical evidence to justify their scientific beliefs. When a hypothesis is used to
develop clear expectations for the outcome of some experiment, and data are gathered
from the experiment that match or conflict with those expectations, then the experi-
ment has produced empirical evidence for or against the hypothesis. The data collected
by Mendel, for instance, turned out to be empirical evidence supporting the belief that
inherited characteristics are caused by discrete units of heredity that come in pairs, one
from each parent.
Often, it’s not obvious what a hypothesis should lead one to expect to observe.
Explicit expectations must be developed from a hypothesis before that hypothesis
can be tested with empirical evidence. Before Mendel’s experiments, most people
believed that physical characteristics resulted from a blending of each parent’s char-
acteristics. This hypothesis would lead us to expect that offspring tend to have
traits that are intermediate between the traits of their parents. So, for example, the
offspring of purple-flowered pea plants and white-flowered pea plants should have
light-purple flowers. Mendel’s observations did not support this expectation; both
crosses in Figure 2.1 illustrate this. But is flower color just an exception to a general
pattern of blended inheritance? Notice that this is a question about what the hypoth-
esis of blended inheritance should lead us to expect. Many different experiments, on
different traits, helped confirm that Mendel’s hypothesis of hereditary units holds
up more generally.
Scientific experiments are designed to be a particularly powerful way to test expecta-
tions against observations. Challenges stem from humans’ tendency toward biased rea-
soning, including a tendency to observe what we want to be true, and the difficulty of
discerning what hypotheses should lead us to expect. Experiments offer two different
approaches to overcoming such difficulties. In some experiments, typically performed in
a laboratory, data are produced under tightly controlled conditions that are designed to
make expectations and observations both as clear as possible. In other experiments, often
performed outside a laboratory or 'in the field', scientists compare specific features of individuals in different, variable groups to make it easier to check observations against expectations. We will dig into the features of both styles of experiment in the next two sections.

Variables and Their Control


Experiments involve introducing specific changes to a system in order to make observa-
tions about how the system responds. At this point, introducing some terminology will
help clarify this central feature of experiments. A variable is anything that can vary, change, or take on different values. For example, the number of books read during the
past year, height, the flower color of a pea plant, and the temperature in your hometown
are all variables. The value of a variable is just its state or quantity in some instance. For
example, the flower color of a pea plant may have values like white, purple, or pink; and
your hometown temperature might have the value 62° Fahrenheit one summer evening.
In experiments, there are three categories of variables: independent, dependent, and
extraneous. An independent variable is a variable that stands alone, that is, whose values
vary independently from the values of other variables in an experiment. When scientists
introduce specific changes to a system in an experiment, they do so by changing the value
of one or more independent variables. This is often called an intervention. A dependent
variable is a variable whose change depends on another variable. When scientists change
the value of an independent variable in some experiment, they do so in order to investigate
how that change affects one or more dependent variables. For example, one might vary
the amount of visible light (independent variable) in a factory or workspace and then
look for changes in workers’ productivity (dependent variable).
Experimental methods are designed to enable scientists to isolate the relationship
between independent and dependent variables. This requires controlling background con-
ditions, or extraneous variables, as much as circumstances allow. Extraneous variables
are other variables besides the independent variable that can influence the value of the
dependent variable. If you’re exploring the relationship between the amount of visible
light in a factory (independent variable) and workers’ productivity (dependent variable),
then extraneous variables include noise levels in the factory, the heights of the workers,
the amount of coffee workers drink daily, the country in which the factory is located,
the weather, and so on.
If extraneous variables are not taken into account, they, and not the independent vari-
able, may be responsible for any changes in the dependent variable. Alternatively, extrane-
ous variables may counteract the influence of the independent variable on the dependent
variable. In these ways, extraneous variables can ‘confound’ the relationship between
the independent and dependent variables. When they do, they are known as confounding variables: extraneous variables whose variation influences the value of the dependent variable in unanticipated ways. Confounding variables can interfere with
the accuracy of the conclusions drawn from an experiment.
Imagine now that you want to investigate the relationship between the amount of
visible light in a factory and workers’ productivity. In particular, your hypothesis is that
better lighting increases workers’ productivity. To test this hypothesis, you could run an
experiment by varying the amount of light (independent variable) and subsequently look-
ing for changes to workers’ productivity (dependent variable). What are some ways you
can think of to change the value of the independent variable? One option is to change
the number of light fixtures in some workspace; another option is to wait for the time of
year to change (there’s more sunlight for longer hours in summer than winter). A third
choice is to compare two factories, one with better lighting than the other.
One thing to consider when weighing these options is the possibility of confounding
variables. Of these options, which introduces the fewest extraneous variables?
Think about all the changes between summer and winter beyond the amount of light; perhaps wearing scratchy wool clothing in the winter or the shorter winter days decrease
work productivity, or perhaps summer heat decreases productivity. These are extraneous
variables that could easily become confounding variables. Likewise, different factories
can have many other differences between them beyond just the quality of lighting.
Perhaps one pays a better hourly wage than the other, offers more vacation, or has free
coffee in the break room. The best option seems to be to choose one workspace and
then vary the number of light fixtures, while keeping all other conditions as uniform
as possible.
You will want to measure and record the values of the independent variable (amount
of light), so as to compare those values with the values observed for the dependent vari-
able (workers’ productivity). How could you measure worker productivity? Perhaps by the
number of widgets produced in one hour? But what if the number of people who come
to work on a given day varies? It’s probably better to measure the number of widgets
produced in one hour divided by, or averaged over, the number of workers. That takes
into account, or controls for, the extraneous variable of number of workers.
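As a quick sketch of this kind of control, the calculation might look as follows in Python; the numbers are invented for illustration.

```python
def productivity(widgets, hours, workers_present):
    """Widgets per worker-hour, controlling for how many workers showed up."""
    return widgets / (hours * workers_present)

# Same raw output on two days with different staffing: the per-worker rate
# makes the days comparable where the raw widget count would mislead.
print(productivity(120, 8, 10))  # 1.5 widgets per worker-hour
print(productivity(120, 8, 15))  # 1.0 widgets per worker-hour
```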
The general point here is that the setup of the experiment, including its location
and timing, how the independent variable is intervened upon, the type of measure-
ments, and so forth, are all shaped by the need to minimize the possibility of con-
founding variables. This is the key to effective experimental design: an independent
variable is varied in a controlled way, and the value of a dependent variable is mea-
sured, while keeping all extraneous variables fixed or taking them into account in
some other way. (This is why we divided widgets produced by number of workers
present to take into account any variation in the number of workers who showed
up to work that day.)
In experiments involving human participants, one confounding variable can be the
Hawthorne effect or observer bias, where experimental participants change their behav-
ior, perhaps unconsciously, in response to being observed. The name of this effect
originates from a series of experiments from the late 1920s and early 1930s that were
performed in a Chicago suburb at Western Electric’s Hawthorne factory (Parsons, 1974).
Some of these experiments investigated the effects of lighting conditions on workers’
productivity. Two groups of workers participated in the study. One group worked in an
area where there was a dramatic increase in the quality of the lighting. For the other
group, the lighting conditions remained just as before. Experimenters discovered that
the worker productivity in the well-illuminated area increased significantly compared to
the other group. This finding seemed to support their hypothesis that improved lighting increases productivity. However, the experimenters were surprised to discover that
workers’ productivity also improved with changes to rest breaks, to working hours, and
to the types of meals offered in the factory’s canteen. As they experimented further,
they found that even dimming the lights to original levels increased productivity! This
result undermined the initial findings about the relationship between the amount of
light and productivity. The experimenters eventually concluded that the changes to the
quality of illumination had no real impact on job productivity. As it turned out, workers
became more productive simply when they knew they were being studied. This is, of
course, the Hawthorne effect in action.


FIGURE 2.2 Western Electric’s Hawthorne factory illumination study

The Hawthorne effect can be found in almost any experiment with human participants
and can be a serious confounding variable. This is related to the observer-expectancy
bias discussed in Chapter 1, which is when researchers’ expectations are themselves a
confounding variable in an experiment. Fortunately, there are experimental methods that
control for the extraneous variables of researchers’ and participants’ expectations; we’ll
get to that topic later in the chapter.

What’s the Stuff of Light?


Let's dig more deeply into the features of experiments by looking at how experimentation contributed to our knowledge of light. The light we ordinarily see is visible or 'white'
light; what is not illuminated appears to us as shadow, darkness. What is the nature of
light? Is it made of more basic matter, and if so, what? And is the light we see the only
kind of that stuff there is? Intuitively, it’s hard to imagine that light could be anything
other than something visible to us.
The nature of light and its relation to the color spectrum visible in rainbows have
been studied for millennia. In Chapter 1, we mentioned Ibn al-Haytham (Latinized as
Alhazen), who, during the Islamic Golden Age, made important contributions to the
scientific understanding of vision, optics, and light. In his book Kitāb al-Manāẓir (Book of Optics), he evaluated existing theories of light and vision, emphasizing that carefully
designed experiments are a basis of our knowledge of the world. Through experiments
using lenses and mirrors, Ibn al-Haytham showed that light travels in straight lines. From
dissections, he began to explain how the eye works and synthesize the medical knowledge
of previous scholars. In particular, Ibn al-Haytham demonstrated that light is not produced
by the eye, as some theories had claimed, but instead that it enters the human eye from
the outside. Once it was clear that light given off by objects enters the eyes, this raised
new questions about the nature of light (Al-Khalili, 2015).
In the centuries following Ibn al-Haytham’s breakthrough work, many other philoso-
phers and scientists engaged with those questions. In the 17th century, influential natural
philosophers thought that colored light was produced by the modification of white light
by interactions with objects and the materials through which it travels. So, passing light
through a glass prism was thought to produce a spectrum of colors because white light is
modified by the impurities of the glass. Similarly, it was thought that we perceive colorful
rainbows because sunlight is modified by going through drops of moisture.
Isaac Newton (1643–1727), one of the most influential scientists of all time, was not
convinced by this view. Instead, he hypothesized that colors are always contained within
the light itself and that passing light through materials just separates out the colors of
which light is made. To test these competing hypotheses, Newton darkened his room and
bored a small hole in the window shutters, so that only a thin beam of light could enter
the room. When Newton placed a glass prism in the beam, the spectral colors—a rainbow
of light—appeared on his wall. This observation was consistent with both hypotheses,
however. Both the modification hypothesis and Newton’s hypothesis that white light is
a mixture of colors could explain the observation that a beam of light travelling through
a glass prism produces a spectrum of colors.
In another experiment, Newton passed a beam of light through two prisms instead
of one. What would you expect to observe if the modification hypothesis were true?
Presumably, the impurities contained in the two glass prisms would continue to modify
white sunlight and just spread out the color spectrum further. When Newton let the beam
of light pass through the first prism, it split into a spectrum of colors as expected, just
like in the previous experiment.

FIGURE 2.3 Isaac Newton's illustration of his two-prism experiment

But when the spectrum of colored light passed through
the second prism, it recomposed back into white light! This observation was unexpected
under the modification hypothesis, but it was consistent with Newton’s thought that
white light is composed of colors. So, this experiment provided Newton with evidence
against the modification hypothesis and in support of his own hypothesis that passing
light through a prism merely separates out what is already there.
While experiments can generate scientific knowledge, they also often prompt new
questions. This was so for Ibn al-Haytham’s finding that light does not originate in the
eye and also for Newton’s later prism experiments. Light isn’t just something we see but
also something we feel; surely you’ve noticed that ordinary sunlight is warm. Newton’s
finding that visible white light is actually a spectrum of colors prompted further questions.
If light is a spectrum of colors, is it also a spectrum of temperatures? Or are different
colors of light the same temperature as one another?
In 1800, the British astronomer William Herschel (1738–1822) used a telescope to
observe sunspots, which are regions on the Sun that appear temporarily dark (Herschel,
1801). Observing sunspots is hazardous for the eyes, so he used colored glass filters to
reduce the intensity of the rays. Herschel noticed that he could feel the Sun’s heat com-
ing through the filters. Different filters seemed to differ in temperature; but since the
filters didn’t differ in material, Herschel wondered whether the different colors of the filters
might actually be responsible for the differences in temperature. Notice that this wasn’t
what Herschel had set out to investigate; sometimes experiments, or observations more
generally, take us in unanticipated directions.
Herschel tested his hypothesis about a relationship between light’s color and
temperature by directing sunlight through a prism to spread the spectral colors, as
Newton had. Then he measured each color—red, orange, yellow, green, blue, indigo,
violet—with a mercury thermometer. He also measured the ambient temperature in
the room in order to have a baseline temperature to compare with the temperature
measurements of the light. This setup yielded data in the form of measured values of
color (independent variable) and measured values of temperature (dependent variable),
which could be used as evidence to evaluate the hypothesis that different colors of
light differ also in temperature. The evidence confirmed this hypothesis: Herschel
found that the temperatures increased incrementally from the ‘cool’ colors like blue
to the ‘warm’ colors like orange.
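The logic of Herschel's comparison can be sketched in a few lines of Python. The temperature readings below are invented for illustration, since the chapter does not report his actual figures.

```python
# Hypothetical readings in degrees Fahrenheit, for illustration only.
ambient = 62.0  # baseline: the ambient temperature of the room
readings = {"violet": 63.5, "green": 65.0, "red": 67.5}

# The data bear on the hypothesis if the colors differ from the baseline
# and from one another in a systematic way.
for color, temperature in readings.items():
    print(f"{color}: {temperature - ambient:+.1f} degrees above ambient")
```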
Another of Herschel’s observations introduced a new question about light. Herschel
also measured the temperature of the air just beyond the beam of red light, outside
the edge of the spectrum created by sunlight through the prism, where no light was visible. His hypothesis was that this temperature would be the same as the ambient
temperature in the room, since it was beyond the edge of the light spectrum. To his
surprise, the temperature at that location was much warmer than the ambient room
temperature, even higher than any of the temperature measurements for the light
spectrum. How could that be?
Herschel’s observation immediately led to a new hypothesis: some kind of invisible, hot
light exists just beyond the red part of the visible spectrum. This hypothesis would explain
the observation—anticipated by the French physicist Émilie du Châtelet (1706–1746)
almost 65 years earlier—that the temperature continued to increase beyond the edge of
red light. Later observations confirmed this hypothesis, and we now accept the existence
of this hot, invisible light. It’s called infrared light.

FIGURE 2.4 William Herschel’s experimental setup to test the relationship between the color
and temperature of light


FIGURE 2.5 Three scientists who contributed to our knowledge of light

Experimental Setup
Experiments have different aspects—physical, technological, and social—that need to fit
together in the right way for scientists to harvest useful evidence; how these aspects are
arranged is the experimental setup.
First, there are concrete, physical aspects. Experiments involve one or more subjects:
humans, non-human animals, or inanimate objects. They also often include instruments:
technological tools or other kinds of apparatus that help enable the experimental process.
Newton and Herschel used telescopes, lenses, prisms, light filters, pencils, and notebooks
to collect and analyze their data. Present-day experiments in high-energy physics at the
European Organization for Nuclear Research, CERN, take place in the Large Hadron
Collider. This is located in a tunnel on the border between France and Switzerland, and
it is used to accelerate and collide subatomic particles. The Large Hadron Collider took
10 years to construct (1998–2008) and involved the collaboration of over 10,000 sci-
entists and technicians from more than 100 countries and hundreds of universities and
laboratories. With a circumference of 27 kilometers, it is currently the largest scientific
instrument in the world. CERN experiments also require the use of powerful computers
for data collection, analysis, and visualization of the myriad particles produced by colli-
sion in the accelerator.
Experiments also occur in some place, over some period of time. Experiments can take
place in laboratories located in universities and hospitals or in the field, that is, in natural
settings like classrooms, subway stations, glaciers, coral reefs, nesting areas, and so on.
Some experiments have a short duration; others can last many years. Herschel observed
different temperatures related to different colored sun filters in one day, on February 11,
1800. Mendel’s experiments with pea plants stretched over a seven-year period. Present-
day experiments at CERN can take dozens of years, as do the experiments carried out in
space by the US National Aeronautics and Space Administration (NASA).
Experiments are also normally carried out by one or more individual scientists.
Collaborative experiments are common in contemporary science; this is one element of the
social structure of science discussed in Chapter 1. Most collaborative experiments involve scientists with different backgrounds who rely on one another's expertise. Experiments at
CERN, for example, are highly collaborative, run by hundreds of scientists and engineers
from all over the world, each of whom brings some specific expertise to bear. This is
more extensive collaboration than is common across science though, and some scientific
experiments are still run by a single lab or even an individual. But even in those cases,
communities of scientists, represented by scientific institutions and societies, determine
protocols to be followed in experimental design and data analysis.
Another aspect of experimental setup is harder to discern but just as important to pro-
ducing evidence. These are the background conditions or extraneous variables. Consider
Newton’s prism experiments. The room at Trinity College, Cambridge, where he per-
formed these experiments, had a certain ambient lighting, temperature, and humidity. The
angle at which sunlight hit the room’s windows varied by time of day and season. Prisms,
the instruments Newton used, were not commonly thought of as scientific instruments
in the 1660s and so were sold simply for their entertainment value. As a result, they
were irregular in both size and composition. These factors were all in the background of
Newton’s experiment.
So, Newton needed to show that none of these background factors undermined his
conclusion that apparently white sunlight contains distinct colors within it (Newton,
1671/1672). As it happened, the Royal Society—the learned society for science of which
Newton was a member—criticized his results on the basis of the condition of the prisms.
The Royal Society suggested that, consistent with the earlier modification hypothesis,
the prisms’ bubbles, veins, and other impurities caused the light to become colored as it
passed through. In general, managing background conditions is one of the most challeng-
ing issues of running experiments.

Collecting and Analyzing the Data


An experimental setup elicits data. Those data must be collected and analyzed in order to
compare the experimental outcome with expectations. Collecting data involves gathering
and often measuring information about the values of variables of interest at particular
times, places, and contexts. Climate scientists collect data from things like glaciers, oceans,
and the atmosphere—for example, glaciers’ mass balance, sea surface temperatures, and
the atmospheric pressure at sea level. The choice of an appropriate method for data col-
lection depends on many factors, including one’s research interests, the hypothesis under
investigation, the variables of interest, and the available instruments. Any method should
ensure that data are collected thoroughly and accurately—enough to provide evidence
of the desired form and to enable replication.
Quite often, data collection involves one or more specialized instruments. This
may sound odd, but the acceptance of instruments for data collection in science
was not achieved without struggle. During the Scientific Revolution, a main chal-
lenge was to legitimize the data collected using glassware like prisms, telescopes,
and microscopes, as well as scales, chronometrical devices, and other instruments.
We saw earlier how this challenge factored into the reception of Newton’s findings
(Schaffer, 1989).


While there’s no longer any question that instruments in general can play an essential
role in data collection, questions about the reliability of specific instruments still arise. No
scientific instrument is free from error. For example, in 2017, scientists at the National
Institute of Standards and Technology (NIST) used a Kibble balance—an instrument for
making extremely accurate measurements of the weight of an object—to determine the
most precise value yet of the Planck constant, which is an important quantity in quantum
physics named after the German physicist and Nobel Prize winner Max Planck (1858–
1947). But even after more than 10,000 measurements, those scientists were still left
with uncertainty about the exact value of the Planck constant, partly because of the error
involved in any measurement (Haddad et al., 2017). (The value of the Planck constant is about 6.626069934 × 10⁻³⁴ Joule · second, in case you were wondering.) Such measurement error is an inherent part of data collection. Ultimately, the best that scientists
can do is to avoid systematic measurement error by continually calibrating instruments,
where calibration involves the comparison of the measurements of one instrument (for
example, an electronic ear thermometer), with those of another (for example, a mercury
thermometer), to check the instrument’s accuracy so it can be adjusted if needed.
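Here is a minimal sketch of calibration-by-comparison in Python. The readings are hypothetical, and real calibration protocols are considerably more involved.

```python
# Paired measurements of the same temperatures by two instruments.
reference = [98.6, 100.4, 97.9]  # trusted mercury thermometer (degrees F)
device    = [99.1, 100.9, 98.4]  # electronic ear thermometer (degrees F)

# Estimate the device's systematic offset, then correct its readings.
offset = sum(d - r for d, r in zip(device, reference)) / len(reference)
corrected = [round(d - offset, 1) for d in device]
print(f"estimated offset: {offset:+.2f}")  # +0.50
print(corrected)                           # now matches the reference readings
```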
Different types of data can be analyzed in different ways. One basic distinction is that
data can be either quantitative or qualitative. Quantitative data are in a form—often
numerical—that makes them easily comparable. Climate science data, for example, are
often quantitative. They are recorded as arrays of numbers, numerical indices, and symbols
that correspond to measurable physical quantities. Such quantitative data can be used
for statistical analysis (see Chapters 5 and 6) and computer simulation (see Chapter 3).
Qualitative data consist of information in non-numerical form. This information can
be obtained, for example, from diary accounts, unstructured interviews, and observa-
tions of animal behavior. Analysis of qualitative data is often less straightforward than
quantitative analysis. It requires accurate description of subjects’ responses and behavior,
trustworthy informants, and significant background knowledge. We will say more about
qualitative research in Section 2.3.
In experiments with human subjects in the social, cognitive, and behavioral sciences,
data collection often involves questionnaires that create quantitative, numerical data from
qualitative information. These questionnaires may include multiple-choice questions and
scales of various kinds. For example, standardized tests like the SAT, used for admission
decisions to colleges and universities in North America, are considered predictors of stu-
dent performance. Student performance varies along multiple dimensions, but the SAT
and similar tests boil this down to a single score for each test taker that is relative to other
test takers’ performance. Other questionnaires provide quantitative data about personal-
ity traits, political opinions, attitudes toward some topic or group of people, and so on.
Questionnaires can be a very useful form of data collection, but good questionnaire
design is vital for collecting reliable data. This is like the need for instrument calibration
described earlier. And, for questionnaires, effective design and calibration can be surpris-
ingly difficult. A poorly designed question can prime subjects to answer in a certain
way—often, because of the observer-expectancy effect, the way the experimenter expects
or desires them to answer. Questions can also be vague or ambiguous, eliciting different
kinds of responses from different people, or they can unintentionally ask about more than one thing at once. Frankly, there are many ways to go wrong, and so there are many more poorly
designed surveys out there than well-designed surveys. Poorly designed questionnaires can
result in data that are too weak to count as evidence or to support inferences, or that are
otherwise useless because they cannot be analyzed in the intended way.

Crucial Experiments and Repeat Experiments


Think back to Newton’s two-prism experiments. Newton thought these experiments
proved that white light is a mixture of colors, that it is not modified by a prism to
become colored. Newton called this a crucial experiment (experimentum crucis), which
is an experiment that decisively adjudicates between two hypotheses, settling once and
for all which is true.
Such decisive experiments are exciting, but very few experiments are actually crucial
experiments. One reason why relates to extraneous variables. We have said that controlling
extraneous variables, or background conditions, is important to an experiment’s ability
to provide good evidence. But it’s virtually, if not entirely, impossible to control all back-
ground conditions. Some minor background condition assumed to be irrelevant might
turn out to be a confounding variable, invalidating the experimental result.
Even if an experiment could completely control all extraneous variables, this still
might not be enough to guarantee the experimental result is correct. Data may match
the expectations, or may fail to do so, for unexpected reasons. Some other, unknown
phenomenon that hasn’t yet been investigated might actually turn out to be responsible
for the experimental result; some other hypothesis might turn out to be true instead.
This is called the underdetermination of hypotheses by data: the evidence is not sufficient to determine which of multiple hypotheses is true. Some think that every hypothesis is
always underdetermined by the data, that there is always some hypothesis (perhaps not
yet known) that is also consistent with all the data, no matter how much is collected.
Here is an illustration of underdetermination. Suppose you want to test the hypothesis
that playing violent video games causes violent behavior. If this hypothesis is true, then
you should expect that more time spent playing violent video games is linked to more
instances of violent behavior. But this is also what you should expect if those people who
are already prone to violence tend to play violent video games more often than other
people, or if people’s tendencies to be violent and to play violent video games are both
caused by some other factor, such as a personality disorder or parental neglect. There are
experiments that can determine which of these three possibilities is right. For instance,
you might assign people to play different amounts of violent video games (intervening
on the independent variable of violent-video-game playing) and then record their level
of violence. If you observe increased violent behavior, then the intervention—the violent-
video-game playing—is responsible for it. But is it the violent video games, or would
playing any video games at all result in more violent behavior? A new experiment is
called for. If it’s the violent video games, is it a particular form of violence or any violent
video games? These kinds of questions are always possible. Other untested hypotheses
often lurk right around the corner. For that reason, few if any experiments are crucial
experiments that decisively favor a given hypothesis.
There’s also a problem with the idea that an experiment can definitively prove some
hypothesis is wrong. An experiment to test some hypothesis involves a number of aux-
iliary assumptions—assumptions that need to be true in order for the data to have the intended relationship to the hypothesis under investigation. When data do not match
expectations, this might be because the hypothesis is wrong, or it might be because one
of the auxiliary assumptions is wrong. Perhaps your data collection instrument is miscali-
brated, or your group of subjects is atypical, or there’s some confounding variable you
haven’t predicted. So, whether the data from an experiment match your expectations
or not, this is not truly decisive. One experiment can weigh in favor of or against some
hypothesis, but it generally can’t settle the matter once and for all.

Box 2.1 How Should Scientists Handle Underdetermination?


We have suggested that the underdetermination of hypotheses by the data is
common or possibly unavoidable. How should scientists proceed in light of this?
One response to the problem would be to suspend judgment about which hypoth-
esis should be accepted. But suspending judgment just isn’t an option when we
need to build a bridge or design an effective drug. One solution is to seek more
evidence to help us decide between the hypotheses we’re most concerned with.
Additionally, hypotheses and theories that fit with the existing data are sometimes
more or less appealing in other regards. In the mid-16th century, both the Ptolemaic
geocentric and Copernican heliocentric theories fit all of the existing cosmological
data. But Copernicus’s theory was said to be more elegant and harmonious than
Ptolemy’s. It certainly was simpler, as Ptolemy’s theory could accommodate some
data only by introducing adjustments that complicated the theory tremendously.
Considerations of this kind were part of the reason why Copernicus’s theory
superseded Ptolemy’s. In other cases, one hypothesis might lead to more fruitful
novel experiments, might fit better with other scientific findings, and so on. The
general point is that underdetermination seems to be a circumstance in science
where considerations beyond empirical evidence contribute to which hypotheses
or theories scientists accept.

We’ve discussed three sources of uncertainty about what an experiment shows: extra-
neous variables, unanticipated hypotheses, and auxiliary assumptions. One of the primary
ways to minimize uncertainty from these three sources is for experiments to be repli-
cated. Replication involves performing the original experiment again—often with some
modification to its design—in order to check whether the result remains the same. If, for
example, the spectrum of light recombining into white light observed by Newton is also
observed by different people, using different prisms, in different places and at different
times, then this additionally supports Newton’s hypothesis that white light contains a
spectrum of colors. If some experimental result cannot be replicated—if different scien-
tists follow similar experimental procedures but do not get the same result—then the
original experimental result may be a fluke, or it may be due to some confounding variable
in the experimental setup that the scientists haven’t yet identified.
The replicability of experiments is an indispensable ingredient of science, so much so
that a persistent failure to replicate findings may undermine a scientific field’s credibility.


For example, we saw in Chapter 1 that astrology’s failure to replicate findings is part
of its pseudoscientific status. Recently, it has also been suggested that the field of social
psychology faces a crisis in replicability, where different research groups have tried but
failed to replicate some classic experimental results. This suggests we should perhaps not
put too much stock in those findings, unless this failure in replicability is resolved (Pashler
& Wagenmakers, 2012).
The difficulties in designing genuinely crucial experiments and the importance of rep-
lication fit with the idea that science is essentially a collaborative, social venture. Because
of this, gaining scientific knowledge via experimentation is generally more complicated
and slower than a single dramatic experiment. Also, scientific knowledge can go in unex-
pected directions: a surprising finding that upends something we thought we understood
might be right around the corner.

Other Roles for Experiment


So far, we have focused on one central purpose of experiments: to test hypotheses by provid-
ing confirming or disconfirming evidence. But experiments play many other roles as well.
Experiments can be used to evaluate whether scientific instruments like telescopes and
prisms function as expected. For example, to persuade other members of the Royal Society
that his hypothesis was true, Newton had to show that his prisms ‘worked properly’.
Many of his experimental trials were then aimed at testing how prisms with different
shapes and composition affected the spectrum produced. The publication of his Opticks
(1704/1998) described these trials and their results in detail. Supported by this extensive
data and Newton’s theory of colors, prisms became accepted scientific instruments.
We have already discussed calibration in this chapter. Any instrument for data collec-
tion must be calibrated using known measurements before it can be used in an experiment
with uncertain results. For example, a thermometer must be shown to measure the known
temperature accurately before we can trust its measurement of an unexpected tempera-
ture. Calibrating thermometers requires the establishment of ‘fixed points’, such as the
boiling (100° Celsius) and freezing (0° Celsius) points of water, to create a meaningful
temperature scale to apply across different thermometers. When some standardized scale
is established, instruments can be used repeatedly and their measurements compared
over time and across instruments. This body of measurement data might then be used to
construct more stable measurement scales and more accurate instruments.
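To illustrate, here is a sketch of constructing a temperature scale from two fixed points, assuming, as an idealization, that the instrument responds linearly between them; the raw readings are hypothetical.

```python
raw_at_freezing = 12.0  # hypothetical raw reading at 0 degrees Celsius
raw_at_boiling = 88.0   # hypothetical raw reading at 100 degrees Celsius

def to_celsius(raw):
    """Linearly interpolate a raw reading onto the Celsius scale."""
    return 100.0 * (raw - raw_at_freezing) / (raw_at_boiling - raw_at_freezing)

print(to_celsius(12.0))  # 0.0: the freezing fixed point
print(to_celsius(50.0))  # 50.0: halfway between the fixed points
```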
Brain-imaging techniques provide another illustration of using experiments to establish
the function of and to calibrate an instrument for data collection. Functional magnetic
resonance imaging (fMRI) machines track blood flow in the brain. They do not directly
measure neural activity, but that is what the scientists employing these machines want to
assess. Neuroscientists use data about blood flow to reason about neural activity because they
know that greater neural activity requires more energy, which requires increased metabolism,
which uses more oxygen, and oxygen is delivered by blood flow. The expectation that blood
flow provides a good proxy for neural activity is also confirmed by findings concerning brain
metabolism and the relationship between different brain areas and functions.
Besides evaluating and calibrating instruments, experiments can be used to deter-
mine the value of physical constants, or quantities that are believed to be universal and
unchanging over time. We mentioned Planck's constant earlier. Another physical constant is the speed of light in a vacuum. In the Opticks, Newton reported the calculations of the
Danish astronomer Ole Rømer (1644–1710) regarding the speed of light. Rømer observed
that there could be a difference of up to 1,000 seconds between the predicted and
observed times of the eclipses of Jupiter’s moons. Based on the estimated distance between
Jupiter and the Earth, Rømer concluded that light travels at about 200,000 kilometers per
second. In 1849, the French physicist Hippolyte Fizeau ran the first major experiment to
precisely determine the speed of light. Fizeau built an experimental apparatus in which
an intense light source and a mirror were placed eight kilometers (about five miles) apart.
He placed a rotating cogwheel between the light source and the mirror and increased
the speed of the wheel until the reflection back from the mirror was obscured by the
spinning cogs. Based on the rotational speed of the wheel and the distance between the
wheel and the mirror, Fizeau calculated that the speed of light is 313,000 kilometers per
second. Rømer’s estimate and Fizeau’s later calculation were on the right track; today,
we take the speed of light to be 299,792 kilometers per second.
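Fizeau's inference can be reconstructed as a back-of-the-envelope calculation. The chapter gives only the distance and the result; the tooth count and rotation rate below are commonly cited historical values, supplied here as assumptions to make the sketch runnable.

```python
distance_m = 8_633       # light source to mirror (about eight kilometers)
n_teeth = 720            # teeth on the cogwheel (assumed historical value)
rotations_per_s = 12.6   # wheel speed at which the return beam was blocked (assumed)

# The return beam is blocked when a tooth has rotated into a gap's place,
# which takes 1 / (2 * n_teeth * rotations_per_s) seconds.
blocking_time_s = 1 / (2 * n_teeth * rotations_per_s)
speed_of_light = 2 * distance_m / blocking_time_s  # round-trip distance / time
print(f"{speed_of_light / 1000:,.0f} km/s")  # roughly 313,000 km/s
```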
A third role of experiments is exploratory. In this use, experimentation does not rely
on existing theory and may not be aimed to test a specific hypothesis. An exploratory
experiment is used to gather data to suggest novel hypotheses or to assess whether a poorly
understood phenomenon actually exists. Herschel’s work on the relationship between
heat and light, for example, did not rely on a particular theory or a hypothesis about the
relationship. When, in the course of investigating sunspots, he discovered that red light
has a greater heating effect, Herschel surmised that the light spectrum is made of both
heat and colors. This idea was on the right track, but it was not until James Clerk Maxwell's
(1831–1879) theory of electromagnetic radiation that Herschel’s observations could be
adequately explained and his work vindicated.

EXERCISES

2.1 Review the discussion of Newton’s prism experiment. Identify the hypothesis under
investigation, the independent variable, and the dependent variable, and describe
the intervention.
2.2–2.5 The Anglo-Irish scientist Robert Boyle (1627–1691) used equipment like vac-
uum chambers, air pumps, and glass tubes in his experiments. With the assistance
of Robert Hooke, Boyle conducted a series of experiments in the 1660s to ascertain
how the pressure and volume of the air vary when the air is either ‘compressed or
dilated’. He used a J-shaped glass tube. The tube was closed off at the short end,
and the long end was left open. By adding mercury in the longer end, Boyle could
trap air in the curved end of the tube; by changing the amount of mercury, he was
also able to change the air pressure at the short end. Boyle repeated this experi-
ment, measuring the volume of the air in the short end of the tube at a range of
pressures. What he discovered was that, as he increased the pressure on the air,
the volume of the air would decrease. Boyle’s formulation of this relationship would
become the first gas law, now known as Boyle’s law.
2.2 What was the hypothesis under investigation? Use that hypothesis to identify the
independent variable and the dependent variable. What evidence was gained from
this experiment?


2.3 Make a list of 10 extraneous variables in Boyle’s experiment. Put a star next to any
variables that you think might have been confounding variables, and say why. Try
to do this for at least two variables on your list.
2.4 Think of an alternative hypothesis that could account for the results of Boyle’s experi-
ment. State that hypothesis, and describe how it could account for the data.
2.5 Define calibration, and describe how it was involved in Boyle’s experiments.
2.6 Describe three features of experiments that are particularly valuable to testing
hypotheses, and describe the value of each of those features.
2.7 What is the relationship between extraneous variables and confounding variables?
Why are experiments designed to limit confounding variables?
2.8 List the three kinds of sources of uncertainty regarding what a given experiment
shows. Describe each one, and give an example of each.
2.9 Describe the problem of underdetermination, and discuss how scientists deal with it.
2.10 Briefly describe three roles for experiments other than testing hypotheses, and give
an example of each. Then discuss how each of these might relate indirectly to testing
hypotheses.
2.11 Before Ibn al-Haytham’s work, some thought that vision involved light shining out of
the eye, coming into contact with objects, and thereby making them visible. This was
known as the emission theory of vision.
Describe an experiment that would test the emission theory of vision. What would
you expect to observe in that experiment if the emission theory were true? Finally, list
the auxiliary assumptions you would need to make in order for the emission theory
to generate those expectations.
2.12 Ibn al-Haytham set up the following experiment to test the emission theory of vision.
He stood in a dark room with a small hole in one wall. Outside of the room, he hung
two lanterns at different heights. He found that the light from each lantern illuminated
a different spot in the room. For each, there was a straight line between the lighted
spot, the hole in the wall, and one of the lanterns. Covering a lantern caused the
spot it illuminated to darken, and exposing the lantern caused the spot to reappear.
a. What data were produced by this experiment?
b. How do the data provide evidence against the emission theory?
c. Describe one way in which the emission theory might be adapted to account for
the data (but still remain an emission theory of vision).
d. Describe one new hypothesis you can formulate based on the results of Ibn al-
Haytham’s experiment.

2.2 THE PERFECTLY CONTROLLED EXPERIMENT

After reading this section, you should be able to do the following:

• Identify the features of a perfectly controlled experiment and characterize the importance of each


• Describe the difference between direct and indirect variable control
• Describe the steps to conducting a perfectly controlled experiment of a given hypothesis

The Perfect Experiment


In a perfectly controlled experiment, experimenters perform an appropriate interven-
tion on an independent variable and then measure the effect of this intervention on
the dependent variable. All extraneous variables are fully controlled, so no confound-
ing variables are possible. Any change in the behavior of the system thus must be due
to the experimenters’ intervention. This doesn’t eliminate the possibility that some
unknown hypothesis also accounts for the data or that some auxiliary assumption was
wrong, but it does eliminate the possibility that some confounding variable interfered
with the effect.
Such an experiment is simple to describe, but in fact no experiment is perfect. It’s
very difficult to get even close to this ideal in practice. But a great way to shed light on
important elements of experiment design is to consider the ideal of the perfectly con-
trolled experiment.
We’ll start the discussion with a step that didn’t even make it into the brief charac-
terization of the perfectly controlled experiment just given: defining expectations. We’ll
then discuss intervention, variable control, and controlling for bias. Our description of
the perfectly controlled experiment leaves out the later stages of data collection and
analysis. We gave an overview of those stages in Section 2.1, and Chapter 6 discusses the
statistical analysis of data in depth.

Defining Expectations
To test a hypothesis with an experiment, an important first step is to articulate what the
hypothesis would lead you to expect for the outcome of the experiment. Those expecta-
tions are predictions of the results of some intervention if the hypothesis in question is
true. The expectations might also be informed by background knowledge or some general
accepted theory. Ideally, expectations are clearly and precisely defined in advance in a
way that makes them easily comparable to the data the experiment will produce. This is
important for controlling the extraneous variable of experimenters’ beliefs, which other-
wise may influence their perceptions of the experimental results (recall from Chapter 1
the power of confirmation bias).


Suppose that your knowledge of Sigmund Freud’s psychoanalytic theory leads you to
form a hypothesis about someone’s personality. Perhaps you wonder whether your friend
Philippe’s fear of horses is due to an Oedipus complex—that is, to Philippe’s unconscious
and suppressed desire for his mother. How might you test this? Freudian psychoanalysis
involves interesting ideas, but they’re just too imprecise and intractable to use as the
basis for the formulation of clear predictions. Since psychoanalysis does not yield any
clear predictions in advance, it seems to be open only to confirmation, not falsification or
disconfirmation. This recalls the discussion of pseudosciences, like astrology and home-
opathy, from Chapter 1.


Contrast this with Albert Einstein’s theory of general relativity. This theory revolu-
tionized our understanding of space and time. While Newton believed that space is a
sort of absolute stage on which events unfold, Einstein conceived of space and time as
a single interwoven manifold, a fabric of sorts. For Newton, gravity was a force; Einstein
instead explained gravity as the curvature of the space-time manifold. Just as marbles
placed on a fabric sheet held in the air bend the sheet around them, massive objects like
the Sun warp space-time in their vicinity. This is why other objects accelerate toward
those massive objects.
Unlike Freud’s psychoanalytic theory, Einstein’s theory of general relativity generates
clear expectations. One of these expectations is that light, just like any other form of
matter, is affected by gravity. If a beam of starlight passes near the Sun, then it should be
deflected, or bend, toward the Sun. The beam’s deflection can be measured as the angle
between where we actually see the star and where we would expect to see the star if
the beam of light had travelled in a straight path. Einstein’s theory also provides us with
a precise prediction of this angle.
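That predicted angle can be computed from the theory's deflection formula for light grazing the Sun, theta = 4GM/(c²R). The constant values below are standard figures, not numbers from the text.

```python
import math

G = 6.674e-11      # gravitational constant (m^3 kg^-1 s^-2)
M_sun = 1.989e30   # mass of the Sun (kg)
c = 2.998e8        # speed of light (m/s)
R_sun = 6.963e8    # radius of the Sun (m)

# Deflection of starlight grazing the Sun, per general relativity: 4GM / (c^2 R)
theta_rad = 4 * G * M_sun / (c**2 * R_sun)
theta_arcsec = math.degrees(theta_rad) * 3600
print(f"{theta_arcsec:.2f} arcseconds")  # about 1.75, the angle Eddington tested
```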
This prediction could first be tested a few years after Einstein completed his theory in
1915. On May 29, 1919, when a total solar eclipse blocked out the dazzling light of the
Sun, a group of scientists led by English astronomer Arthur Eddington took photographs
of stars visible near the dimmed Sun. They compared these to other photographs taken
at night, when the light of those same stars did not pass close to the Sun before reaching
Earth. From this comparison, Eddington was able to test, and confirm, Einstein’s predic-
tion of the light’s deflection. The Sun changed the path of nearby starlight as the theory of
general relativity predicted, providing confirmation of that theory (Dyson, Eddington, &
Davidson, 1920). When the press reported that a key prediction of Einstein’s theory had
been borne out by observation, Einstein became a famous public figure.
Here’s another example of a clear and precise expectation based on a hypothesis.
This example comes from game theory, which is a broad framework for thinking about
conflict and co-operation among strategic decision-makers. Imagine you are given $10.
You’re asked to share this sum with a partner, and you and your partner must agree about
how to divide it. You can propose a division of the $10, and your partner can accept or
reject that offer. If your partner rejects your proposed division, neither of you will get
any money; if your partner accepts your offer, you’ll each get your agreed-upon share of
the money. What would you do?
Based on standard game theory, if everyone acts in their own self-interest, one would
expect that proposers in this situation will offer close to nothing to their partners and
that responders will accept anything more than $0. For responders, it’s rational to accept
anything, since otherwise they’ll get nothing. And proposers know this, so it’s rational for
them to offer only a small amount.
This expectation has been experimentally tested time and again, and it turns out to
be wrong (Güth, Schmittberger, & Schwarze, 1982). The average offers are around 40 to
50% of the total sum, that is, about $4 or $5 when $10 is being divided. And when
proposers offer less than 30%, responders consistently reject the offer, deeming it unfair,
even though this results in them getting no money at all. The proposers and responders
were on the same page, apparently willing to sacrifice self-interest for fairness. This was
not at all what standard game theory predicted.


FIGURE 2.6 Headlines reporting on Arthur Eddington’s observations during the 1919 eclipse,
which confirmed Albert Einstein’s theory of general relativity

As both this example and the one before illustrate, scientists’ hypotheses and theories
often involve concepts and variables that we have no obvious way to measure. This is
a stumbling block in formulating clear expectations for an experiment. How can you
measure the values of variables like wealth, violence, mood, and fairness? To manage this
difficulty, scientists often use operational definitions and clusters of indicators to charac-
terize fuzzy concepts in a way that allows for measurement.
An operational definition is a specification of the conditions under which some concept
applies, enabling measurement or other kinds of precision. In the game theory example
above, we might operationally define fair offers as any offer of 40–60% of the money.
This definition is clear and useful, not because it correctly states the nature of fairness,
but because it offers a precise way to proceed with testing. Operational
definitions can lack nuance. Wealth, for example, means more than simply having a high
income or a parent who is a high earner. This is why economists often study poverty and
wealth using a combination of indicators such as yearly income, access to education and
health care, and permanent housing. Such cluster indicators identify several markers of
some variable in order to more precisely measure it while not oversimplifying it.
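As an illustration, here is how an operational definition and a cluster of indicators might look when written out explicitly. The 40–60% fairness threshold comes from the operational definition above; the wealth indicators and the income cutoff are invented for the example.

```python
def is_fair_offer(offer, total):
    # Operational definition: a 'fair' offer is any offer of 40-60% of the money.
    share = offer / total
    return 0.40 <= share <= 0.60

def wealth_indicators(person):
    # A cluster of indicators: several measurable markers combined,
    # rather than any single criterion like income.
    return {
        "income_above_cutoff": person["income"] > 40_000,  # cutoff is invented
        "health_care_access": person["health_care"],
        "permanent_housing": person["permanent_housing"],
        "years_of_education": person["education_years"],
    }

print(is_fair_offer(4, 10))  # True: a $4 offer out of $10 counts as fair
print(is_fair_offer(2, 10))  # False: a 20% offer falls outside the definition
```

Neither function captures the nature of fairness or wealth; each simply makes the concept precise enough to measure.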


For some concepts, there simply is no single best definition or measure. There is,
accordingly, a choice about how to operationally define such concepts or which cluster
indicators to use. Still, some definitions are better than others. Some definitions or sets
of indicators get closer than others to capturing what we have in mind by, for instance,
a fair deal or being
wealthy. Our theories of the phenomena under investigation regularly inform how we
define concepts. For example, some definitions may be shown to specify the nature of
poverty more accurately than others because they accord better with our best economic
and sociological theories or because these definitions have been shown to better predict
future events consistently across studies.

Intervention
An experimental intervention is the centerpiece of a perfectly controlled experiment.
Recall that an intervention is a direct manipulation of the value of a variable. Because
of this intervention, that variable is called the independent variable. Interventions could
include the administration of a drug to a group of patients, fertilizer to a plot of land,
or deliberate changes in the lighting conditions in a workplace. During an experiment,
scientists deliberately intervene on the independent variable and then measure the impact
of their intervention on the dependent variable. In an agricultural experiment, for example,
scientists may assess the hypothesis that a particular fertilizer is better for crop yield.
Their intervention would consist in changing the value of the variable of interest: the
type of fertilizer. They would switch to the fertilizer the hypothesis predicts is better.
They then would watch for changes in crop yield,
the dependent variable. The expectation based on the hypothesis is that crop yield will
increase; their measure of the value of the dependent variable, crop yield, is a way to
assess this hypothesis.
There are many different ways to perform experimental interventions. But ideally,
scientists want interventions to be ‘surgical’. This metaphor suggests interventions should
be made with the precision that surgeons bring to the operating table; the incision should
be carefully made at the exact location that will bring about the desired effect. If an
intervention is surgical in this sense, it affects only the independent variable. Any change
in the value of the dependent variable can then be traced back to the independent vari-
able’s influence. A surgical intervention on the type of fertilizer will simply switch out
the old fertilizer for a new kind. Everything else should remain the same: when and how
frequently the fertilizer is applied, the method used to apply it, the location of the field,
the crop, the growing time, and so forth.


Some interventions cannot be performed for ethical or practical reasons. For exam-
ple, it obviously would be unethical to subject healthy individuals to major brain
damage or to diseases like syphilis in order to study their consequences. In other
cases, experimental intervention is impractical. Suppose you want to find out
how the distance of the Moon from the Earth influences the motion of the tides.
The most direct way would be to intervene on the distance of the Moon from the
Earth and see how the tides change in response, but no scientist can currently alter
the orbit of the Moon. And even if this were possible to do, this change would not
be surgical. Altering the orbit of the Moon would almost certainly have other effects on
Earth, and these changes may in turn have an effect on the tides. In such cases, when
ethical or practical considerations prevent a surgical intervention, scientists look
for ways to approximate a desired intervention. Different experimental and non-
experimental approaches do this in different ways, as we will see in the next section.

Controlling Variables
In ‘surgical’ interventions, conditions are created in which no variables, other than the
independent variable and the dependent variable, change when an intervention is per-
formed. So, another key feature of a perfect experiment is the full control of all extra-
neous variables. Full variable control is exceedingly difficult to accomplish. There are
always countless extraneous variables in an experiment, many of which scientists don’t
fully understand or aren’t even aware of. All of those extraneous variables need to be
controlled in order to avoid confounding variables, but it’s hard to control what you have
not identified!
Control over variables can be approached in a number of ways. These can be divided
into two broad categories: direct and indirect. Direct variable control is when all extrane-
ous variables are held at constant values during an intervention. Because the extraneous
variables are unchanging, they cannot be responsible for any changes to the dependent
variable. So, if direct variable control is successful, only the intervention can be responsible
for a change in the dependent variable.
Recall Newton’s prism experiments. Newton could directly control some extraneous
variables, like the time of day at which he ran his experiments and the lighting conditions
in his chambers. Keeping those variables constant ensured that, for example, any difference
in the composition of morning and afternoon light didn’t affect his findings. Newton also
attempted to control for the confounding influences of air bubbles and other impurities
in the prisms by using higher-quality prisms.
The carefully arranged conditions in today’s laboratories help scientists to directly con-
trol many variables. Temperature, cleanliness, lighting, noise, instructions to human sub-
jects—all of these factors and more are extraneous variables, and all should be held fixed
during an experiment. Consider again experiments conducted with the Large Hadron
Collider at CERN, the world’s largest laboratory. One important independent variable is
the energy of the proton-proton collisions. Dependent variables, which are measured and analyzed by
scientists at CERN, are features of the by-products of these collisions. During experiments,
scientists use sophisticated technologies to keep many variables under direct control, such
as the magnetic fields and temperature in the collider.
In many experiments, however, direct control of all extraneous variables is simply not
possible. As we have seen, scientists often don’t even know all the extraneous variables
that may be relevant. The second category of variable control, indirect variable control,
helps with this. The basic idea is to allow extraneous variables to vary in a way that is
independent from the intervention. Then, although extraneous variables will vary, they
should vary in a way that is the same for the different values of the independent vari-
able. Any systematic differences in the dependent variable between different values of
the independent variable can then be reasonably attributed to the independent variable.
The first step to indirect variable control is to set up two groups of experimental entities
(whether cells, plots of land, people, mice, or other subjects) to compare. The intervention
should be the only thing that distinguishes these groups from one another. One group,
the experimental group, receives the intervention to the independent variable. The other
group, the control group, experiences the default other value(s) of the independent vari-
able. And then, some approach is used to try to ensure that all extraneous variables affect
the two groups equally.
One approach to indirect variable control is randomization: the indiscriminate assign-
ment of experimental entities to either the experimental group or the control group.
Some method of group assignment is adopted so that no features of the experimental
entities can be taken into account, even unconsciously, in determining group member-
ship. This is meant to ensure that any differences among the experimental entities vary
randomly across groups and thus bear no relation to the systematic difference between
groups, the intervention. Many scientists believe randomization is the gold standard of
indirect variable control.
Randomization is one of the best approaches to indirect variable control, but it’s not
a surefire guarantee. It could happen that all patients with some characteristic—say, all
smokers—are randomly assigned to the experimental group, while all nonsmokers are
randomly assigned to the control group. In an experiment designed to test, say, the effect
of exercise on health, whether people smoke is surely a significant confounding variable.
This example is extreme, but there is a much more general point behind it. Random
group assignment guarantees extraneous variables are not related to group assignment,
but it does not guarantee that extraneous variables do in fact vary equally across the two
groups. Even with random assignment, the experimental and control groups may still
differ from one another in ways other than the intervention.
For this reason, there’s another condition that must be met for randomization to be an
effective approach to indirect variable control: the sample size must be sufficiently large.
Sample size refers to the number of individual sources of data in a study; often, this is
simply the number of experimental entities or subjects. If the sample size is very small,
chance variation between randomly assigned experimental and control groups is likely.
If the sample size is very large, such chance variation is exceedingly unlikely, so unlikely
that these variables can be considered effectively controlled.
Imagine an experiment that involves only four people, two of whom are smokers. It is
reasonably likely that both smokers will be randomly assigned to one group. Indeed, this
would happen one out of every three times they are randomly assigned to groups. Now
think about all of the variables among those four people: age, gender, medical history,
education level, and so on. It’s all but guaranteed that at least some of these extraneous
variables will be randomly distributed unevenly between the experimental and control
groups, becoming confounding variables. Imagine, in contrast, an experiment with a sam-
ple size of 10,000 people, roughly half of whom smoke. It is exceedingly unlikely—so
unlikely as to be virtually impossible—that all smokers would be randomly assigned to
one group. More generally, a large sample size helps to make sure that variation in all
extraneous variables is more or less equally distributed across randomly assigned groups.
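A small simulation makes the role of sample size vivid. The sketch below repeatedly splits a four-person group containing two smokers at random and counts how often both smokers land in the same group; it then performs a single random split of 10,000 people, half of them smokers. The setup mirrors the examples in the text, but the code itself is only an illustration.

```python
import random

def split_in_half(people):
    # Randomly assign people to two equal-sized groups.
    shuffled = people[:]
    random.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Four people, two of whom smoke (True = smoker).
trials = 100_000
same_group = 0
for _ in range(trials):
    group_a, _ = split_in_half([True, True, False, False])
    if sum(group_a) in (0, 2):  # both smokers ended up in one group
        same_group += 1
print(f"Both smokers in one group: {same_group / trials:.2f}")  # about 0.33

# Ten thousand people, half of whom smoke.
group_a, group_b = split_in_half([True] * 5000 + [False] * 5000)
print(f"Smokers per group: {sum(group_a)} vs {sum(group_b)}")  # nearly equal
```

The first figure comes out near 1/3, matching the calculation above, while the 10,000-person split typically leaves the two groups within a few dozen smokers of each other.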

Controlling for Bias


An important set of extraneous variables that must be controlled is human expectations.
As we saw with the Hawthorne effect or observer bias in Section 2.1, human experimental
subjects’ expectations or desires can confound the results of an experiment. Likewise,
scientists often harbor background beliefs or even specific expectations about the outcome
of an experiment. These are also extraneous variables that can readily become confound-
ing variables; recall from Chapter 1 the power of confirmation bias.
The strategies of direct and indirect variable control that we have talked about so far
don’t help with these kinds of extraneous variables. Recall the example of investigating
the effects of some exercise regime on health and, in particular, how randomization and a
sufficiently large sample size control for the extraneous variable of cigarette smoking (and
many others as well). If the researchers administering the tests used to evaluate health (the
dependent variable) know whether the subjects they are testing exercised or not, then this
knowledge and their expectations regarding the effects of exercise might subtly influence
their evaluation of subjects’ health. Randomization and large sample size are no help here.
To control for potential researcher bias, scientists sometimes design their experi-
ments so that not even they know which subjects are in the control group and which
are in the experimental group. This protocol is called a blind experiment. In the exercise/
health experiment, assignment to groups should be not only random but also blind;
researchers shouldn’t know which subjects are in which group. Then, when they test
a subject’s health, their expectations regarding the effects of exercise can’t influence
their judgments of that individual’s health, since they won’t know whether the subject
has exercised or not.
With a blind experimental setup, the researchers’ expectations cannot influence the
findings but the expectations of the experimental subjects might. Imagine you’re assigned
to the experimental group, and you dutifully exercise as assigned. You might be motivated
to work extra hard on the assessment of your health, or your expectation of good health
might decrease your blood pressure, or there may be some other unintended influence on
your health because of your expectation of the exercise’s effects. You might also simply
want to please the researchers by helping show they are right about the value of exer-
cise. This possibility is eliminated if both researchers and subjects are unaware of which
subjects are in which group. This is called a double-blind experiment.
Double-blind experiments are especially important for drug trials that test out new
medicines. If participants or experimenters expect a particular medicine to be effective,
then that expectation can directly lead to improved health. This is called the placebo
effect. For this reason, it’s important that neither experimenters nor experimental
participants know which participants receive the medicine being tested. The control
group receives a placebo, an inert substance or therapy. This way, no participants
can discern whether they are receiving the real medicine, and they will be equally
subject to the placebo effect. (This is, then, indirect control of the extraneous variable
of the placebo effect.)
Another way to control for participants’ expectations is with deception. Whereas
blinding involves omitting some piece of information, deception involves actively
misinforming participants to interfere with how their expectations influence their
behavior. The American social psychologist Stanley Milgram (1933–1984) often used
deception in his experiments. For instance, Milgram wanted to understand people’s
willingness to obey an authority figure who instructed them to inflict serious harm
on others. It probably wouldn’t have worked to tell the experimental participants this
was what was being tested. Few of us want to be viewed as inflicting harm on others
just because someone in power told us to! So, Milgram falsely told participants that
they were helping another person learn some material by quizzing the other person and
delivering electric shocks to them to punish any incorrect answers. In reality, there was
no other person learning, and no electric shock. The experimenters were simply study-
ing how far participants would go in harming others merely because they were told to.
(Ethics guidelines are more stringent now than they were then, and this study would
probably not pass muster today.)

Box 2.2 The Milgram Experiment


Milgram’s experiment involved three roles: learner, experimenter, and teacher.
Each subject waited in a lounge with another person whom they were led to
believe was a second subject. In fact, the second person was a confederate—an
actor pretending to be a subject—who was to play the role of learner. A third
person—yet another confederate—played the role of authoritative experimenter.
This person gave a brief, vague, and contrived explanation of the experiment
(not the real one). The supposed experimenter then pretended to randomly
assign the other two individuals to play the roles of teacher and learner. In
fact, the assignment was rigged; the naïve subject was always assigned to play
the role of teacher, and the second ‘subject’—Milgram’s confederate—was
always assigned to be the learner. The experimenter accompanied both indi-
viduals into a staged laboratory setting, using heavy restraints to strap the
learner into what appeared to be an electrified chair apparatus. To ensure that
the naïve subject believed that the chair was actually operative, the experimenter
delivered a real but mild shock.
The experimenter then led the subject to a separate room with what appeared
to be an electric shock generator. The machine had an instrument panel, consisting
of 30 horizontal switches, each labeled with a voltage. The labeled voltages
ranged from 15 to 450 volts. Switches were grouped into eight categories of
shock: slight, moderate, strong, very strong, intense, extreme intensity, danger:
severe shock, and, finally, just ‘XXX’. When a switch was flipped, a red light
turned on, an electric buzz sounded, and the voltage meter fluctuated. The
experimenter had the teacher (the naïve subject) administer a learning task of
four word pairs, which the learner was supposed
to learn. The experimenter instructed the subject to flip a switch for each wrong
answer, starting from 15 volt shocks and increasing for each error until the learner
had learned all the pairs correctly.
The dependent variable of the real experiment was the maximum shock
subjects were willing to administer before refusing to continue. What results
do you think Milgram obtained? Out of, say, 100 subjects, how many do you
think would have administered shocks up to the highest level when instructed
to do so? In Milgram’s first study, he found that, although many displayed
deep discomfort at doing so, a full 65% of subjects administered the highest
level of shock, marked ‘XXX’.


TABLE 2.1 Elements of the perfectly controlled experiment

1. Expectations are clearly articulated; if needed, concepts are defined operationally or
using cluster indicators.
2. An intervention is performed on the independent variable.
3. All other variables are controlled, either
(a) directly, by holding all other features constant, or
(b) indirectly, by comparing an experimental group to a control group, with
randomization and large sample size.
4. The experiment is blind or double-blind, as appropriate, to control for bias.

EXERCISES

2.13 List all the features of a perfectly controlled experiment. For each, say what is impor-
tant about that feature and what is challenging about accomplishing it.
2.14 Imagine you want to establish what effect, if any, taking notes on a laptop during
class instead of on paper has on retention of information.
a. Specify your hypothesis regarding the note-taking medium and memory. What
are your expectations for your experiment, given this hypothesis?
b. Describe your ideal experiment to test this hypothesis. Don’t worry about how
easy it would be to actually conduct the experiment or if it’s even possible.
Make sure to specify all the main features of the experiment.
c. Identify three major challenges to conducting the ideal experiment you have
described. Say why each is a problem.
2.15 Philosophy majors tend to perform very well on all of the main entrance exams required
by graduate programs and professional schools. They are the only major to score above
average on all four of the following: the Graduate Management Admission Test (GMAT),
the Law School Admission Test (LSAT), the verbal portion of the Graduate Record Exami-
nation (GRE), and the quantitative portion of the GRE. Philosophy majors are vying with
physics majors each year for the best comprehensive GRE scores, and they also have
had the highest average on the verbal portion of the GRE, second highest on the GMAT
(after mathematics), and third highest on the LSAT (after physics and economics).
Formulate three different hypotheses that are each compatible with these data.
Choose one of the three hypotheses, and design an experiment that could test it.
Make sure you specify the independent and dependent variables, the intervention,
your expectations for the findings if the hypothesis is true, and how you will control
for extraneous variables, including experimenter and subject bias.
2.16 We have discussed how Einstein’s theory of general relativity generates the expecta-
tion that light, just like matter, is affected by gravity. This expectation was surprising
in the sense that the theory predicted events that had not been observed before.
a. Why are surprising expectations, or novel predictions, important for testing
hypotheses?
b. How can surprising expectations, or novel predictions, be generated in sci-
ences like archaeology and paleontology that study the past?


c. How can surprising expectations be generated about events that have already
occurred or about data that scientists already have?
2.17 Suppose you want to test the hypothesis that baseball players who eat pizza every
day hit more home runs. Let’s suppose that to test this hypothesis, you want to divide
the baseball players of some team into two groups that are balanced in all important
background variables that can affect players’ performance. The only difference you
want between the two groups is that the members of one group eat pizza every day
and the members of the other group do not.
Rank the following four strategies from best to worst for accomplishing this goal:
1. Sit in the clubhouse after a game. The first players who enter the clubhouse are
assigned to the group of pizza eaters (the experimental group), while the fol-
lowing players are assigned to the control group.
2. Allocate players born in the first six months of the year to the experimental group
and players born in the second six months of the year to the control group.
3. For each player on the team, you toss a coin. If the coin lands on heads, then the player
is in the experimental group; otherwise, the player is assigned to the control group.
4. Assign all players over 230 pounds to the experimental group and the rest of
the players to the control group.
Justify each of your rankings by describing how well or poorly you expect that strat-
egy will control the extraneous variables.
2.18 What is the purpose of having an experimental group and a control group in an
experiment? How does division into two groups achieve this purpose?
2.19 Describe what randomization involves, why it can help to control for confounding
variables, and what its limitations are.
2.20 Define direct variable control and indirect variable control. Then, describe (a) how
each is accomplished and (b) the advantages and disadvantages of each approach.
2.21 The American Psychological Association (APA) code of ethics maintains that experi-
mentation may not involve deceptive techniques unless doing so has significant
prospective scientific, educational, or applied value and effective non-deceptive
alternative procedures are not feasible; that participants may not be deceived about
research that is reasonably expected to cause physical pain or severe emotional
distress; and that psychologists must explain any experimental deception to participants
as early as is feasible. Now, given these guidelines, think about Milgram’s (1963)
experiment, and answer these questions:
a. How were Milgram’s experimental participants deceived?
b. Was deception necessary for this study? Why or why not?
c. Evaluate the importance of this research. In your view, did this work justify
deception? Why or why not?

2.3 EXPERIMENTAL AND NON-EXPERIMENTAL METHODS

After reading this section, you should be able to do the following:

• Distinguish between lab and field experiments and identify the features of each
• Define external validity and internal validity and describe the importance of each
• Describe the main types of non-experimental design
• Assess the advantages and disadvantages of the various features of experiments and
non-experimental studies

Variation from the Perfect Experiment


The perfectly controlled experiment may be the ideal way to test hypotheses, but
such experiments are seldom if ever possible to perform. And even when a near-
perfect experiment is possible, circumstances can favor other approaches. So, real
experiments deviate from the ideal in a variety of interesting ways. The type and
degree of variation are influenced by the kind of phenomena under investigation,
the goal of the investigation, the nature of the hypothesis, what confounding vari-
ables are expected, and the types of experimental entities. When experiments can-
not be performed, there are a variety of non-experimental methods of empirical
investigation that may provide insight into phenomena of interest. We might call
investigations that use these methods, generically, non-experimental studies. In this
section, we describe a variety of experimental and non-experimental approaches
used to acquire scientific knowledge, indicating some of the main advantages and
disadvantages of each.
Non-experimental studies may be called for when performing an intervention
needed to investigate a hypothesis experimentally is unethical, impractical, or down-
right impossible. Suppose you are investigating whether major childhood stress
decreases life span. The relevant intervention, imposing on an experimental group of
children distressing conditions like parental death, extreme poverty, or poor nutrition,
would be morally repugnant. Other interventions are impractical. Space exploration
provides many straightforward examples. In 1975, two probes—Viking 1 and Viking
2—were launched to conduct experiments on Mars aiming to determine whether the
chemical makeup of Mars’s soil supports microbial life. A year after launch, the probes
landed and conducted their experiments, but they returned negative or inconclusive
results. The cost of designing and constructing a new probe, the time needed to travel
to Mars, and other limitations weighed against repeating the experiments. In the end,
NASA’s next successful Mars landing wouldn’t be for another two decades, in 1997, and
then nearly another 20 years passed before the Mars Curiosity rover became opera-
tional (at a cost of $2.5 billion). Finally, some interventions are literally impossible
to conduct because of the laws of nature. Astrophysicists and cosmologists have long
pondered the nature of black holes, which have such strong gravitational fields that
they bend the surrounding space-time, so that all light and matter spiral inescapably
into them. No one can possibly be in the right position to directly observe this, let
alone to intervene on it.

In the Lab or in the Field?


We have noted that some experiments occur in laboratories and others are field experi-
ments, occurring in the outside world. There are advantages and disadvantages to each
approach.


FIGURE 2.7 Mars Curiosity rover selfie taken on Mount Sharp (Aeolis Mons) on Mars in 2015

Laboratory experiments give researchers control over many aspects of the experiment,
specifically over any interventions performed and the direct and indirect control of many
extraneous variables. Depending on the nature of the experiments, a lab’s design features
may include constant temperature, sterile environment, special equipment to produce
unusual conditions, or, for experiments with human subjects, carefully selected lighting
and furniture, soundproofing, and experimenters’ confederates who behave in a specified
way. Those design features, and the control they provide, constitute one of the greatest
advantages of the laboratory. Laboratory conditions are designed to control extraneous
variables, to aid in detection and measurement of focal variables, and to create unique
situations that don’t often or ever occur outside the lab. These features can enable scien-
tists to discover regularities that are not easy to discern in the outside world.
The high degree of control enabled by laboratory conditions brings with it a high
degree of internal experimental validity. An experiment has high internal validity when
scientists can correctly infer conclusions about the relationship between the independent
and dependent variables with great certainty. This amounts to the absence of confound-
ing variables, achieved by direct or indirect control of all relevant extraneous variables.
A second advantage of laboratory experiments is that the experimental setup and data
analysis can follow predetermined, standard procedures, which make it easier to assess
and replicate an experimental finding.
However, there are also some disadvantages to lab research. To start with, some phe-
nomena are not easily investigated in a lab. Suppose you are investigating the effects of
climate change on large marine mammals. Specifically, you want to determine the effects
of elevated Arctic Ocean temperatures on the deep-diving behavior of narwhal whales.
Narwhals—the so-called unicorns of the sea because of their tusks—can dive as deep as
1.8 kilometers (6,000 feet) in Arctic waters. To directly investigate this phenomenon in a
lab, you will need—for starters—a huge tank of freezing salt water nearly two kilometers
deep. Good luck with that, right?
Furthermore, the same conditions that make it easy to directly and indirectly control
variables make the lab conditions different from the outside world, and that has some
disadvantages too. The artificiality of the experimental setting might mean that the results
obtained in the lab do not generalize well to real-life settings outside the lab. This is
problematic, since it’s ultimately the features of real-world phenomena that we want to
know about. Laboratories thus facilitate high internal validity, but potentially at the cost
of external validity. External experimental validity is the extent to which experimental
results generalize from the experimental conditions to other conditions—especially to the
phenomena the experiment is supposed to yield knowledge about.
External validity has two components: population validity and ecological validity.
Population validity is the degree to which experimental entities are representative of
the broader class of entities of interest. For experiments with human subjects, this is the
broader population they represent. The more representative a sample is of the broad class
or population, the more confident scientists can be of the experiment’s external validity.
Here’s an illustration of the importance of population validity. Many clinical trials test-
ing the efficacy and side effects of drugs are performed only on men, but the results are
expected to generalize to women as well. This decreases the population validity of the
results, since women and men differ in a number of medically relevant ways. There is thus
relatively limited experimental knowledge about the effects of some drugs on women, and
this may have serious consequences for health and medicine. Indeed, many prescription
drugs have been withdrawn from the market after they were belatedly revealed to pose
greater health risks for women than for men (Simon, 2005).
The second component of external validity, ecological validity, is the degree to which
experiment circumstances are representative of real-world circumstances. Experimental
settings or what subjects are asked to do can be artificial, unlike real-world circumstances,
in ways that impact the phenomenon under investigation. Consider again Milgram’s
experiment on compliance. How do you think the ecological validity of this experiment
rates? To answer this question, we need to consider how similar the situation encountered
in this experiment, administering electrical shocks to other people following instruction
from an authority figure, is to scenarios in which people are usually asked to comply.
Limited ecological validity is a reason to question an experiment’s external validity, that
is, its significance for the broader conclusions we want to draw from it.
Field experiments are conducted outside of a laboratory, in the participants’ everyday
environment. Researchers still manipulate an independent variable, and they still aim
to control extraneous variables. Often, this involves indirect control, perhaps with ran-
domization if circumstances allow. Field experiments are more prevalent in the social,
behavioral, and biological sciences than in physics and chemistry. The previously described
experiment on the effects of lighting conditions on the productivity of the workers at
Western Electric’s Hawthorne factory is one example of a field experiment.
Field experiments tend to have more external validity than lab experiments because
they occur in natural circumstances. Their ecological validity is higher as a result. The
experimental subjects are also likely to be a somewhat arbitrary subsection of the
broader population of interest, which increases population validity. The downside to
these advantages is decreased internal validity. Less influence over the circumstances and
the selection of experimental subjects is also linked to decreased control over extraneous
variables and sometimes a decreased ability to intervene in the desired way. Because ran-
domization may not be feasible in field experiments, the researchers should decide how
best to divide the subjects into control and experimental groups, which may introduce
confounds. Besides decreasing internal validity, this decreased influence on experimental
design also makes it more difficult for other researchers to replicate the experiment.
Researchers conducting field experiments may also be constrained in what they are in a
position to observe or measure, the number of subjects they can involve, and how
long they can run the experiment. Many field experiments, for example, require special
permissions from individual subjects or from authorities that control access to areas like
nature preserves. Gaining these permissions can be difficult, and authorities can impose
limitations on researchers. Uncontrollable events like inclement weather or warfare can
disrupt observation or limit the length of study that’s feasible.
Let’s see how these features play out in a real field experiment. In their study entitled
‘Women as policy makers’, Raghabendra Chattopadhyay and Esther Duflo (2004) inves-
tigated how women village council leaders, or pradhan, might affect the social services
provided by councils in India. This experiment was possible because of an Indian consti-
tutional amendment in 1993, calling for one-third of pradhan positions to go to women.
Thus, the experimenters had no say in the assignment of pradhan positions to women,
as this was established by the Indian government. This also means the intervention was
not implemented by the researchers, but the law was structured so that the change in
leadership was randomly implemented across villages, mimicking a surgical intervention.
Data were collected on 265 village councils in West Bengal and Rajasthan. In each
village council, the two researchers collected the minutes of village meetings and inter-
viewed the pradhan. They also collected data from each village about social services,
infrastructure, and complaints or requests that had been submitted to the village council.
The pradhans’ policy decisions and villagers’ requests were not affected by their interac-
tions with the experimenters, since those requests and decisions had already been made at the
time of data collection. It was found that women policy makers (independent variable)
had important effects on social service policy decisions (dependent variable). Women
pradhan invested more in the social goods that were more closely connected to women’s
concerns in a village: drinking water and roads in West Bengal and drinking water in
Rajasthan. They invested less in public goods connected to men’s concerns: education in
West Bengal and roads in Rajasthan.
Choices in Variables, Sample Size, and Groups


Just as there are trade-offs between lab and field experiments, there are many
trade-offs among other elements of experimental design
as well. First, experimenters choose the independent and dependent variables and then
decide how best to intervene on the independent variable. Selecting the right variables
is crucial for successful experimentation, but it’s sometimes not obvious how best to
proceed. In fact, in several scientific disciplines, including climate science, macroeco-
nomics, neuroscience, and psychiatry, the worry has been raised that experiments are
often conducted using the wrong variables. The wrong variables may not allow proper
intervention or accurate measurement. For example, current classifications of psychiatric
disorders, such as schizophrenia, are criticized for being too broad and coarse-grained.
Because psychiatric classifications lump together several different psychiatric symptoms
and variables that may have little in common, experiments based on such classifications
may not provide reliable evidence for diagnosis and treatment.
One approach to variable choice is to select variables that correspond to properties
or quantities that are well-defined targets for intervention and measurement. These are
usually easy to intervene upon and measure (things like the pressure and volume of a gas,
hours slept, and grade on a math test). Another approach is to select independent variables
based on what you can intervene on ‘surgically’, manipulating their values independently
of the values taken by other variables. A third approach is to focus on macro-variables
that aggregate measurable variables in a meaningful way. In climate science, for example,
temperature and atmospheric pressures at sea level are measured at various locations
around the Earth’s surface, and then these are aggregated to form macro-variables. With
the right aggregation procedures, these macro-variables stand in relations that can be
captured in climate models, which are used by climatologists to formulate more reliable
and stable predictions than could be made about individual temperature or pressure
measurements (Woodward, 2016).
A second choice in experimental design is sample size. In describing the perfectly
controlled experiment, we described how randomization as an approach to indirect vari-
able control is successful only if the sample size is adequately large. In general, a larger
sample size increases the success of indirect control of extraneous variables, thus increas-
ing the experiment’s internal validity. In this way, larger sample sizes make it more likely
that experimental results are actually dependent on only the experimental intervention.
But these considerations must be balanced against the downsides of large sample sizes.
Large samples are more difficult to assemble, and they can be more difficult to manage
in the experiment. In some cases, data collection and analysis are also more difficult for
a larger sample. These drawbacks are practical. A different kind of problem is that a large
sample size increases the chance of spurious findings. Just as large samples make it easier
to discern the intervention’s effects on the dependent variable, they also make it easier to
discern other kinds of differences in the dependent variable. This increases the chance
of a confounding variable influencing the dependent variable in a way that impacts the
experimental results.
A third choice in experimental design regards group assignment. Randomization is
one particularly effective way to assign experimental entities to experimental and con-
trol groups. But sometimes randomization isn’t possible for practical or ethical reasons.
If you’re studying the effects of gestational diabetes on fetuses, for example, you can’t
simply assign subjects to mothers with or without gestational diabetes (the independent
variable). And it’s not ethical to randomly assign pregnant women to experimental condi-
tions aimed to increase the chance of developing gestational diabetes.
Other methods can be used to control variables when randomization isn’t feasible.
One method is to restrict participation in an experiment to experimental subjects with
the same levels of some extraneous variable. For example, suppose that age and smoking
are the two extraneous variables of greatest concern in an experiment aimed to test the
relationship between cholesterol level and heart disease. Randomization is not possible
here, or at least not ethical, but the extraneous variables of age and smoking can be
controlled by restricting admission into the experiment to subjects who are non-smokers
age 30–50. This method is simple. However, it decreases the achievable sample size and lim-
its the external validity of the experimental findings (due to decreased population validity).
Another approach is to use data about extraneous variables and their effects in order
to account for their influence on the dependent variable. For example, in a landmark
study known as the Harvard Six Cities Study, researchers investigated the effects of air
pollution on health (Dockery et al., 1993). During the 1980s and 1990s, different areas
in the US had very different levels of air pollution. The researchers studied 8,000 experi-
mental participants living in six cities in different areas, including Boston, an industrial
area in Ohio, and rural Wisconsin. Participants’ health was monitored for 20 years and
compared with air pollution measurements in the six cities. The researcher used statis-
tics regarding the health effects of socioeconomic factors, demographics, and smoking to
estimate the likely effects of those extraneous variables on participants’ health. This was
a way to indirectly control for those variables, even if there were systematic differences
in how they affected participants in different studies (the different groups in the study).
The researchers found that, taking all these other variables into account, decreased air
pollution is linked to increased life expectancy.
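The general idea behind this kind of statistical adjustment can be sketched with a toy regression. The numbers below are made up, and the actual Six Cities analysis used far more sophisticated statistical models; the sketch only shows how including a known extraneous variable in the analysis changes, and here corrects, the estimated effect of the variable of interest.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
pollution = rng.uniform(0, 1, n)  # exposure of interest
# An extraneous variable correlated with the exposure (made up).
smoking = (pollution + rng.normal(0, 0.3, n) > 0.5).astype(float)
# Made-up 'true' relationship: both pollution and smoking harm health.
health = 80 - 5 * pollution - 10 * smoking + rng.normal(0, 2, n)

# Regress health on pollution alone, then on pollution and smoking together.
X_naive = np.column_stack([np.ones(n), pollution])
X_adjusted = np.column_stack([np.ones(n), pollution, smoking])
coef_naive = np.linalg.lstsq(X_naive, health, rcond=None)[0]
coef_adjusted = np.linalg.lstsq(X_adjusted, health, rcond=None)[0]

print("pollution effect, unadjusted:", round(coef_naive[1], 2))             # far from -5
print("pollution effect, adjusted for smoking:", round(coef_adjusted[1], 2))  # near -5
```

Because smoking is correlated with pollution in this toy data, the unadjusted estimate wrongly attributes some of smoking’s harm to pollution; adding smoking to the regression recovers something close to the true effect.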
Yet another approach to indirect variable control is to match the members of the
experimental and control groups so the groups don’t differ in the values of known extra-
neous variables. This involves matching every subject in the experimental group with a
subject in the control group, based on knowledge of how certain extraneous variables,
such as age and smoking history, affect individual subjects. For example, researchers might
include pairs of smokers of the same age and pairs of non-smokers of the same age in
their study. One member of each pair should experience the experimental condition (say,
complete an exercise regime) and the other should experience the control condition (say,
exercise as they ordinarily would). In this way, groups of subjects can be made similar
with respect to the primary extraneous variables, thereby indirectly controlling them.
This method is often effective, but it has some limitations. It only works for extraneous
variables researchers are already aware of. It can also be time-consuming and expensive
to find matched subjects, and this may limit the sample size.
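A minimal sketch of matched assignment might look as follows. The subject records, the age brackets, and the matching key are all invented for the example; real studies would match on whatever extraneous variables are known to matter.

```python
import random

def match_key(subject):
    # Match on the known extraneous variables: age bracket and smoking.
    age_bracket = subject["age"] // 10  # e.g., age 34 falls in bracket 3
    return (age_bracket, subject["smoker"])

def matched_groups(subjects):
    # Pool subjects with identical matching keys, then pair them off,
    # sending one member of each pair to each group.
    pools = {}
    for s in subjects:
        pools.setdefault(match_key(s), []).append(s)
    experimental, control = [], []
    for pool in pools.values():
        random.shuffle(pool)
        while len(pool) >= 2:
            experimental.append(pool.pop())
            control.append(pool.pop())
    return experimental, control  # unmatched leftovers are excluded

subjects = [{"age": random.randint(30, 50), "smoker": random.random() < 0.3}
            for _ in range(20)]
exp_group, ctrl_group = matched_groups(subjects)
print(len(exp_group), "matched pairs formed")
```

Notice that any subject without a match is dropped, which is one way this method shrinks the achievable sample size.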
A fourth choice in experimental design concerns how many groups to include in an
experiment. So far, we have focused on experiments with two groups: an experimental
group and a control group. More complicated experimental designs include multiple
experimental groups, each of which experiences a different but related intervention. We
saw an example of this in the Harvard Six Cities study. There were six different groups,
each corresponding to a city with some measured value of air pollution. Participants
were assigned to groups simply according to which city they lived in. Including multiple
experimental groups can be enlightening but also complicates experiments, making them
more difficult to perform. Such designs also make it more difficult to get adequately large sample
sizes for each group, which leads to the drawbacks we’ve already discussed. And finally,
multiple groups can make analysis of the results more difficult.

The Cholera Outbreak of 1854


Most of the variations in experimental design we have discussed involve compromises
away from the aim of surgical intervention and full control of extraneous variables. Let’s
move on to discuss methods of observation employed in non-experimental studies, when
intervention and variable control are significantly compromised or impossible. Most vari-
eties of non-experimental scientific study are observational studies, which involve col-
lecting and analyzing data without performing interventions or controlling extraneous
variables.
One example of an observational study is John Snow’s investigation into the source of
a cholera outbreak in London, England. Cholera epidemics ravaged London in the mid-
19th century, with notable outbreaks in 1831–1832 and again in 1849. Snow studied
these outbreaks, recording the details of dozens of cases. Because his research seemed to
indicate that cholera was transmitted from person to person, Snow wanted to find out
how it was transmitted. Previous reports suggested that cholera began with ‘an affec-
tion of the alimentary [digestive] canal’. From this, Snow hypothesized that cholera was
transmitted through the inadvertent ingesting of ‘morbid material’ from the vomit and
‘evacuations’ of cholera patients.
Then, on Thursday, August 31, 1854, cholera hit London’s Soho district. The out-
break appeared to be concentrated in certain areas. One such area was the corner of
Broad and Cambridge Streets, where more than 100 neighbors died in three days.
Three-quarters of the neighborhood residents fled within a week, but hundreds more
died nonetheless. Snow reported that ‘within 250 yards of the spot where Cambridge
Street joins Broad Street, there were upwards of 500 fatal attacks of cholera in 10 days’
(Snow, 1855).
At this intersection, there was a water pump from which locals could draw water.
Snow’s own observations of the pumped water led him to note that it looked abnormal.
Given his prior reasoning about cholera transmission, Snow began to suspect that the
pumped water contained ‘morbid material’. He learned that of the 89 cases of deceased
cholera victims, 61 were known to have consumed water from the Broad Street pump.

FIGURE 2.8 Cholera epidemic, close-up of Snow’s Broad Street map

This was suggestive evidence. However, there was an apparent anomaly—that is, a
phenomenon that deviates from the expectations of a theory or hypothesis. One detail
didn’t fit the pattern suggested by Snow’s hypothesis: very near the Broad Street pump
was a brewery, but none of the more than 70 brewers had died from cholera. This was
puzzling.
Snow had the Broad Street pump handle disabled seven days after the outbreak began.
Even though the epidemic had already begun to fade, he was convinced of having reasoned
correctly from his detailed observations:

Whilst the presumed contamination of the water of the Broad Street pump with
the evacuations of cholera patients affords an exact explanation of the fearful out-
break of cholera in St. James’s parish, there is no other circumstance which offers
any explanation at all, whatever hypothesis of the nature and cause of the malady
be adopted.
(1855, p. 54)

In other words, Snow could think of nothing else that could account for the outbreak’s
features, other than the hypothesis of contaminated water from the Broad Street pump.
Snow was right. It was later discovered that the well serving the Broad Street pump
had been dug only a few feet away from an old cesspit, which had begun to leak fecal
bacteria. The lack of cholera deaths among brewers turned out to be further evidence in
favor of Snow’s inference; the brewers only drank their own beer, which used water from
their own well, water that was sterilized in the beer-brewing process.
In this study, Snow did not perform an intervention, control variables, and then study the
results. What he did was assemble a system of detailed observations and reason his way
to the one hypothesis that best explained those observations.

Case Studies and Natural Experiments


One form of observational study that is very different from a controlled experiment
is a case study, a detailed examination of a single individual or system in a real-life
context. Case studies allow researchers to gain a first-hand qualitative understand-
ing of a phenomenon as it occurs in its specific context and from various sources of
data—including perhaps observations of a person’s daily routine, unstructured inter-
views with participants and informants, letters, e-mails, social media activity, health or
archival records, and physical artifacts. Case studies are frequently employed within
the context of qualitative research in epidemiology, psychiatry, education, ethnography,
and other social sciences.
One of the most famous case studies in science is in neuropsychology. Phineas Gage
was an American railroad construction foreman. In 1848, he was helping to manage the
construction of the Rutland and Burlington Railroad, located near Cavendish, Vermont.
While he was using blasting powder to blast away a rock, an iron tamping rod, measuring
1.1  meters long and almost 3.2 centimeters in diameter, was blasted through Gage’s
skull. The tamping rod entered through his left cheekbone and erupted through the top

front of his head, ultimately landing about 25 meters away. The rod destroyed much of
his brain’s left frontal lobe, but Gage survived (Harlow, 1848).

FIGURE 2.9 Phineas Gage posing with the rod that passed through his skull

In 1868, Dr. John Harlow, one of the physicians attending Gage, reported on the
patient’s mental condition after this accident. He described Gage as ‘fitful, irreverent,
indulging at times in the grossest profanity’, ‘manifesting but little deference for his fel-
lows’, and ‘at times pertinaciously obstinate’. He claimed that this was a radical change
for Gage after the accident, ‘so decidedly that his friends and acquaintances said he was
no longer Gage’ (1868, p. 277). Overall, the damage seems to have resulted in a major
degradation of, among other things, Gage’s social skills.
Since the 19th century, neurologists, neuropsychologists, and cognitive neuroscientists
have studied the case of Phineas Gage to understand the role of the frontal cortex in
social behavior. But it has been difficult to make precise inferences from this case, since
the immediate damage to Gage’s frontal cortex was so extensive, with surgical repairs
and subsequent infections complicating matters further. Another complicating factor is,
of course, that there is just one instance of Gage’s injury; a single case study creates no
opportunity for variable control or the observation of how different instances play out.
For these reasons, although case studies can provide a rich body of qualitative infor-
mation, they have limited internal and external validity. A case study’s internal validity
is limited by the lack of control over extraneous and confounding variables. Case studies
are also particularly vulnerable to bias, given their reliance on qualitative data and the
absence of blinding. And because the research focuses on only one individual, event, or group, results
can be difficult to replicate and to generalize.

Every now and again, nature yields a case that can play the role of an experiment. These
so-called natural experiments occur when an intervention on an independent variable
occurs naturally in real life without any experimenters doing anything. This very thing
happened in the case of Phineas Gage and also in other famous cases from the history of
neuropsychology, like the case of Louis Leborgne.
When he was about 30 years old, Louis Leborgne lost the ability to speak. He could
utter only a single syllable, tan, which he usually repeated twice in succession, giving
rise to his nickname ‘Tan Tan’. Apart from his inability to speak, Leborgne exhibited no
symptoms of physical or psychological trauma. He could understand other people, and
his other mental functions were apparently intact. After Leborgne died at the age of 51
in a hospital in Paris in 1861, the French physician Paul Broca performed an autopsy,
and found that Leborgne had a lesion in the frontal lobe of the left cerebral hemisphere
(which later came to be known as ‘Broca’s area’). This case is a kind of natural
intervention. The variable of interest—the state of that brain region—was not deliberately
manipulated, but the damage played the role of an intervention, and there was no evidence
of confounding variables associated with it. Broca
used this case to identify a brain region important for the articulation of speech; injure
Broca’s area, and an inability to produce speech—that is, Broca’s aphasia—would ensue.
Leborgne just happened to suffer the very kind of brain damage that could make clear
the function of that area of the brain.
Sometimes, even groups of individuals just happen to get sorted—naturally and
without any scientific intervention—into something approximating experimental and
control groups. Some natural or historical process separates them out, such that one
group but not the other can be construed as receiving an experimental treatment or
condition. The Indian councils and Harvard Six Cities studies discussed earlier are
examples of natural experiments. Their conditions approximated experiments well
enough that we described them as such, but really the experimenters were not in the
position to intervene.
Another example of a natural experiment on experimental and control groups occurred
with the separation of the Korean territory and population into two sovereign nations.
When the Korean War ended in 1953, the armistice left the peninsula divided roughly in half. Many aspects
of the resulting two nations—South and North Korea—have remained similar. For exam-
ple, both nations have a shared history, and they have similar geographies, climates, lan-
guages, and cuisines. But they differed in one main respect: political regime. North Korea
adopted single-party state socialism, headed by a totalitarian military dictatorship, whereas
South Korea eventually became a multi-party liberal democracy.
The separation of the Korean population into two groups is often described as a large-
scale natural experiment, in so far as the political regime (independent variable) seems
to be related to many observable differences between the two nations. These differences
include changes in economy, infrastructure, religion, education, and health. By 2010, the
difference in infant mortality, an indicator of population health, was striking: 3.8 deaths
per 1,000 births in South Korea but 27.4 deaths per 1,000 births in North Korea. By
2011, life expectancy in South Korea was 77.5 years for men and 84.4 for women but
only 65.1 years for men and 71.9 for women in North Korea (Khang, 2013). The differ-
ences are even visible from space: night-time satellite images show a brightly lit South and
an almost dark North, reflecting the gap in per capita power consumption (more than
10,000 kilowatt hours in South Korea, less than 750 kilowatt hours in North Korea).

Yet another example of a natural experiment comes from an investigation in which
researchers tracked the development of 65 children in order to study the effects of
institutional upbringing on later attachments (Hodges & Tizard, 1989). The participants,
assessed at age 16, had lived in residential nurseries and institutional care from infancy
until at least two years of age, when most of them were either adopted or restored to
their biological parents. A comparison group was also studied, consisting of
children who had been with their families all their lives. So, the independent variable in
this study—the children’s environments—varied because of an accidental course of events.
Researchers could study the effect of this change on the children’s later social relation-
ships (the dependent variable). It was found that parental deprivation at early ages did
not necessarily prevent children from forming strong and lasting bonds to parents once
they were placed in a family. Whether such bonds developed depended on the later family
environment. Yet, because of their early institutional experience, children did have more
difficulty in socializing with peers and developed fewer close relationships.

Studies Extending Over Time


Some observational studies extend over time. These can be critical to understanding, for
example, the long-term effects of treatments. An important observational method for
such studies is the cohort study, where researchers select a group of subjects according
to set parameters, and then track those subjects over time, at set intervals, to observe the
effects of some condition they experience. Cohorts can have fixed membership, as with
the people in Sierra Leone, Liberia, or Guinea during the West African Ebola virus epidemic
of 2014–2015, for example, or can have changing membership, as with double majors in
public universities or state organ donor registries. In either case, the cohorts are deter-
mined by some property of interest.
Cohort studies include retrospective and prospective studies, or backward-looking
and forward-looking studies. In a retrospective study, researchers first identify a group
of subjects who have the property of interest, and then investigate their past in an
attempt to identify the cause of that property. A common use of retrospective studies is
in epidemiology, in which subjects are grouped according to their exposure status and
incidence of disease, and then compared using available data about them. John Snow’s
cholera investigation was like this. In a prospective study, researchers still identify a group
of subjects with some property of interest but then track their development forward in
time to check the effects of that property. The Harvard Six Cities study was like this.
Longitudinal research is another approach that tracks subjects over time. In a longi-
tudinal study, the same subjects are measured repeatedly over a period of time, some-
times many years, allowing the researchers to track subjects’ change. A benefit of such
diachronic studies is that they can reveal changes over time in the characteristics of a
group of subjects. The Early Childhood Longitudinal Study started in the late 1990s and
followed 20,000 American children, examining their development, performance at school,
and early school experience. Researchers also conducted extensive interviews with their
families. This study provides a lot of information about American children’s development
and family life. Analyzing this longitudinal data, the economists Steven Levitt and Stephen
Dubner (2005) showed that many things that parents do to make their kids ‘smarter’
do not seem to actually help children do well on tests. Reading to kids every day, for
example, does not relate to higher test scores. Higher test scores are strongly related to
being born to a mother over 30, but not to a mother taking time off to raise the child.
In a cross-sectional study, different subjects are measured at a single time in order to
get a sense for the prevalence of some trait(s) in the population at large. For example, a
cross-sectional approach to studying children’s development and family life would involve
assessing the kinds of variables just discussed—family characteristics, reading exposure,
test scores, and so on—at once. One advantage of cross-sectional studies is that they enable
researchers to measure and compare several variables. They are also easier to accomplish,
as there is no need to track individuals over time. But the information they provide is
correspondingly more limited and perhaps less accurate. For example, instead of assess-
ing whether kids are read to every day based on subjects’ actual experiences, researchers
must rely on their memories of earlier years.

Studies Using Big Data


A different category of non-experimental studies uses so-called big data. Big data are
very large data sets that cannot be easily stored, processed, analyzed, and visualized with
traditional methods. Big data sets are especially interesting because they can reveal unex-
pected patterns, trends, and associations relating to human behavior. A number of fields
of science use big data to understand, for example, the factors influencing the climate,
genetic disease prevention, and business trends, among many other phenomena.
Social media, business transactions, cameras, audio files, e-mails, and the internet,
more generally, have produced an ever-increasing stream of data in recent years. It’s
been estimated that humanity accumulated 10 times as much data between 2006 and 2011
as had been accumulated between the advent of writing thousands of years ago and 2006,
and this amount is expected to quadruple every three years (Floridi, 2012). The data
range from the 500 million tweets per day on Twitter to the tremendous amounts produced
each year by the extremely sensitive detectors of the Large Hadron Collider at CERN.
Supercomputers and machine learning techniques, also known as data analytics, are used
to manage and mine these large data sets. Machine learning techniques help researchers
compress and visualize big data sets in charts or graphs; they also help filter data sets so as
to allow researchers to draw conclusions about their characteristics. Imagine, for example,
that you want to determine general trends in food preferences, and you have a data set
containing all tweets produced in one year at your disposal. Filtering those tweets to a subset
relevant to food preferences is extraordinarily valuable, as is visualizing the data about the
popularity of various foods.
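To make the filtering idea concrete, here is a minimal sketch in Python. The keyword list, sample tweets, and function name are all invented for illustration; a real analysis would use vastly larger data sets and more sophisticated language processing, but the logic is the same.

    from collections import Counter

    # Invented keyword list and toy 'data set' of tweets, for illustration only.
    FOOD_KEYWORDS = {"pizza", "sushi", "tacos", "ramen", "salad"}

    tweets = [
        "Best pizza of my life tonight!",
        "Traffic on I-75 is terrible again",
        "Homemade ramen > restaurant ramen",
        "tacos tacos tacos",
    ]

    def food_mentions(tweets):
        """Filter tweets down to food-related words and tally their popularity."""
        counts = Counter()
        for tweet in tweets:
            for word in tweet.lower().split():
                word = word.strip(".,!?")
                if word in FOOD_KEYWORDS:
                    counts[word] += 1
        return counts

    print(food_mentions(tweets))   # Counter({'tacos': 3, 'ramen': 2, 'pizza': 1})
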
The patterns and trends uncovered by analyzing big data can give insight into relation-
ships among variables of interest and can be used to make predictions. One well-publicized
example of ambitious research based on online data is the long-term analysis of user data
from the online dating website OKCupid (Rudder, 2014). But it can be difficult to assess
big data research, and some are concerned that it’s taken more seriously than it should be.
In 2008, researchers from Google claimed that they could immediately predict what regions
experienced flu outbreaks based simply on people’s online searches. The idea was that
when people are sick with the flu, they often search for flu-related information on Google.
Unfortunately, this idea wasn’t borne out. Google Flu Trends made very inaccurate predic-
tions, significantly overestimating flu outbreaks, and was shut down (Lazer et al., 2014).


Perhaps the biggest challenge facing big data techniques is their opacity. The algorithms
used to sample, filter, and order data are often unknown to outside researchers, and the
people who create the data in the first place are generally unknown to even the research-
ers performing the investigation. This makes it difficult to assess study procedures, the
significance of the data, and the possibility of confounding variables. Another challenge
with big data techniques regards population validity (see Section 2.2). Many people in
the world don’t use any social media, so those who do may not be representative of the
broader population, and more nuanced versions of this problem exist for any particular
form of online data. There are issues with privacy too. Online data are often in the public
domain, but big data research publicizes data and reveals trends that the people respon-
sible for the data may not be comfortable with. The publication of OKCupid user data
was an instance of this issue widely discussed in the popular press.
These challenges do not erase the scientific value of big data though. And the analysis
of data can even help us better understand how science works. For example, in the field
of library and information science, bibliometrics is used to understand the dissemination
and production of literary work by analyzing big data sets of written publications. This
approach is also directed to scientific publications. Bibliometric methods, including the
analysis of networks of citations in published work, can be used to investigate the level
of productivity of a certain field of research, trends in the topics of scientific research,
and even the social dynamics underlying scientific practice. The number of citations of
a published article is an index of recognition, which is one of the primary rewards for
scientists. So, citation rates and patterns can be used to quantify scientific impact and to
predict what factors might affect the future course of science.
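As a toy illustration of citation counting, the Python sketch below tallies incoming citations in a small, entirely invented citation network; real bibliometric work applies the same logic to databases of millions of publications.

    from collections import Counter

    # Invented citation network: each paper (key) cites the papers in its list.
    citations = {
        "Paper A": ["Paper B", "Paper C"],
        "Paper B": ["Paper C"],
        "Paper C": [],
        "Paper D": ["Paper A", "Paper C"],
    }

    # Count incoming citations: a crude index of recognition.
    cited_counts = Counter(cited for refs in citations.values() for cited in refs)

    for paper, count in cited_counts.most_common():
        print(paper, count)
    # Paper C 3
    # Paper B 1
    # Paper A 1
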

Other Kinds of Interventions


Most of the approaches to non-experimental studies we have discussed are observa-
tional studies. Big data studies are, perhaps, the exception. We’ll conclude this chapter by
briefly discussing another approach that, in a way, still employs some of the ideas behind
experimentation. This is the indirect study of a phenomenon by studying interventions
on something similar.
The most significant form this approach takes in science is when models are devel-
oped and studied in order to learn about phenomena of interest. This is the topic of
Chapter 3. For now, we’ll just mention one example. One extension of, or replacement for,
experimentation is accomplished with the use of computer models or simulations. Such
computer simulations can play a role analogous to experiments. Computer programs are
developed that use algorithms to mimic the behavior of a real-world system. For example,
computer simulations of the Earth’s climate represent the dynamic interactions of solar
energy, chemicals in the atmosphere, oceans, landmasses, ice, and other factors. Such simu-
lations can then be studied to yield insight into real phenomena such as anthropogenic
climate change. Interventions can be performed in a simulation of the climate system
that would be undesirable or impossible to actually perform in Earth’s climate system.
For example, climate scientists might investigate what a specific increase of the amount
of carbon in the atmosphere would do to the rate of glacier melt.
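As a drastically simplified sketch of how such an intervention works, the following Python snippet implements a zero-dimensional energy-balance ‘climate model’ and intervenes on its effective emissivity, a crude stand-in for greenhouse gas concentration. The numbers are rough illustrative values, and a real climate model tracks far more variables (including glacier melt) than this toy does.

    # Zero-dimensional energy balance: the planet warms until outgoing
    # radiation matches absorbed sunlight. Lowering the effective emissivity
    # mimics a stronger greenhouse effect. All values are rough illustrations.
    SIGMA = 5.67e-8   # Stefan-Boltzmann constant (W m^-2 K^-4)
    SOLAR = 1361.0    # solar constant (W m^-2)
    ALBEDO = 0.30     # fraction of sunlight reflected back to space

    def equilibrium_temp(emissivity):
        """Temperature (K) at which emitted radiation equals absorbed sunlight."""
        absorbed = (1 - ALBEDO) * SOLAR / 4
        return (absorbed / (emissivity * SIGMA)) ** 0.25

    baseline = equilibrium_temp(0.612)    # roughly Earth-like: about 288 K
    intervened = equilibrium_temp(0.600)  # intervention: stronger greenhouse
    print(f"warming from intervention: {intervened - baseline:.2f} K")
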
Another extension of the concept of intervention is to our rich imaginations. Thought
experiments are devices of the imagination that scientists sometimes use to learn about
reality. Thought experiments involve an imagined intervention on a system. In the right
conditions, these can be used to test a hypothesis, to show that nature does not conform
to one’s previously held expectations, and to suggest ways in which expectations can
be revised. Just like experiments in a lab or in the field, thought experiments may be
criticized because their setup is faulty or because scientists draw unjustified conclusions
from them.
Galileo used many thought experiments in his investigations of physics and astronomy.
In one instance, he wished to investigate an idea of Aristotelian physics that objects with
different weights fall at different speeds. Galileo asked his readers to assume, as Aristotle
did, that heavier objects fall faster than lighter objects. He then imagined two objects, one
light and one heavy, connected to each other by a string and dropped from the top of a
tower. If Aristotle’s assumption were correct, then the string would pull taut as the heavier
object fell faster than the light object, and the lighter object would act as a drag, so the
connected pair should fall more slowly than the heavy object alone. But, Galileo reasoned,
both objects together are heavier than the heavy object. So, for Aristotle, the two objects
together should actually
fall faster than either object alone. These objects cannot simultaneously fall both faster and
slower, so the Aristotelian idea that was the starting point for this reasoning process could
not be right. Galileo’s thought experiment provided a refutation of the Aristotelian theory
of motion, suggesting that the speed of a falling body is not dependent on its weight.

FIGURE 2.10 Isaac Newton’s cannon thought experiment

Newton also used thought experiments to help show how his theory of gravitation
worked. He had readers imagine a cannon at the top of an extremely tall mountain, and
then asked what would happen if somebody loaded the cannon with gunpowder and fired.
Plausibly, Newton reasoned, the cannonball would follow a curve, falling faster and faster
because of gravity’s force, and would hit the Earth at some distance from the mountain.
But what if one used more gunpowder? The velocity of the cannonball would be greater,
and it would travel farther before falling back to Earth along a curved trajectory.
And if one used vastly more gunpowder, then, Newton suggested, the cannonball would
travel so fast that it would fall all the way around the Earth, never landing. The cannonball
would be in orbit, going around again and again just like the Moon! This is pictured in
Figure 2.10. If the cannonball went even faster still, it would escape Earth’s gravity
entirely, heading out into space. Newton’s theory of gravitation provided the resources to arrive at
these same conclusions through mathematical calculations. Imagining this situation gives
a satisfying, intuitive sense for how an object like the Moon can stay in orbit by remain-
ing in constant free fall.
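The simplest of those calculations is easy to reproduce. Ignoring air resistance, a cannonball skimming just above the Earth’s surface stays in circular orbit when gravity supplies exactly the required centripetal acceleration, so its speed must satisfy v = √(gR). Here is a quick check in Python, using standard textbook values for g and R:

    import math

    g = 9.81       # surface gravitational acceleration (m/s^2)
    R = 6.371e6    # Earth's mean radius (m)

    # Circular orbit at the surface: g = v^2 / R, so v = sqrt(g * R).
    orbital_speed = math.sqrt(g * R)
    # Escaping Earth's gravity entirely requires sqrt(2) times that speed.
    escape_speed = math.sqrt(2 * g * R)

    print(f"orbit: {orbital_speed / 1000:.1f} km/s")    # about 7.9 km/s
    print(f"escape: {escape_speed / 1000:.1f} km/s")    # about 11.2 km/s
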

EXERCISES

2.22 Recall the ideal experiment you described in Exercise 2.14 and the three challenges
to that experiment you identified. Describe an alternative experiment that is more
practical but that still can successfully test your hypothesis.
2.23 Describe a different approach to the experiment you described in 2.22. Then list
the advantages and disadvantages of each approach, with an eye to the trade-offs
among features of experiments described in this section.
2.24 Recall, from Section 2.1, the experiment in which participants divide $10, with one
person offering some division and the other only being able to accept or reject
the offer. (Rejecting the offer results in neither participant getting any money.) The
finding was that people offered fairer divisions and also rejected divisions deemed
unfair even though this resulted in no money won. The researchers concluded that, in
general, people seem to be willing to sacrifice self-interest to promote fairness. In this
experiment, participants haven’t previously interacted with one another, and they
don’t interact with the same participant more than once. Let’s assume participants
are randomly selected and randomly assigned to roles.
a. Define internal validity, and assess this experiment’s internal validity, justifying
your assessment.
b. Define external validity and name and define each of its two components.
Assess this experiment’s external validity, justifying your assessment.
c. What was the researchers’ conclusion from this study? Does the experiment’s inter-
nal validity or external validity cast doubt on this conclusion? Why or why not?
2.25 What are the main advantages and disadvantages of a laboratory experiment?
How about a field experiment?

2.26 Decide whether each of the following statements is true or false. For any false state-
ment, write a new sentence, changing the original sentence so it is true.
a. A completely randomized design offers no control for confounding variables.
b. Randomization controls for the placebo effect.
c. A cohort is a group of subjects with some defining characteristic in common.
d. Longitudinal studies involve repeated observations of the same variables over
long periods of time.
e. Natural experiments occur when experimenters intervene on an independent
variable in the real life setting of their subjects.
f. In observational studies, the independent variable is under the control of the
researcher.
2.27 What are three reasons experiments sometimes cannot be performed? For each
reason, say whether it absolutely prohibits experimentation or whether an experiment might
be possible at another time or in another way.
2.28 Briefly describe case studies, cohort studies, prospective studies, and longitudinal
studies. What features do these have in common? How do they differ?

FURTHER READING

For an introduction to the philosophy of experiments with a focus on the natural sciences,
see Hacking, I. (1983). Representing and intervening: Introductory topics in the philoso-
phy of natural science. Cambridge: Cambridge University Press.
For a historical perspective on experiment with a focus on the debate between Rob-
ert Boyle and Thomas Hobbes over Boyle’s air-pump experiments in the 1660s, see
Shapin, S., & Schaffer, S. (1985). Leviathan and the air-pump: Hobbes, Boyle, and the
experimental life. Princeton: Princeton University Press.
For more on the experimental approach in the social sciences with a focus on economics,
see Guala, F. (2005). The methodology of experimental economics. Cambridge: Cam-
bridge University Press.
For a case study on the role of instruments and measurements in experiments and stud-
ies, see Chang, H. (2004). Inventing temperature: Measurement and scientific progress.
Oxford: Oxford University Press.
For an account of the scientific method in physics and an early statement of the problem
of underdetermination, see Duhem, P. (1954/1991). The aim and structure of physical
theory. Princeton: Princeton University Press.
For a concise treatment of qualitative research and its methodology, see Golafshani, N.
(2003). Understanding reliability and validity in qualitative research. The Qualitative
Report, 8(4), 597–606.
For more on the role of thought experiments in science, see Horowitz, T. & Massey, G.
(eds.) (1991). Thought experiments in science and philosophy. Lanham: Rowman &
Littlefield.
For more on the use of big data in science, see O’Neil, C. (2017). Weapons of math destruc-
tion: How big data increases inequality and threatens democracy. New York: Broadway
Books.

CHAPTER 3

Models and Modeling

3.1 MODELS IN SCIENCE

After reading this section, you should be able to do the following:

• Characterize models, target systems, and how they relate
• Describe how similarities, differences, and scientists’ purposes are each important
for modeling
• Give three examples of scientific models, describing their features and how they
have been used
• Describe why and when modeling can be a useful scientific approach
• Outline the three main steps that are involved in modeling and say how each works

The Bay Model


In an unassuming warehouse north of San Francisco, California, there lies an enormous
model of the San Francisco Bay and the surrounding Sacramento–San Joaquin River Delta.
This Bay Model is amazing.
The Bay Model is basically a downsized reconstruction of an area in Northern
California—an area the size of the state of Rhode Island in the US, stretching from the
Pacific Ocean inland to Stockton and almost all the way to the state capital of Sacramento.
The model covers more than 1.5 acres (over 6,000 square meters) and is made out of
286 five-ton concrete slabs pieced together like a jigsaw puzzle. If you viewed it from
above, you would see the whole Sacramento–San Joaquin River Delta, and you could
gaze directly from the Port of Oakland to the Golden Gate Bridge (about 12 miles, or
19 kilometers, away in real life). This is possible because, as large as it is, the Bay Model
is 1,000 times smaller than the actual San Francisco Bay, a large body of salty ocean
water surrounded by a large urban population living in a variety of geological terrains
and climates.
The Bay Model is a hydraulic model; it can be filled with water, just as the real San
Francisco Bay is. Pumping systems move the hundreds of thousands of gallons (1 gallon =
3.785 liters) of water in the model and do so in a way that mimics the tides and currents
of the real bay. This works in part because the model is three-dimensional and propor-
tional, so the different parts of the bay and river delta in the model are the right amount
lower than sea level, and the surrounding land is the right amount above sea level. The
Bay Model also includes many other features that affect water flow, such as rivers, canals
in the delta, wharfs, bridges, and breakwaters.

FIGURE 3.1 View of the San Francisco Bay Model

The Bay Model is not just a toy model, however. It’s a scientific model, and this has
some important implications. Scientific models are constructed and investigated in order
to learn, not just about the model itself, but also about phenomena in the real world.
This particular model is a terrific tool for learning about the San Francisco Bay and how
human activities can affect it. Teachers, students, and scientists use it to study geography,
ecology, human and natural history, and hydrodynamics. It has been used to help answer
questions about how dredging new shipping channels would affect the San Joaquin River
Delta, about how mining during the California Gold Rush changed the rivers, and about
what would happen if the system of dikes and levees in the delta failed.

Why Models?
Chapter 2 discussed the role of experiments and non-experimental studies in science,
considering especially how these are used to generate data to compare with expectations,
providing evidence for or against hypotheses. In this chapter, we will survey another
important feature of science that relates to experimentation in interesting ways: the
use of models. To uncover the roles that models play in science and to see how the Bay
Model in particular works, let’s look back at why that model was originally constructed.
(See Weisberg, 2013, on this case study and an overview of the use of models in science.)
John Reber moved from Ohio to California in 1907 and set up as an amateur play-
wright, dramatist, and theatrical producer in the 1920s and 1930s. Because of his work,
he enjoyed social connections with numerous businessmen and politicians. In the 1940s,
Reber became dismayed that the transcontinental railroad terminated in Oakland rather
than San Francisco, and came to believe that the bay that isolated San Francisco from
the rest of California and the United States interfered with industry. He saw that large
body of water as a ‘geographic mistake’ to be corrected.
Reber’s career was in entertainment, and he had no expertise in science or engineer-
ing. Nonetheless, Reber intrepidly proposed a grand plan to re-engineer, and then exploit,
natural features of the bay that he thought would enable more efficient use of it. He
suggested filling some parts of the bay to create additional land for things like airports and
factories and to establish two lakes to store freshwater supplied by the rivers that empty
into the bay. As freshwater has always been a limited resource in the San Francisco Bay
area, it could be valuable to repurpose the bay for potable drinking water and irrigation.
Reber’s plan was taken seriously, and the US Army Corps of Engineers decided to
test it out. An immediate problem, though, was that the corps couldn’t effectively test
out Reber’s plan in the actual bay without implementing the plan. What to do? How
could they consider the effects of the plan without going ahead and carrying it out? Such
circumstances highlight one way in which scientific models are particularly useful. When
performing an intervention on a system of interest isn’t possible, practical, or otherwise
desirable, a model of the system can be used instead.
Consider another example of a circumstance when modeling is useful. Suppose you
are playing chess against a computer and are considering moving, say, your rook. How will
that move affect the next three moves in the game? The easiest way to find out would
just be to move the rook and see what happens. But the easy thing to do isn’t always the
best thing to do. Without thinking through the consequences first, such a move might
result in a quick defeat. It would be helpful to have a second chessboard set up to be
just like the game that you’re actually playing but ‘offline’—in other words, it isn’t in
the midst of an actual game. That way, you could try out various moves and consider
moves that might be made in response. Doing so would help you anticipate how the
actual game might proceed without suffering any bad consequences in the process. The
offline chessboard might be a chessboard you’ve set up beside you, or it could just be a
chessboard you imagine, or it could be another game on a computer but not in active
play. Regardless, if the second chessboard is used in this way, it is a model of the actual
chess game. You’ve set it up to have the pieces in the same places, and you can then try
to figure out what your opponent might do were you to move your rook.
This is just like the decision of how to study Reber’s plan for the San Francisco Bay. The
Army Corps of Engineers wasn’t prepared to radically alter the bay and the surrounding
river delta before knowing what the results would be. They recognized that such changes
might have unintended negative consequences for the local water supply, wildlife, vegeta-
tion, agriculture, and human population. So, like a second chessboard used to explore
possible consequences of moves in a real game of chess, the Corps of Engineers built a
hydraulic model designed to be like the San Francisco Bay in some important respects.

FIGURE 3.2 The Reber Plan

This enabled them to investigate the consequences of the changes Reber had proposed,
by this time known as ‘the Reber Plan’.
Once they were confident that their model was sufficiently similar to the real San
Francisco Bay in the important respects, scientists could make predictions about the
real bay based on what they saw happening in the Bay Model. The model could then
be manipulated—an intervention could be performed on it—to determine what would
happen in the real bay were the Reber Plan implemented. The scientists did exactly that.
They built scale models of the dams that would create the proposed lakes and landmasses,
and then they sat back to see what would happen.
It turned out that, when the Reber Plan was implemented in the Bay Model, its
unintended consequences were disastrous. The dams didn’t create lakes at all but instead
stagnant pools with poor water quality that wouldn’t support ecosystems and couldn’t be
used for drinking or irrigation. Altering the dam configuration in the model in an attempt
to solve that problem just created another problem: fast currents that again destroyed
ecosystems and made travel in the bay significantly more dangerous. When the Corps of
Engineers reported these findings, the Reber Plan was abandoned.

Similarity and Difference


The real-world system that scientists want to study using a model is often called a target
system, or sometimes just a target. Many different kinds of things can serve as a model of
a given target. The important requirement for something to be used as a scientific model
is that it is taken to stand in for, or represent, a target system. Typically, this means that
the model needs to be like the target system; that is, it should be similar to, or resemble,
the target.

Box 3.1 History of Modeling


The rise of scientific modeling began in earnest in the mid-20th century, although
scientific models were by no means new at that point. The word model originates
from the Latin modulus and was used as early as the 1st century BCE by the
architect Vitruvius and later the theologian Tertullian to describe sculptural replicas.
The use of models gained traction in the 14th and 15th centuries for various
artistic and engineering purposes; for instance, in 1576, the astronomer Thomas
Digges described Copernicus’s heliocentrism as a model of the world, and half
a century later, Francis Bacon described mental representation as model- or copy-
based. Nonetheless, the general trajectory of scientific research through the 18th
and 19th centuries aimed at naming, ranking, and classifying entities in nature,
as well as at the discovery of physical laws, causal generalizations, and math-
ematical equations by direct empirical investigation and theorizing.
It was only in the 1940s and 1950s, when the search for laws began to wane
in many fields of science, that the use of scientific models became increasingly
common. In psychology, Edward Tolman and Kenneth Craik revived Bacon’s notion
of mental models, whose structure corresponds to the structure of the world and
which we use to interact successfully with our environment. In cognitive science,
a wide range of formal and computational models have been developed to capture
specific aspects of the mind. Modeling also became common in biology, perhaps
most famously with the groundbreaking double-helix model of DNA. In philosophy,
Mary Hesse’s work on models and analogies was equally important; she showed
how models provide scientists with resources for metaphorical redescriptions of
what is being observed or experimented upon. This rise of scientific modeling
also coincided with important (and controversial) efforts to develop classificatory
models of people on the basis of racial, cultural, ethnic, and socio-economic
differences. These models not only serve scientific goals but are also used for
bureaucratic purposes and to inform public policy.

Everything is similar to everything else in at least some regards, so any old similarity
won’t necessarily result in a good model. Rockets from the US Apollo space program were
white, cylindrical, rigid bodies, which were shaped much like a parsnip, but no one uses
an Apollo rocket as a scientific model of parsnips. Scientific models need to be similar to
their targets in relevant ways and dissimilar in irrelevant ways, at least for the most part.
This is why the Bay Model replicated tides and currents and other important features of
the San Francisco Bay, but not the number of sailboats in the bay.
So, the features of a model that scientists construct should be relevantly similar to the
features of the target system they think are important. This is what makes it possible to
get accurate information about a target from studying a model. Things are a bit more
complicated, though, since relevant similarity can be achieved in different ways. In the
example of the second offline chessboard in which you try out chess moves, it wouldn’t
matter too much if you replaced the chess pieces with colored paperclips or berries of
various sizes. You could even just draw your own chessboard on a napkin. The dissimilari-
ties between these approaches and the target—the actual chess game—don’t matter, so
long as they don’t interfere with the model’s ability to represent the intended features
of the chess game. Here’s a difference that would matter: using different-sized piles of
sand on a chessboard to represent chess pieces isn’t a good idea, since these piles can’t
be easily moved like chess pieces.
Intuitively, one way to achieve relevant similarity is to construct a model as similar
as possible to the target system. But as it turns out, this is usually a bad idea. Too much
similarity between a target and a model can actually be counterproductive. Had the
Corps of Engineers tried to build a model exactly like the San Francisco Bay in all rel-
evant respects, it would have been too large for them to have anywhere to put it, and it
would have changed so slowly they would have had to wait years to find out about the
consequences of the Reber Plan. Consider constructing a map of your hometown that
is exactly like it in every respect; it is three-dimensional, the same size as the real town,
contains a full representation of every building, shrub, alley, fire hydrant, stray cat, and
so on. Even if this could be done, why even bother with the model? You might as well
just investigate the town itself!
So, scientific models need not—indeed, should not—be similar to their targets in
every respect or even in most respects. Like maps, models are incomplete and usually
simpler than their targets. They’re designed to represent selected features of the target,
the features about which scientists want to learn. Their lack of completeness is part of
what makes them useful.
But what’s the right amount of similarity then? This is an important question that
doesn’t have a general answer. All scientists who work with models regularly consider the
extent to which some particular model should be like its target and the extent to which it
should be different. The Bay Model’s different spatial and temporal scales are two features
that made it useful for learning about the real San Francisco Bay and Delta. The model is
much smaller than the real bay, with much faster tidal cycles, which allowed the scientists
to observe what would happen with a spatially distributed, long-lasting sequence of events
in a short time and without having to leave the warehouse of the Bay Model.
Some other features of the real bay that were changed or ignored either didn’t mat-
ter or would have been too difficult to accurately incorporate. For instance, the model
doesn’t have any trees or buildings, as those were unimportant for its purpose. And being
inside a big warehouse is a difference with a practical benefit: the model isn’t exposed
to changing weather like the real bay is. The model also doesn’t incorporate the oceanic
wind currents that affect the bay; it’s tricky to see how those could be replicated and
whether the outcome of doing so would be worth the effort.
The scientists thus decided which features of the Bay Model should be similar to
the real bay and which could, or should, be different. They also had to decide how to
represent changing features of the San Francisco Bay. For example, they had to decide
whether the model should be like the actual bay is during dry seasons or wet seasons or
some combination of these. They had to get all of these features right, or right enough,
for the model to give them trustworthy information about how the bay would change
if the Reber Plan were carried out. As it turned out, the model they developed was suf-
ficiently similar to the real bay not only to serve this purpose but for it to eventually
be put to other uses as well. For example, the Bay Model was also used to study how a
later plan of deepening water channels would affect water quality.
One special type of similarity is called exemplification. For a model to exemplify some
group of target systems, it must itself be one of the target systems. Such a model is called
an exemplar. Researchers can use an exemplar to represent the broader class of targets that
includes the exemplar and can thus draw conclusions about the whole class of targets by
investigating the exemplar. For example, the fruit fly (which goes by the scientific name
Drosophila melanogaster) is a common model organism in genetics and developmental
biology. Just like Mendel used pea plants to understand how certain characteristics are
passed from one generation to the next, biologists have used the fruit fly to learn how
genes influence the development of embryos from single cells to mature organisms.
Fruit flies are small and reproduce quickly, and large populations are easily maintained
in labs. In addition to fruit flies being easy to keep and work with, scientists know about
their entire genome and so can intervene on their genes in precise ways. These interven-
tions allow scientists to identify specific sections of DNA within the genome that carry
information needed to produce specific molecules like proteins, which in turn influence
characteristics like fruit-fly size and color. As a model organism, the fruit fly is used to
reason about other organisms, such as the biological mechanisms of hereditary disease
and the regularities in the inheritance of physical characteristics observed by Gregor
Mendel. Scientists might study one population of fruit flies to learn about all fruit
flies or to learn about all insects or even about all forms of life, including human life.
The last, broadest range of target systems is surprisingly common in genetics research.

FIGURE 3.3 (a) Drosophila melanogaster; (b) The four chromosomes of Drosophila
Image from droso4schools.wordpress.com

Like all models, exemplars are both similar to and different from the target systems
they represent. For example, fruit flies have genes organized into chromosomes, as do all
other living organisms. This is an important similarity for their use as a genetic model.
But fruit flies have only four chromosomes, so they are much simpler genetically than
many other organisms. Further, because they breed very quickly, they have much shorter
generations than many organisms. These features make them very good models to use in
labs, but they also make them somewhat unrepresentative of all other organisms out there.
To sum up, target systems are real-world phenomena selected for study, models are
constructed to represent target systems for particular purposes, and models are similar to
but also different from their targets in various ways. Most similarities and differences are
carefully chosen, not only so the model can be developed and studied, but also—impor-
tantly—so it can provide accurate information about the target system. Studying a model
can lead to knowledge about a target system insofar as the model can stand in for that
system.

Specification of Target Systems


Next, let’s think more carefully about the process of scientific modeling. Recall that sci-
entific experimentation tends to follow the general pattern of generating expectations,
performing an experimental intervention, and then comparing the data produced to the
predictions to confirm or disconfirm the hypothesis. There are, of course, many variations
on this pattern, and experiments are used for more than just testing hypotheses. Just like
experimental reasoning, model-based scientific reasoning comes in many different forms
and occurs in many different ways. But despite the numerous variations, there is also
a general pattern of how models tend to be constructed and used in science. That pattern
has three basic steps: (1) specification of the target system(s), (2) construction of the model,
and (3) analysis of the model. Let’s consider each of these steps, starting with the first.
At first glance, it might seem easy to specify the target system; this basically just
requires scientists to decide what it is they want to find out about using a model. Do they
want to learn about the effects of proposed changes to the San Francisco Bay? Examine
the genetic influences on some trait? Or, say, learn more about how the number of preda-
tors influences other animal populations?
But like everything else in science, things aren’t as simple as they at first seem. An
archer cannot accurately hit a target with her arrow if she doesn’t know where the tar-
get is or what it looks like. Similarly, scientists need to know quite a bit about a target
system before they can construct a model of it. This is a version of an age-old problem
called the paradox of inquiry: if you don’t already know what you’re looking for, how can
you inquire about it? The central reason to develop a model in the first place is to gain
knowledge about the target, but in order to learn about a target using a model, scientists
must already know about that target.
Scientists may initially know little to nothing about the target systems they want to
investigate—especially when those systems are very distant in space or time, or exces-
sively large or small. Yet, without some knowledge about a target, scientists can’t evaluate
whether the model is similar enough to the target, and in the right ways, to accurately
represent it. So, at the beginning of the modeling process, scientists need to be able to
conceive of what a model should be a model of and what they want to learn from the
model. This can be preliminary and partial, just enough to get the process going. For
the Bay Model, for example, the task was to evaluate the feasibility and any unforeseen
consequences of the Reber Plan for damming up the bay. Scientists didn’t know what
in particular they’d be evaluating—for example, whether strong currents would result or
excessive evaporation would occur.
In order to later construct a model that relates to the target in the right ways, scientists
must also possess more specific information about at least some aspects of the target
system. This point actually suggests two requirements: scientists need to know which
features of the target system are important, and they need to have more specific informa-
tion about those features. For example, when planning the Bay Model, scientists had to
guess that the tides and currents might be important features. And then, in order to be
able to calibrate the model to have the same tides and currents as the real San Francisco
Bay, the engineers needed access to a lot of information about these features of the real bay.
To get the needed data, 80 people took measurements at different locations throughout
the 1,424 square kilometer (550 square mile) bay every 30 minutes throughout a full
tidal cycle of 48 hours. They recorded tide velocity and direction, changes in the water’s
salinity (salt content), and the concentration of sediment. All of these data were needed
in order to even decide what features a model of the bay should have.

Constructing the Model


Once a target system has been specified, scientists can begin constructing models. Some of
their preparatory work is already accomplished in the specification of the target system,
since part of that task was to specify its important features. But there are still questions
to answer about how a model should be constructed in relationship to a target. What
kind of model is called for? What features should be designed to be similar to the target
and to what degree should they be similar? Should the model be designed to apply to
more than one target system? The answers to these and other questions are influenced
by scientists’ exact goals and the nature and extent of their background knowledge about
the target system.
For the Bay Model, the scientists elected to construct a physical replica of the target,
but, as we’ll see, there are many other approaches to models. The San Francisco Bay is
a complex system, and one advantage of a physical model is that the scientists didn’t
need to understand how changes occur in the bay to predict those changes. Instead, their
approach was to make the replica as similar to the bay as possible in all the ways they
thought might matter, and then sit back and see what happened. Still, the model required
extensive calibration—comparison with the real bay followed by adjustment—before it
was sufficiently accurate. The engineers had to tinker with the scales used for depth and
width of the bay in order to get the proper water flow. They ended up making the model
bay much deeper proportionally than the real bay, which helped. But this resulted in
water moving too quickly in shallow parts of the model. The researchers compensated
for this by adding 250,000 copper strips to the bay floor in the model to increase water
resistance. They chose how many copper strips to add to any given place by comparing
the model’s water flow with that of the real bay.
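The logic of this calibrate-and-compare loop can be sketched abstractly. In the following Python toy, with an entirely made-up one-line ‘model’ and made-up numbers, a resistance parameter is increased bit by bit, like adding copper strips, until the model’s flow matches the measured flow of the target system:

    TARGET_FLOW = 2.4    # measured flow in the real system (arbitrary units)

    def model_flow(resistance):
        """Invented stand-in for the physical model: more resistance, less flow."""
        return 6.0 / (1.0 + resistance)

    resistance = 0.0
    while model_flow(resistance) > TARGET_FLOW:  # flow still too fast?
        resistance += 0.01                       # add a little more resistance

    print(f"calibrated resistance: {resistance:.2f}")  # about 1.50
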
Other modeling approaches offer different advantages and involve different difficul-
ties of model construction. We’ll survey different kinds of models later in this chapter.
For now, consider an example of a different kind of scientific model. The Lotka-Volterra
Model is an influential model in ecology developed (independently) by Alfred Lotka and
Vito Volterra in the 1920s (see Volterra, 1928). Unlike the Bay Model, the Lotka-Volterra
model does not lie in any warehouse. It’s a simple, abstract mathematical model. What
this means is it uses mathematical equations to represent the interactions of predators
and their prey, like foxes and hares, lions and wildebeest, polar bears and seals, and so
on. Here are the equations:

dx/dt = αx − βxy
dy/dt = δxy − γy

One variable, x, stands for the number of prey animals (for example, seals), and another
variable, y, stands for the number of predator animals (in this case, polar bears). In this
model, both x and y represent variables in the target system; each plays the role of an
independent variable influencing the other’s rate of change, and of a dependent variable
changing over time. (Independent and dependent variables were discussed in Chapter 2.)
These equations can be used to
calculate how predator and prey population numbers change over time (represented in
the model as the derivatives dx/dt and dy/dt) from the combination of those population
numbers and a few other parameters. A parameter is a quantity whose value can change
in different applications of a mathematical equation but that only has a single value in
any one application of the equation. In these equations, α, β, δ, and γ are parameters. These
help the model take into account the prey population’s rate of growth without predation,
the rate at which prey encounter predators, the predator population’s rate of growth, and
the loss of predators by either death or emigration.
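To see how such equations can be put to work, here is a minimal sketch in Python that steps the Lotka-Volterra equations forward in time using simple (and admittedly crude) Euler integration. The parameter values and starting population sizes are invented for illustration only, not estimates for any real predator-prey system.

```python
# A minimal sketch: stepping the Lotka-Volterra equations forward in time
# with simple Euler integration. Parameter values and starting populations
# are invented for illustration only.

alpha, beta = 1.0, 0.1     # prey growth rate; rate of predator-prey encounters
delta, gamma = 0.075, 1.5  # predator growth per encounter; predator loss rate

x, y = 10.0, 5.0   # initial prey (x) and predator (y) population sizes
dt = 0.001         # small time step for the numerical approximation

for step in range(50_000):
    dx = (alpha * x - beta * x * y) * dt   # dx/dt = αx − βxy
    dy = (delta * x * y - gamma * y) * dt  # dy/dt = δxy − γy
    x, y = x + dx, y + dy
    if step % 10_000 == 0:
        print(f"t = {step * dt:4.0f}: prey = {x:6.1f}, predators = {y:5.1f}")
```

Running the sketch shows the characteristic cycles: prey numbers rise, predator numbers follow, the prey crash, and the pattern repeats.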
The Lotka-Volterra model represents predator-prey interactions, but there's no straightforward
way in which these equations are similar to animals eating other animals. Instead,
the similarity is between the numbers that solve these equations for particular values
of the variables and parameters and the change in size of predator and prey popula-
tions over time in particular circumstances. Recall that the pieces in a chess game can
be represented with paperclips or berries, so long as they can make similar moves. The
Lotka-Volterra model is like that, but with an even more radical difference between the
model and the target.
The variables and parameters of the Lotka-Volterra model are both explicit parts of
the model; they are visible in the equations printed here. What doesn’t appear are the
model’s assumptions, but those are just as important a part of the model. An assump-
tion, in this sense, is a specification that a target system must satisfy for a given model to
be similar to it in the expected way—in this case, in order for the numbers solving the
equations to indicate the actual change in predator and prey population sizes. Numerous
assumptions must be satisfied for the Lotka-Volterra model to apply. For example, the
model assumes that the prey population will expand if there are no predators and that
the predator population will starve without prey. Both of these assumptions are pretty
likely to be true.
The model also assumes that prey populations can find food at all times, that predators
are hungry at all times, and that both predators and prey are moving randomly through
a homogeneous environment. These three assumptions are probably not true of any target
system, that is, of any predator and prey populations. These assumptions are idealizations,
or assumptions made without regard for whether they are true, often with full knowledge
they are false (see McMullin, 1985). These and other idealizations enable scientists to
concentrate on the bare essentials of predator-prey interactions they want to focus on,
without getting lost in complicating details of real predator and prey populations. There
are many deep questions to ask about idealizations in science; for now, notice that these
assumptions are good enough, even if they are wrong, if the model’s solution matches
up with how the population size really changes. In that case, these idealizations don’t
interfere with the model adequately representing the target.
FIGURE 3.4 Visual representation of the Lotka-Volterra model

Because models can be similar to target systems in different ways, a single target
is sometimes represented by multiple models. This can be useful when the real-world
phenomenon is so complex that no single model can provide scientists with all of the
desired information. The weather is a good example of this. Any meteorological model
can only capture a few of the factors needed to generate reliable predictions about the
weather. Some meteorological models may invoke humidity, temperature, and dew point
to describe and predict certain basic weather patterns like precipitation. Other models
may invoke more specialized parameters, such as central pressure deficit, along with more
basic ones, such as wind speed and direction, to describe and predict a particular phenom-
enon like hurricanes. Sometimes meteorologists aim to make more reliable predictions
by carefully cobbling together the results of different models of a given weather system.
It’s also possible for a single model to have more than one target system. A model
might be designed to represent a repetitive activity or a type of event that occurs in many
different places. The Lotka-Volterra model is like that; it is designed to capture something
important about seal and polar bear populations, wildebeest and lion populations, and
many more. And the same meteorological models can be used to represent a number of
different hurricanes, as well as typhoons and cyclones.

Analyzing the Model


Once a target has been specified and a model selected or constructed with that specifi-
cation in mind, the model must be analyzed in order to learn about the target. Using or
manipulating a model can occur in different ways. Scientists might literally move parts
of the model or alter certain internal relationships or introduce some external condition.
This kind of physical manipulation was used on the Bay Model to test the Reber Plan.
For a model organism like the fruit fly Drosophila, scientists may alter a gene and see
how their offspring then change. Models that involve equations, like the Lotka-Volterra
model, can be mathematically analyzed with different values for parameters or variables;
these represent specific assumptions about the target populations.
Such manipulations produce data that—if all goes well—can be used to learn about
the target. This is perhaps the main purpose of analyzing a model: to draw conclusions
about the target system(s). For example, the Bay Model was eventually used to show
that freshwater lakes couldn’t be maintained in the San Francisco Bay, as the Reber Plan
called for, and that the planned dams would have disastrous unintended consequences
to the local environment. On this basis, it was concluded that the Reber Plan shouldn’t
be implemented in the real San Francisco Bay.


Another purpose of analyzing models is to use existing data to assess and improve
the extent to which a model represents its target. Recall that specifying the target and
constructing the model involve a bit of guesswork. If scientists fully understood a target
system, it wouldn’t be necessary to model it. And some of the assumptions needed for
a model might end up interfering with how well the model represents the target. For
these and other reasons, researchers may not trust that what happens in the model will
happen exactly as it does in the target. An example of this use of model analysis is the
extensive calibration of the Bay Model that we discussed earlier. This also highlights how
the different steps of modeling can come in different orders or be intertwined.
Different models with the same target are sometimes also analyzed to see whether and
to what extent the different models have the same results. This kind of analysis is called
robustness analysis. This is one way of determining which models are trustworthy for
prediction and explanation—especially when their targets are highly complex systems
like the climate or predator-prey interactions. Robustness analysis begins by generating
multiple models of a target. For example, climatologists develop several distinct models for
predicting changes in the temperature in a specific region. If multiple meteorological mod-
els with different variables, parameters, and assumptions all predict an upcoming increase
of temperature in the region, this prediction is robust (and should be taken particularly
seriously). On the basis of similar predictions from different models, scientists may be able
to find the common features of the models that give rise to the robust prediction. They
can then examine how this core structure might relate to stable relationships involved
in the complex phenomenon of interest. In this way, climatologists and other scientists
studying complex systems can learn whether and to what degree the predictions of a
model should be taken seriously.
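As a toy illustration of robustness analysis, here is a sketch in Python: three deliberately simplified "models" of regional temperature change, each resting on different assumptions, are checked for agreement on the direction of the predicted change. The functions and numbers are invented for illustration; real climate models are vastly more complicated.

```python
import math

# A toy robustness analysis: three simplified "models" of temperature
# change, each with different assumptions, are checked for agreement on
# the direction of their predictions. All numbers are invented.

def linear_trend(years):
    return 0.02 * years                            # assumes steady warming

def co2_driven(years, co2_growth=0.005):
    return 1.5 * ((1 + co2_growth) ** years - 1)   # warming tracks CO2 growth

def cycle_plus_trend(years):
    return 0.015 * years + 0.1 * math.sin(years / 11)  # trend + natural cycle

horizon = 30   # predicted temperature change 30 years from now
predictions = [model(horizon) for model in
               (linear_trend, co2_driven, cycle_plus_trend)]

# The prediction is robust if every model agrees the temperature will rise.
if all(p > 0 for p in predictions):
    print("Robust prediction of warming:", [round(p, 2) for p in predictions])
```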

EXERCISES

3.1 Define model and target system in your own words, and say how the two relate. For
a modeling example from this section, say what the model is, what the target system
is, how they are related, and what the model is useful for.
3.2 One very familiar kind of scientific model is a mechanical model of the solar system,
called an orrery. These models are used to represent the relative positions and move-
ments of the Sun, planets, and moons. (If you have never heard of an orrery model, then
do some research on the internet or elsewhere to get a better idea of what they are.)
a. List as many similarities and as many differences between this model and tar-
get system as you can. You should have at least six similarities and at least six
differences.
b. Order the similarities from the most important to least, and then do the same
with the differences.
c. Describe the significance of each of the two similarities and two differences that
seem to be the most important. For each, say why you think the model-builders
chose to make the model similar to or different from the target system in that way.
3.3 State in your own words the main goal of each of the three steps of modeling, as
described in this section. Then, describe how each step may be involved for some
use of an orrery (a mechanical model of the solar system).
3.4 Suppose that you want to model the interactions between predators and prey, for
example, hawks (the predator) and mice (the prey). Make a list of at least five fea-
tures of that target system you think your model should take into account. Then, for
each feature, say how it is similar or different in other predator-prey systems. For any
features that are different, can you think of a related feature that would be similar
between the systems?
3.5 What features of modeling make it a useful approach when an experiment is not
possible and why? What features of modeling make it a useful approach when a
phenomenon of interest is highly complex and why?
3.6 Chapter 2 outlined the perfectly controlled experiment, which some refer to as the
‘gold standard’ for science. However, the National Weather Service usually opts
for modeling when studying the weather and making weather forecasts. Does this
suggest the Weather Service’s results are less scientific, in so far as they don’t aim
for this ideal? Why or why not?
3.7 The National Weather Service uses lots of climate models. Each of the models
(1) represents the climate system in a different way and (2) is inaccurate in some
way. Explain each of these features with reference to information from this section.
Why do you think the National Weather Service does not rely on just one single
climate model in making its predictions?
3.8 Can you think of another complex target system that, like the weather, may require
multiple models to investigate? Name two such systems. Then, explain what makes
those systems so complex. Why do you think scientists may benefit from constructing
multiple models of these systems?
3.9 Sketch how experiments involve the three main steps of generating expectations, per-
forming an intervention, and then analyzing the resulting data. State the three main
steps in modeling, and describe the similarities between those and the three main steps
in experimenting. Then, describe how modeling and experimenting are different.
3.10 Find two different maps of your city or town, on the internet or on paper.
a. For each map, assess its (i) completeness (does it represent all/most/many or
just a few features of the city/town? Which features?), (ii) accuracy (does it pro-
vide an accurate representation of the city/town? How accurate? What does
it get wrong?), and (iii) purpose (what does it seem like people use the map
for? How is that purpose served by the attributes you identified with respect to
completeness and accuracy?).
b. In light of your analysis, say whether one of these maps is better than the other.
If so, in what way(s) is it better? If not, why not?

3.2 VARIETIES OF MODELS

After reading this section, you should be able to do the following:

• Indicate the differences between models of data and models of phenomena


• Describe the three steps to constructing a data model, using an example
• Give examples of models of these five types: scale, analog, mechanistic, computer,
mathematical
• Discuss how each of the five types of models varies along the concrete/abstract
dimension

Types of Models
As we have seen, scientific models aren’t always like toy models of airplanes or bays
filled with water. Indeed, the range of things that count as scientific models is extremely
broad. Scientific models can be concrete physical objects, such as the Bay Model or
Watson and Crick’s double helix model of DNA, which is made of metal plates. They
can also be abstract mathematical objects, like the Lotka-Volterra model of predator/prey
interaction, or mental simulations of possible sequences of events. Some models have
both concrete and abstract features, such as computer models, which include concrete
physical components as well as software components that allow us to interact with the
computer and perform tasks. Scientists often rely on computers for modeling complex
phenomena, including the weather and global climate change, the origin of the universe,
and what the world economy will be like in 20 years.
In this section, we’ll classify some types of models. This will help clarify how scientific
models differ, and the kinds of choices scientists make when they use models to investi-
gate the world. We’ll first distinguish between models of data and models of phenomena.
Everything we’ve discussed so far in this chapter has been about models of phenomena;
data models play a different role. Then, we’ll discuss five different types of models of phe-
nomena. The categories identified aren’t mutually exclusive: a single model might count
as more than one of these types. Nor are the categories jointly exhaustive, since there are
also other types of models beyond those we discuss—robot models, for example, which
are sometimes used in science and engineering to model how humans or other animals
can interact with their environment to successfully perform complex tasks.

Models of Data
A model of data, or data model, is a regimented representation of some data set, often
with the aim of highlighting whether or not the data count as evidence for a given
hypothesis. The concept of data was encountered in Chapter 2, in the discussion of
experimental and observational studies. Recall that data are any public records produced
by observation, measurement, or experiment. Video recordings of capuchin monkey
behavior, observations of the positions of planets in the night sky, readings of a ther-
mometer, participants' answers on a questionnaire in a psychological experiment, and
locations logged by GPS on phones are all examples of data. Such recordings are raw data,
which must be processed before they are useful to scientists. For instance, observations
of the positions of planets in the night sky need to be corrected for measurement errors,
organized by time and day, arranged into some scale, and put into a visual format such
as a graph or table. Only then can astronomers use those data to gain knowledge about
the behavior of the planets. This process of data correction, organization, and visualiza-
tion results in a model of the data.
Data models are a rather different kind of model from the models discussed so far.
They do fall under our general definition of a model, since they are representations that
are investigated in place of what they represent. But what is represented are not phe-
nomena—what we’ve called target systems—but data. Data models thus play a wholly
different role in scientific reasoning than models of phenomena.
The first step in constructing a data model is to eliminate presumed errors from the
data set. Consider measurements of the positions of a certain planet in the sky—say,
Mercury, over a period of days. Those measurements will be influenced by more than
Mercury’s position. They will also be affected by some combination of human mistakes,
flaws and limitations of instruments, like the telescope, and inaccuracies due to changing
atmospheric conditions. Scientists can try to identify and correct these errors in various
ways. They might calibrate the telescope or record the atmospheric conditions along
with their measurement of Mercury’s position. This additional information can guide the
decision of which data are questionable and should be eliminated. This process is called
data cleansing.
Once erroneous data are removed from the data set, the next step is to represent the
clean data in a meaningful way. Data of Mercury’s position in the sky over a period of
days may initially be visualized as points on a chart. These points will probably be used as
the basis for a curve that represents Mercury’s progression in the sky. The points represent
the scientists’ measurements. The curve, in turn, represents the scientists’ best guess for
Mercury’s continuous path through the sky. This final representation is the data model.
We can generalize from this example to other data models. Of course, it’s not always
spatial position that’s being measured. There is, though, a common progression of (1) elimi-
nating errors, (2) displaying measurements in a meaningful way, and then (3) extrapolating
from those measurements to the expected data for measurements that weren’t actually
taken. This is what happens when scientists use points on a chart to draw a curve rep-
resenting Mercury’s position, even for times and days when data weren’t collected. As
we’ve suggested, this involves some amount of guesswork.
Indeed, how to extrapolate from measurements to create a data model is a compli-
cated enough task that it has its own name: the problem of curve fitting. To get an idea
of the problem, suppose that you have data for two variables—say, air pollution and life
expectancy—and you want to figure out the general mathematical relationship between
the two. That is, you want to learn how people’s life expectancy changes as a function
of the level of air pollution where they live. The mathematical equation capturing this
relationship will describe a curve that will ‘fit’ your observations. The basic problem of
curve fitting is that data, no matter how much you collect, are always consistent with
different curves.
Put in terms of underdetermination, which was introduced in Chapter 2, the data
underdetermine which equation captures the relationship between these two variables,
air pollution and life expectancy. See Figure 3.5. So, how should scientists decide which of
the equations defining a curve passing through their data captures the real relationship?
There is no easy answer.
Finding the curve that best fits all available data, no matter what, is seldom the best
approach. Sometimes, data models can fit the data too well; this is called overfitting a
model to the data. The problem with sticking too closely to the actual data is that those
data are never perfect. There might be outliers, or values that deviate from the norm for
one reason or another. There is also the possibility of noise, or influences on the data that
are incidental to the focus, such as confounding variables.
Scientists want their data models to be better than the actual data they’ve collected.
In the end, which model of data is the right one depends on several factors, including
the goals of the scientists, their background knowledge, and considerations of how easy
the data model is to use to make predictions.
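To make the overfitting worry concrete, here is a sketch in Python that fits the same made-up data with a straight line and with a high-degree polynomial. The wiggly curve passes closer to every observed point but does worse on a held-out point; all numbers are invented for illustration.

```python
import numpy as np

# Made-up data relating two variables; the last point is held out to test
# how well each fitted curve predicts data it has not seen.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9, 14.2])
x_train, y_train = x[:-1], y[:-1]

line = np.polyfit(x_train, y_train, deg=1)    # a simple curve: a straight line
wiggly = np.polyfit(x_train, y_train, deg=5)  # degree 5 hits all six points

for name, coeffs in [("line", line), ("degree-5", wiggly)]:
    train_error = np.abs(np.polyval(coeffs, x_train) - y_train).max()
    test_error = abs(np.polyval(coeffs, x[-1]) - y[-1])
    print(name, "| worst error on seen data:", round(train_error, 3),
          "| error on held-out point:", round(test_error, 3))
```

The degree-5 polynomial "wins" on the data used to fit it and loses badly on the held-out point: sticking too closely to imperfect data makes for a worse data model.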
Big data approaches, discussed at the end of Chapter 2, present significant data
modeling challenges. Big data sets provide science, public policy, and business with
an impressive resource for answering important questions. Data collected from social
media, for example, can be used to understand how often the public talks about politics,
sports, and sex; to make predictions about complex political and social events; and to
explain consumer behavior. But using big data to make predictions requires finding the
right models of the data. The difficulties we have briefly surveyed here are compounded
when modeling big data sets, as the conditions for and features of the data tend to be
less well understood. Chapters 5 and 6 elaborate on the statistical techniques scientists
employ to represent data and draw inferences from them.

FIGURE 3.5 The problem of curve-fitting

Models of Phenomena
As we’ve already seen in this chapter, models of phenomena provide ways to learn about
a phenomenon indirectly by studying the model. This use of models is very different
from data models, both in model development and in the role the models play. Models
of phenomena have been the main focus in this chapter, so our focus here will be on the
contrast between data models and models of phenomena. Data models are used in experi-
ments and non-experimental studies, where the phenomena are investigated directly. In
contrast, models of phenomena are often used to indirectly investigate phenomena. In
order to do this, scientists have to first learn about the model itself. Then they have to
find a way to convert their knowledge about the model into knowledge about the phe-
nomenon being modeled.

Building a model of a phenomenon is kind of like taking apart a toaster and putting
it back together again. A great way to learn about something is to try to build it, or
something that’s like it. Physical models might literally be built; other kinds of mod-
els, like equations and computer programs, are also built, only in a more metaphorical
way. Regardless, model construction should result in a model that represents the target
system(s). Scientists then manipulate and analyze the model to learn about the target
system(s). Just as the model represents the target system, manipulations of the model
represent manipulations of the system. Depending on the type of model, though, the
manipulations might be very different from what would happen in the actual target. And
then, so long as the model is similar in the right respects to its target system, scientists
can transform the knowledge they gain about the model’s behavior into knowledge of
the target system.
Recall how data models are better, more informative, than the data themselves.
Similarly, good models can be better for study than their targets. Consider a few ways in
which this is so. A physical model can provide a more quickly changing and simplified
version of a system. A mathematical model can enable precise predictions about a system
when its equations are solved. A computer model can be run again and again with differ-
ent conditions, simulating a range of possibilities. Differences between a model and the
phenomenon that is modeled are key to the value of model-based science, or learning
about the world indirectly through models.
Recall the discussion of how overfitting—that is, corresponding too closely to the
actual data—can hamper the value of a data model. Something similar is true for models
of phenomena. Scientists can go wrong by constructing a model that builds in too many
elements of the target system or is too similar to the target system. This could make
it so that the resulting model is only applicable in very narrow circumstances or too
difficult to study, either of which limits its usefulness. If instead a model is constructed
to incorporate only the most important, or most interesting, features of a phenomenon,
then it will be useful in lots of different ranges of circumstances. We see examples of
this in what follows.

Scale Models
To illustrate this use of models and the range of forms it can take, consider some
categories of models of phenomena. To begin, scale models are concrete physical
objects that serve as down-sized or enlarged representations of their target systems.
Architectural models of urban landscapes are a familiar example; these are widely
used in civil engineering. The Bay Model also belongs to this class, since it is a three-
dimensional physical object made of concrete slabs, copper tags, and water. The spa-
tial scale of the Bay model is 1:1000 (that is, 1 foot in the model represents 1,000
feet in the real world, where 1 foot = 0.3 meters) on the horizontal axis and 1:100
(that is, 1 foot in the model represents 100 feet in the real world) on the vertical
axis. Temporally, the Bay Model is also scaled; each 24-hour day is represented as
a 14.9-minute sequence, divided into 40 equal intervals of 22.36 seconds (that is,
one minute in the model represents 1 hour and 40 minutes in the real-world target
system, the San Francisco Bay).
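These scale factors are simple conversions, sketched below in Python. The factors come straight from the text; the helper function names are illustrative inventions.

```python
# Converting between Bay Model measurements and real-world quantities,
# using the scale factors stated above. Function names are illustrative.

HORIZONTAL = 1000          # 1 model foot = 1,000 real feet, horizontally
VERTICAL = 100             # 1 model foot = 100 real feet, vertically
TIME = 14.9 / (24 * 60)    # 14.9 model minutes represent a 24-hour day

def real_width_ft(model_ft):
    return model_ft * HORIZONTAL

def real_depth_ft(model_ft):
    return model_ft * VERTICAL

def model_minutes(real_hours):
    return real_hours * 60 * TIME

print(real_width_ft(2.5))   # 2.5 model feet of width -> 2,500 real feet
print(real_depth_ft(0.5))   # half a model foot of depth -> 50 real feet
print(model_minutes(24))    # a full day runs in about 14.9 model minutes
```

Notice that the horizontal and vertical factors differ by a factor of ten: this is the distortion, described above, that makes the model bay proportionally much deeper than the real bay.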

FIGURE 3.6 James Watson and Francis Crick’s double helix model of DNA

While the Bay Model is a scaled-down representation, other scale models are
enlarged representations of their targets. The historic discovery of the structure of
DNA by James Watson and Francis Crick in 1953 provided understanding of how genes
replicate and how parents transmit their characteristics to their offspring. Using wire
and tin plates, Watson and Crick had begun building scale models of DNA in 1951.
After several failures, the two scientists recognized from the work of Rosalind Franklin
that a model with a double helical structure had the best fit to current knowledge
about DNA. This model had a spatial scale of roughly 1,000,000,000:1. That is, an
inch (2.54 centimeters) in the double-helix model represented one one-billionth of
an inch in a real DNA molecule. (See Chapter 8 for more discussion of the discovery
of DNA’s structure, including Rosalind Franklin’s role.)


Analogical Models
Analogical models can be physical or abstract objects, depending on whether they rely on
physical or abstract analogies to represent their target systems. Scale models like the Bay
Model can be characterized as concrete analog models, as they share several physical prop-
erties with their targets. An example of an abstract analog model is the computer model
of the mind, which is based on formal similarities between computers and minds. Like
computers, the human mind is an information-processing system that can be described
in functional terms, without talking about its actual physical composition, or ‘hardware’.
Like computers, minds can be understood in terms of the operations they carry out in
order to solve certain tasks, or in terms of their ‘software’.
Here is another example of an analogical model, located somewhere between the Bay
Model and the computer model of the mind on the concrete-abstract spectrum. Another
hydraulic model, like the Bay Model, was built by William Phillips in 1949. But whereas
the Bay Model used water flow to represent a real body of water, Phillips’s model used
water flow to represent the British economy! This model is called the Phillips machine
or Monetary National Income Analogue Computer (MONIAC). The Phillips machine
was a set of plastic tanks, each representing some aspect of the economy, which were
connected by pipes and sluices and different valves. Dyed water, representing money, was
hydraulically pumped around the machine by an old airplane motor to simulate the ‘flow’
of money in an economy. An overhead tank, representing a treasury, could be drained so
that the water inside could flow to other economic sectors, like education, health care,
infrastructure and investment, savings, and so on. Water could be pumped back to the
‘treasury’ tank to represent taxation and state revenue, with pumping speeds adjusted to
simulate changes in tax rates. Exports and imports could also be simulated by adding or
draining water from the model.
The Phillips machine was a physical model, but it is not a scale model. (The British
economy isn’t itself operated hydraulically, of course.) Unlike the Bay Model, the
Phillips machine uses water flow as an analog to money flow. Changes in water level
and flow were analogous to changes in highly complex, abstract parameters of the
British economy. In its day, this actually was an amazingly accurate tool for learning
about how changes in different economic sectors affect others (Morgan & Boumans,
2004).
Relying on analogies is a particularly useful strategy in early stages of modeling, when
scientists may have little or no knowledge of the phenomenon they are interested in. This
enables scientists to focus on the salient features of a model and to let the discovery of
analogous features guide modeling approaches. For example, the similarity of the physical
arrangement of a spiral staircase to a DNA molecule was striking to Watson and Crick,
guiding their modeling efforts of DNA toward a double-helix structure. Watson, in his
memoir, says, ‘[E]very helical staircase I saw that weekend in Oxford made me more
confident that other biological structures would also have helical symmetry’ (1968, p.
77). Spiral staircases were useful analogous models for DNA, stepping stones toward the
scale model Watson and Crick ended up developing.
As knowledge about the target develops, analogical models may give way to models
less obviously related to the target systems they represent. As we have mentioned, the
Lotka-Volterra model is a set of mathematical equations, which is hardly analogous to
populations of predators and prey. But knowledge about those target systems was used
to develop mathematical equations that effectively—if indirectly—represent key relationships
among the populations in question.

FIGURE 3.7 William Phillips's MONIAC hydro-economic model

Mechanistic Models
Mechanistic models are representations of mechanisms. Mechanisms are organized sys-
tems consisting of component parts and component operations that are organized spatially
and temporally, so as to causally produce a phenomenon. Certain features of cells (like
neurons), organs (like brains), and whole organisms can be seen as mechanisms. Examples
of phenomena produced by mechanisms include blood circulation, protein synthesis, and
cellular respiration. Mechanistic models represent the causal activities of organized
component parts that produce some such phenomenon. By doing so, they can help illuminate
how the target phenomenon works and, in particular, how it depends on the orchestrated
functioning of the mechanism that produces it.
Mechanistic models can be physical structures representing concrete target systems,
such as an orrery. Other mechanistic models are physical structures representing more
abstract phenomena, such as the MONIAC Phillips machine model of the British economy.
But most mechanistic models are schematic representations of abstract structures
and functions and the relationships among them. For example, consider the mechanistic
model of the sodium-potassium pump in cells depicted in Figure 3.8. This is not a model
of a particular instance of a particular cell exchanging sodium ions for potassium ions.
Instead, it is a generic representation of what all such exchanges, in any living cell, have
in common.

FIGURE 3.8 Visual depiction of the sodium-potassium pump: sodium ions (Na+) inside the cell are exchanged for potassium ions (K+) outside the cell, across the cell membrane, powered by ATP

Mathematical Models
As we have seen with the Lotka-Volterra model of predator-prey populations, mathemati-
cal models are equations that relate variables, parameters, and constants to one another.
These models attempt to quantify one or more dependences among variables in the target.
For example, the Lotka-Volterra model uses a pair of first-order differential equations to
represent changes in predator and prey populations over time. The first equation,

dx/dt = αx − βxy

describes the rate of change, dx/dt, of the prey population x over time t, where αx represents
the prey population's exponential growth and βxy represents the loss of prey to predator/prey
interaction. The number of mice at a given time, for example, is determined by their
population growth, minus the rate at which they’re preyed upon by hawks. By contrast,
the number of hawks is fixed by their population growth given the supply of prey, minus
their mortality rate. Hence, the second equation,

dy/dt = δxy − γy

describes the rate of change, dy/dt, of the predator population y over the same time interval,
where δxy represents predator population growth and γy represents the loss of predators
due to death, disease, resettling, and so on.
Another example of a mathematical model is a game theory model called the prisoner’s
dilemma. Suppose that you and your friend Dominik have been arrested for robbing a
bank, and you’ve been placed in different cells. A prosecutor makes this offer to each
one of you separately:

You may choose to confess or to remain silent. If you confess and your accomplice
keeps silent, all charges against you will be dropped, and your testimony will be
used to convict your accomplice. Likewise, if your accomplice confesses and you
remain silent, your accomplice will go free while you will be convicted. If you both
confess, you will both be convicted as co-conspirators, for somewhat less time in
prison than if only one of you is convicted. If you both remain silent, I shall settle
for a minor charge instead.

Because you are in a different cell from your friend, you cannot communicate or make
agreements before making your decision. What should you do?
Assuming that neither you nor Dominik want to spend time in prison, you face a
dilemma. Each of you will be better off confessing than remaining silent, regardless of
what the other does. Either Dominik doesn’t confess, or he does. If Dominik doesn’t
confess and you do, you go free, whereas if you didn’t confess, you’d both be charged
with a lesser crime—and going free is better than being charged with a crime. If Dominik
does confess and you do also, you get charged as co-conspirators, whereas if you didn’t
confess, you’d be charged as solely responsible for the crime—and this carries a longer
prison sentence. So, regardless of Dominik’s decision, you are better off confessing.
However, the outcome of both you and Dominik confessing is worse for both of you
than the outcome of both you and Dominik remaining silent. In the first scenario, you
are both charged as co-conspirators, while in the second scenario, you are both charged
merely with a lesser crime. Thus, the prisoner’s dilemma seems to raise a puzzle for
rationality. You are better off confessing, regardless of Dominik’s choices, but if you both
are inspired by that fact to confess, things are worse for you than if you had both kept
your mouths shut. Reasoning independently, you should confess. But, even so, both of
you employing that reasoning leads to a worse outcome than if you’d both acted in the
best interest of your conspirator.
This situation is customarily represented using the mathematical formalism of game
theory. In its simplest form, the prisoner’s dilemma is a game described by the payoff
matrix shown in Table 3.1.
Although this situation may seem contrived, many real-life scenarios can be modeled
with a generic version of the payoff matrix, such as the one shown in Table 3.2. Here the
numbers are generic payoffs, or consequences for each decision. The higher the number,
the more desirable the payoff. The first number in each set of parentheses represents
Player 1’s payoff, the second number Player 2’s payoff. The players are also generic; they
might be suspects in a crime, or they might be any other people, businesses, nations,
animals, or even bacteria. Any entities that vary their behavior in response to others’
behavior are fair game.
The most basic relationship that characterizes the prisoner’s dilemma also dictates the
situations to which it can be applied. This basic relationship is that, no matter what one’s
partner chooses to do, one always does better by choosing to defect (in the original story,
to rat out your friend) rather than to cooperate (in the original story, to remain silent).
But—and this is key—players always do better if they are partnered with cooperators
than if they are partnered with defectors. (You’re always better off if your buddy doesn’t
rat you out, regardless of what you choose.) This mathematical model boils that scenario
down to simple numbers that represent the desirability of different outcomes.
The dilemma of the prisoner's dilemma thus amounts to how to encourage cooperative
behavior, which is better for everyone, in the face of the temptation to defect into selfish
behavior.

TABLE 3.1 Payoff matrix for the prisoner's dilemma with Dominik

                                  Dominik
                       Remains Silent                  Betrays
You   Remain Silent    Each pays a small fine          You get 3 years of prison;
                                                       Dominik goes free
      Betray           You go free;                    Each gets 2 years of prison
                       Dominik gets 3 years of prison

TABLE 3.2 Payoff matrix for a generic prisoner's dilemma

                           Player 2
                       Cooperate    Defect
Player 1   Cooperate   (2, 2)       (0, 3)
           Defect      (3, 0)       (1, 1)
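To see the dominance reasoning in the generic payoffs of Table 3.2, here is a small Python sketch that checks, for each choice the other player might make, which action gives a player the higher payoff. The payoff numbers are exactly those in the table; the function names are just illustrative.

```python
# Checking dominance in the generic prisoner's dilemma of Table 3.2.
# payoff[(my_action, other_action)] gives (my payoff, other's payoff).
payoff = {
    ("cooperate", "cooperate"): (2, 2),
    ("cooperate", "defect"):    (0, 3),
    ("defect",    "cooperate"): (3, 0),
    ("defect",    "defect"):    (1, 1),
}

for others_choice in ("cooperate", "defect"):
    best = max(("cooperate", "defect"),
               key=lambda mine: payoff[(mine, others_choice)][0])
    print(f"If the other player {others_choice}s, your best reply is to {best}.")

# Defecting dominates, yet mutual cooperation (2, 2) beats mutual defection (1, 1).
print("mutual cooperation:", payoff[("cooperate", "cooperate")],
      "| mutual defection:", payoff[("defect", "defect")])
```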
The prisoner's dilemma model has been applied in a variety of circumstances
to help account for scenarios involving cooperative behavior, ranging from symbiotic
relationships among organisms to the practice of not killing opponent soldiers that devel-
oped spontaneously in the trenches of World War I (Axelrod, 1984).
For example, consider the cleaning symbiosis. Individuals of one species, the
cleaner, remove parasites and dead skin from individuals of the other species, the
client. This happens in many pairs of species, but let’s focus on cleaner fish and cli-
ent fish. Cleaner fish have the choice of cooperating by cleaning the client fish or
defecting by eating extra skin from the client fish. Client fish have the choice of
cooperating by allowing the cleaner fish to clean safely or defecting by threatening
or eating the cleaner fish. The fish are better off if both cooperate: the client fish
gets an important cleaning, and the cleaner fish gets dinner. But there’s a benefit to
defecting for each: the cleaner fish would get a bigger dinner by eating more from
the client fish, and the client fish would get to eat the cleaner fish. The prisoner’s
dilemma has been used to represent these options and the circumstances that can
enable cooperative symbiosis to evolve.

Computer Models
Many real-world situations can be modeled as cases of the prisoner’s dilemma. But
what we’ve seen so far isn’t enough to demonstrate why business firms, gangsters,
animals, bacteria, and nations so often cooperate in real life. One important reason
is that, in most real-life scenarios, decisions about whether to cooperate aren’t made
in an isolated room, cut off from your partner, and in expectation that you’ll never
see that partner again. Real firms, gangsters, animals, bacteria, and nations interact-
ing with one another do not make their decisions once and for all, and without
communicating with one another. Instead, they might guess at what each other
might do, signal their own intentions, or interact repeatedly over time, allowing for
reputations to form.
The model of the prisoner’s dilemma introduced here does not capture these kinds of
interactions, but it can be extended so that it does. One common extension is to the iter-
ated prisoner’s dilemma, where we suppose that two agents play the prisoner’s dilemma
with each other repeatedly. This is one way in which cooperative behavior has a chance
of winning out over the selfish choice to defect.
Insight into how this can happen was provided in the 1980s by a computer game.
The political scientist Robert Axelrod invited various social scientists to submit computer
programs for a tournament of the iterated prisoner’s dilemma. Each computer program
had its own strategy governing the circumstances in which it would cooperate or defect,
and these programs were pitted against one another to see which would do the best in
the long run.
This was a computer model. Computer models or simulations are programs run on
a computer using algorithms, or step-by-step procedures, to explore aspects or changes
of a target system. Like other models encountered thus far, computer models can range
from incredibly simple to quite complex. The goal is to create insight into some target
system(s) by examining a similar set of dynamics encoded in a computer program.

It’s unusual for computer models to invite participation from other scientists, as in
Axelrod’s tournament, but by doing so, Axelrod made it so that the strategies available
weren’t limited by what he could imagine or what he thought would be successful. And,
indeed, the result surprised him. The winning strategy—that is, the strategy that accu-
mulated the most points in the iterated prisoner’s dilemma tournament—belonged to a
program named Tit-for-Tat, submitted by the psychologist Anatol Rapoport. The program
was so simple that it had only a few lines of programming code. Tit-for-Tat cooperated
in the first round of any game it played in the tournament, and then it simply mirrored the
other player’s previous action in every round thereafter. So, when Tit-for-Tat played
against generally cooperative players (other programs), it also cooperated and so reaped
the rewards of that mutual benefit. But when Tit-for-Tat played against uncooperative,
selfish players, which defected a lot, it too played selfishly after that initial cooperative
move. This protected it from exploitation by selfish programs. Axelrod’s computer simula-
tion thus demonstrated the success of a strategy of reciprocal cooperation, which is often
called reciprocal altruism (see also Rapoport, Seale, & Colman, 2015 for a more recent
assessment of Tit-for-Tat).
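A bare-bones version of such a tournament can be sketched in Python. The three strategies below (Tit-for-Tat, always defect, always cooperate) and the payoff numbers reuse the generic matrix of Table 3.2; Axelrod's actual tournament involved many more strategies and more elaborate scoring.

```python
# A bare-bones iterated prisoner's dilemma, reusing the payoffs of Table 3.2.
# Each strategy sees the opponent's previous moves and returns "C" or "D".

PAYOFF = {("C", "C"): (2, 2), ("C", "D"): (0, 3),
          ("D", "C"): (3, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then mirror the opponent's previous move.
    return "C" if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return "D"

def always_cooperate(opponent_history):
    return "C"

def play(strategy_a, strategy_b, rounds=100):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = strategy_a(history_b), strategy_b(history_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b

print("Tit-for-Tat vs cooperator:", play(tit_for_tat, always_cooperate))
print("Tit-for-Tat vs defector:  ", play(tit_for_tat, always_defect))
```

Against a cooperator, Tit-for-Tat reaps the rewards of mutual cooperation every round; against a defector, it loses only the first round and then protects itself by defecting thereafter.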

EXERCISES

3.11 In your own words, characterize models of data and models of phenomena, and
give an example of each. How are these types of models similar? How are they dif-
ferent from each other?
3.12 We have characterized the steps of data modeling as (1) eliminating errors, (2) display-
ing measurements in a meaningful way, and then (3) extrapolating from those mea-
surements to the expected data for measurements that weren’t actually taken. Describe
each of these steps for any example of a dataset from this section or Chapter 2.
3.13 Describe the curve-fitting problem, and indicate how it relates to the three steps of
data modeling.
3.14 List the five types of models of phenomena described in this section, and give an
example of each. For each example, indicate why it counts as a model of that type
and what target system(s) it is supposed to represent. Then, rank your examples from
1 to 5, where 1 is the most concrete relationship to the target system(s) and 5 is
the most abstract.
3.15 Define mechanism in your own words. Then, refresh your memory of photosynthesis.
(You probably encountered this in high school science if not since.) Consider this as
an example of a mechanism by outlining (a) the main component parts and (b) how
their operations are organized so as to constitute the mechanism’s activity. Then,
consulting the description you’ve developed, say whether you think photosynthesis
is a mechanism and why or why not.
3.16 Thomas Schelling, an American economist and Nobel Prize winner, famously devel-
oped a model of segregation in 1971 (see also Schelling, 1969). The model utilizes
a checkerboard, pennies, and dimes. Initially, squares on the board are filled ran-
domly by either a penny or a dime or left empty. Over time, pennies and dimes are
moved around the board according to a rule representing whether they were satis-
fied to stay in their current location. Schelling discovered that a movement rule rep-
resenting a preference for at least a small percentage of like neighbors would, over
time, lead to segregated patches of pennies and dimes on the board. An example
of a movement rule representing such a weak preference is the following: an occu-
pant moves if fewer than three of the eight adjoining squares have occupants the
same as the occupant (pennies if the occupant is a penny, dimes if the occupant is
a dime); otherwise, it stays.
A main application of this model is to housing segregation, where the model shows
that even a weak preference for at least a minority of neighbors to be the same as
oneself can lead to segregated patches of like inhabitants. (Importantly, this does not
show such a weak preference was in fact what led to housing segregation in any
given instance.)
a. In this application, what does the checkerboard represent, and what do the
pennies and dimes represent?
b. What does the movement rule represent? (This one is tricky.)
c. List some of the idealizations needed to use the model to represent housing
segregation. (Idealizations were discussed in 3.1.)
d. We said that Schelling’s model doesn’t show that weak individual preference
in fact led to housing segregation. What are the implications of this model for
segregated housing?
3.17 Mathematical models are among the most abstract representations of target systems.
Describe how it is that mathematical models represent target systems. You might look
back at our discussion of the Lotka-Volterra model and/or the prisoner’s dilemma
model for help.

3.3 LEARNING FROM MODELS

After reading this section, you should be able to do the following:

• Characterize the similarities and differences between models and experiments


• Identify the three features all models share
• Describe how similarity, difference, and social convention are involved in representation
• Define trade-off and give an example
Modeling as Experimentation and Theorizing


To close this chapter, let’s consider more fully the roles that models of phenomena play in
science, and the relationship to phenomena that enables them to play that role. A helpful
comparison is with the roles of experiments that we outlined in Chapter 2. In experiments,
scientists intervene directly on the target system. By contrast, modelers often intervene
on a model, as a representation of the target. Nonetheless, constructing and analyzing a
model shares similarities with experimentation. Both modelers and experimenters often
perform their interventions in order to test expectations based on some hypothesis; and
like experimentation, modeling can provide evidence for or against hypotheses about
real-world systems.
For example, animal models like Drosophila melanogaster are used to indirectly test
expectations about the genetic and molecular mechanisms of human disorders, like
Parkinson’s disease and diabetes. And interventions were made on certain features of
the Bay Model to test expectations about the consequences the Reber Plan would
have for the real San Francisco Bay. The iterated prisoner’s dilemma has been studied
to test expectations about the conditions that enable cooperative behavior to emerge
among self-interested individuals. Each of these uses of models is a way to indirectly
test scientists’ hypotheses about real-world systems. And in some cases, the results were
quite surprising.
So, models can play a role similar to experiments. One big difference is that, with
experiments, interventions are performed directly on the experimental system, whereas
with models, interventions on the model are used to draw conclusions about the target sys-
tem. This is why models must aptly represent their targets. As we have seen, the work of
modeling also includes gaining a better understanding of the phenomena under investi-
gation, and then constructing models to reflect that understanding, so the models more
accurately reflect the phenomena.
Indeed, sometimes getting a model to more accurately reflect its target is the primary
task of modeling. In such cases, a model of some phenomenon can play a role similar to
a theory; a model can be a way to capture a set of ideas about what that phenomenon
is really like. When a model is proposed as a theory about what some phenomenon is
like, data gathered about the phenomenon, and perhaps about the model, can be used
as evidence to confirm or disconfirm that theory. An example of such a theoretical
use of modeling is the Lotka-Volterra model of predator-prey interactions. Given an
initial setting of parameters in the equations, one can make predictions about changes
in the sizes of a given predator population and the population of its prey, say, polar
bears and seals. Those predictions can then be tested against observations of the actual
predator-prey system—polar bears and seals living in the same broad area. When a
model behaves similarly to the expected target system(s) in more and more instances
and across different circumstances, it may become accepted as an account of how the
target behaves.
So, models can play an experimental role by providing a way to empirically investigate
a phenomenon. Or they can play a theoretical role, by positing an account of some phe-
nomenon. Sometimes the same model can even play both a theoretical and experimental
role. In Axelrod’s tournament, a computer simulation was used as a virtual environment
to test which strategies would perform best in an iterated prisoner’s dilemma game. There
was no expectation that Tit-for-Tat would succeed in the competition. However, the
outcome accorded with an existing theory in evolutionary biology, called reciprocal altru-
ism. The basic idea is that it can be evolutionarily advantageous for an organism to help
another at some cost to itself if there is a chance the favor will be returned in the future.
The success of Tit-for-Tat was consistent with this theory, for it was based on reciproc-
ity. It paid off for Tit-for-Tat to cooperate with others—but only when the others were
cooperating too. Thus, the success of Tit-for-Tat in Axelrod’s computer tournament was
taken to confirm the idea in biological theory that natural selection can favor cooperative
behavior, even when it has a cost.

Three Features All Models Share


As this chapter has made clear, there are many different kinds of scientific models, which
are used to do lots of different kinds of things. These include data models, many varieties
of models of phenomena in more concrete and more abstract forms, models used in some-
thing like experimentation, and models used as a kind of theorizing (and a combination of
both). With all these differences, one might suppose there’s nothing to say about what all
scientific models have in common. While no one single definition perfectly characterizes
what scientific models are or what uses they have, they do share at least three salient
features. We have already encountered all three features in this chapter. Here, we make
them explicit and discuss each in a bit more depth.
First, all models play some representational role, that is, they are in some sense about
their target system. This is what qualifies them as representing their target. In playing a
representational role, models represent, or stand for, something else.
It’s something of a challenge to say exactly what’s required for a model to represent a
target, but some basic components are more or less agreed upon. The model must be like
the target in the right ways, where this likeness might be understood in terms of similarity
or even a much stronger mapping relationship like isomorphism, which is minimally a
one-to-one correspondence between the features of the model and those of the target. But as
we have seen, models typically aren’t exactly like the target systems they represent. They
are often dissimilar from their targets in important ways—recall the Phillips machine, a
hydraulic apparatus that looks nothing like the British economy. Something’s needed to
overcome that gap—the differences between the model and target—in a way that enables
the model to nonetheless be about the target.
It’s increasingly believed that what fills that gap is social convention—that is, scientists’
shared practices in using and interpreting their models. Think of it this way. Scientists
intend for models to be similar to and different from their targets in certain ways. Beyond
the actual similarities and differences, it’s those intentions that set up models to relate to
their targets. Social conventions in modeling allow these intentions to be conveyed and
shared. Social conventions enable modelers to see what similarities and differences they
should expect between a model and a system, which in turn governs how the model
should be interpreted and properly used. For example, cartographers (mapmakers) need
to communicate what features of the territory their maps aim to get right. We have been
told not to trust customary maps of the Earth about the shapes and sizes of the oceans,
for example, as those are distorted in order to picture the landmasses more accurately.
These kinds of social conventions are essential to our ability to use any map in an effec-
tive way and to know which map to use for which purpose. The same goes for scientific
models. Social conventions in model construction and use help scientists understand how
a model is supposed to relate to one or more target systems; the similarities between
model and target aren’t enough by themselves.
We should also note that not all models have targets that actually exist. The Bay Model
was used to represent the Reber Plan, which, thankfully, was never implemented. The
Schelling segregation model represents how only a preference for not being too much in
the minority among your neighbors can lead to segregation, but, as we have mentioned, this
doesn’t mean that such a preference is in fact solely responsible for segregation. (It isn’t.)
And some scientific models aim to explore possibilities that are even more distantly related


to real occurrences. Regardless, those models are used to represent scenarios of scientific
interest, and the knowledge gained from them concerns natural phenomena.
Second, all scientific models are used to learn about the world. Data models represent
data in forms that advance hypothesis-testing. By constructing and investigating models
of phenomena, scientists can reason about the targets they represent in hopes of gaining
new scientific knowledge. In both cases, the models are used as vehicles for learning about
natural phenomena investigated in science.
Third, all scientific models involve abstraction and idealization. Recall that models
bear not only similarities to their targets but also differences from them. The differences
come in at least two varieties: abstraction and idealization, which are not always easy to
distinguish neatly. Roughly, in representing a target system, you may leave things out, or
you may introduce features that the system clearly does not possess. Omitting or ignoring
certain known features of the system is abstraction; including features the target system
doesn’t have is idealization.
Abstraction and idealization serve different goals. Modelers often disregard many prop-
erties of their targets to focus on a limited set of features deemed important for the
purposes at hand. The Lotka-Volterra model, for example, abstracts away from proper-
ties of prey and predators, like their speed; their size; their capacity for camouflage; their
particular senses of smell, sight, and hearing; their location; and much else. Those features
aren’t essential to how predator-prey interactions influence population size and so have
been abstracted, or removed, from the model.
Like abstractions, idealizations are a way of simplifying a model, enabling scientists to
focus on the bare essentials of the phenomenon they’re interested in, without getting lost
in complicating details. But whereas abstraction involves leaving features of the target out
of the model, idealizations are properties of the model that the target doesn’t actually
have. We encountered the concept of an idealization earlier, when the Lotka-Volterra
model was first introduced. There we defined idealizations as assumptions made without
regard for whether they are true and generally with full knowledge they are false. In mod-
eling, this results in the misrepresentation of certain aspects of the system being studied.
For the Lotka-Volterra model, idealizations include the assumptions that prey can find
food at all times, that predators are hungry at all times, and that both predators and prey
are moving randomly through a homogeneous environment. Scientists don’t think these
assumptions are true. But, in many situations, the falseness of these idealizations doesn’t
interfere with the Lotka-Volterra model’s representation of the predator-prey dynamics.
To recap, the three features shared by scientific models are (1) they represent one or
more targets; (2) they are used to learn about natural phenomena under investigation
in science; and (3) they involve abstraction and idealization. These last two features are
also related to models’ representational purpose. Abstraction and idealization are features
of models that affect how they represent their targets, and the ways models represent
their targets partly determine what can be learned. Representation is, then, at the heart
of scientific modeling.

What Makes a Model Good?


A target system can be represented in many different ways. A physical model of a hydro-
logical system, like the Bay Model, represents water flow in ways that significantly differ


from how a mathematical model of fluid dynamics does. And both are different from
the computer model that eventually took over the work of the Bay Model. There’s no
one perfect model of a given phenomenon. Instead, the goodness of a model is judged by
considering what the modelers want to learn from and do with the model and, perhaps,
the ease of developing or using the model. Sometimes one model will be enough for
learning about a target system; other times, multiple models of the same target will be
necessary to gain knowledge.
Several features are desirable for models to have. These include accuracy (a model
realistically representing its target), generality (applying to a range of related target sys-
tems), precision (providing exact information), tractability (ease of use), and robustness
(stable behavior across different assumptions). Each of these features helps make a model
valuable. And each of these features comes in degrees. A model isn’t simply general or not,
or precise or imprecise; instead, models vary in the extent of their accuracy, generality,
precision, tractability, and robustness.
Attempting to create the perfect model by maximizing all of these features is futile,
since these features usually trade off against one another; gaining more of some desir-
able feature of a model often requires losing ground on some other desirable features.
For example, a model that is more general, applying to more target systems, is also often
less precise and accurate with respect to any one target system. This is because targets differ from
one another in some regards, so tailoring a model to be precise and accurate of a specific
target makes it ill-suited to represent a different system. For related reasons, a model
that is more precise and accurate is often less tractable and robust. So, when construct-
ing models, scientists must decide which desirable features to emphasize and which to
compromise on. In the rest of this section, we elaborate on how the desirable features
of models trade off against one another. (See Levins, 1966, on the issue of trade-offs in
model-building in population biology.)

Accuracy
Models representing more actual features of a target system tend to be more descriptively
accurate, or realistic. A model representing all and only the actual components and fea-
tures of its target, just as the target actually has them, would be a model that is maximally accurate.
But this ideal is seldom achieved, and it’s unnecessary for practical success; recall that
models are improved by some differences from their targets. For example, a mathemati-
cal model of drought-resistant landscaping is improved by accurately accounting for how
water-intensive different plantings are. But such a model would be unwieldy if it included
a parameter for the number of blades of grass in order to be more accurate. Even if such
parameters increased the model’s accuracy, this wouldn’t give any additional insight into
drought resistance. And it would come at a tremendous cost to tractability and generality.
Each time you had a different number of blades of grass, the model would work differently.
However, for mathematical models of which kinds of turf are the most water-intensive,
it may be entirely relevant to know how many blades of grass there are per square meter
of sod or (perhaps) differences in the water absorption rates.
So, which features are important for models to represent will depend on which
phenomena modelers are interested in. Think of the Bay Model again. The engineers
cared about salinity and how water moved in the bay but not about the color of the


bay floor or the exact number of water molecules. The features worth modeling accu-
rately are the most relevant features for the modelers’ interests. Models benefit from
accuracy because it increases their similarity to their targets, which in turn makes
findings about the model more certain to hold of the target as well. However, some
properties of a target are best excluded from a model because their exclusion has
compensatory benefits.

Generality
A model is more general when it applies to a greater number of target systems.
Generality is a desirable feature of models insofar as it enables models to be reused
in a variety of circumstances and, more significantly, because general models make it
possible for scientists to discern what a variety of phenomena have in common with
one another. This is a step toward formulating general theories or laws about phe-
nomena of interest.
Consider the prisoner’s dilemma model again. Because it can apply to humans, bacteria,
corporations, and many other entities, this is a general model with numerous applications.
That generality also reveals something which all those types of entities have in com-
mon: repeated interactions can enable cooperation to spontaneously emerge. However,
sacrificing some generality in a model can be worthwhile, if doing so enables the model
to more accurately represent its target. A general prisoner’s dilemma model might be
supplemented with information about, say, how natural selection favors bacteria that can
coexist in close proximity to one another (a form of cooperation). The resulting model
will give more insight into bacterial cooperation in virtue of this additional detail. But
it also will be less general—it will no longer apply to humans or corporations. Which is
better depends on the modelers’ aims.

Precision
A model is more precise to the extent that it more finely specifies features of the target.
For example, a climate model that allows scientists to predict how much warmer the
global average temperature will be in 30 years within a range of ±1° Celsius is more precise
than a model that allows them to predict a ±5° Celsius range of temperature increase over
the same 30 years. Notice that precision is different from accuracy. Whereas accuracy is a matter
of a given value’s proximity to the true value, precision is a matter of the proximity of
values in a range. Think again of an archer loosing arrows at a target. Arrows that are
scattered all around the bull’s-eye but very near to it are accurate but imprecise. Arrows
that are tightly clustered together but off-center, away from the bull’s-eye, are precise
but inaccurate. See Figure 3.9 for an illustration of this.
Consequently, a model could be very precise but still inaccurate. For example, the
prediction enabled by the more precise climate model might turn out to be wrong.
Greater precision benefits a model by enabling it to give a more specific characterization
of its target and to make more specific predictions about that target. But increasing preci-
sion usually comes at the cost of a model’s generality, its tractability, and sometimes its
accuracy. Like generality, precision often trades off against accuracy. The more specific a
prediction is, the easier it is for that prediction to be incorrect.
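The distinction can also be put numerically. In the sketch below, which is our own illustration with made-up numbers, accuracy shows up as low bias (the mean of the shots sits near the bull’s-eye) and precision as low spread.

```python
# Illustrating accuracy versus precision with the archery analogy. The
# 'shots' are made-up one-dimensional measurements; the bull's-eye is 0.
import statistics

def bias_and_spread(shots, target=0.0):
    bias = abs(statistics.mean(shots) - target)  # low bias   -> accurate
    spread = statistics.stdev(shots)             # low spread -> precise
    return bias, spread

precise_but_inaccurate = [4.9, 5.0, 5.1, 5.0]    # tight cluster, off-center
accurate_but_imprecise = [-3.0, 2.5, 3.2, -2.4]  # centered, but scattered

print(bias_and_spread(precise_but_inaccurate))   # large bias, small spread
print(bias_and_spread(accurate_but_imprecise))   # small bias, large spread
```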


FIGURE 3.9 Accuracy versus precision

Tractability
Tractability is the ease of developing and using a model. This could involve different con-
siderations, for example, the time it takes to run a model on a computer, or whether the
equations of a mathematical model have exact solutions. It could even involve whether a
modeler happens to already be familiar with one approach but not another. More tractable
models are easier to construct, manipulate, or analyze.
Consider, for example, that the iterated prisoner’s dilemma involves agents having repeated
encounters, and so this model is less tractable than the original prisoner’s dilemma. One


consequence of this decreased tractability is that scientists know exactly what the pos-
sible outcomes are for the original prisoner’s dilemma, but they cannot in general predict
the outcomes for its iterated version. This is why Axelrod ran a computer tournament to
explore some of the possible outcomes. For obvious reasons, tractability is never maxi-
mized though. The easiest thing to accomplish is usually nothing at all. And more com-
plicated models can result in more accurate, precise, and insightful findings. The iterated
prisoner’s dilemma reveals how repeat encounters (in certain circumstances) can over-
come the dilemma entirely, making cooperation directly beneficial.

Robustness
A more robust model is one that changes less despite variation in its assumptions.
Consequently, robustness is a measure of insensitivity to the features that differ from
the target, including the model’s abstractions and idealizations. Normally, scientists don’t
want their models’ predictions to be sensitive to such features. To be trustworthy, the
predictions should be based as much as possible on known similarities between the model
and target. But limited robustness is inevitable. Models incorporate assumptions that are
needed for them to produce the desired information, so to some extent, those assump-
tions always matter. What scientists aim to avoid is over-reliance on specific assumptions
that are unlikely to be true or even known to be false.
Multiple models are sometimes used to determine how robust a model’s predictions
are. If different models, with different assumptions and details, all predict roughly the
same result, that prediction seems more trustworthy than if it had been generated by
just one model, with uncertain assumptions and parameters. Robustness analysis, which
was introduced in Section 3.1, capitalizes on this idea. Robustness analysis is possible
whenever multiple models are employed; it’s common in climate science, for example.
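As a toy illustration of this idea, the sketch below runs a bare-bones version of the Lotka-Volterra simulation from earlier in the chapter while varying one uncertain parameter, the predation rate. The parameter range, like every other numerical value, is an illustrative assumption.

```python
# A toy robustness check: vary the uncertain predation rate beta and ask
# whether the qualitative prediction 'the prey population never dies out'
# survives. All numerical values are illustrative.

def min_prey_level(beta, prey=10.0, pred=2.0, alpha=1.1, delta=0.1,
                   gamma=0.4, dt=0.01, steps=5000):
    """Euler-simulate the Lotka-Volterra equations and return the lowest
    prey level reached along the way."""
    lowest = prey
    for _ in range(steps):
        prey, pred = (prey + (alpha * prey - beta * prey * pred) * dt,
                      pred + (delta * prey * pred - gamma * pred) * dt)
        lowest = min(lowest, prey)
    return lowest

for beta in (0.3, 0.4, 0.5):
    print(f"beta={beta}: minimum prey level {min_prey_level(beta):.2f}")
# If the prey persist under every setting, that prediction is robust to
# this assumption; if it flips for some settings, it is fragile.
```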

Trade-Offs in Model Building


In considering these valuable features of models, notice how and why each is limited.
Increasing one feature often comes at the cost of others; this is called a trade-off. There
is no single answer to how a model should best approach these desirable features, nor is
there a perfect trade-off among the features. Instead, scientists strategically develop their
models to be tractable enough for their current circumstances; robust enough to be trusted
to a reasonable degree; accurate and precise enough to make interesting, trustworthy
predictions; and general enough to be enlightening. The balance struck thus depends in
subtle ways on the phenomena under investigation, the scientists’ circumstances, and the
purposes to which the models are put.

EXERCISES

3.18 Describe the experimental use of models, and explain why models are well situated to
play this role. Then describe the theoretical use of models, and explain why models are
well situated to play that role. Can the same models play both roles? Why, or why not?


3.19 Think again about the use of the Bay Model in testing the Reber Plan. This bears
some similarity to an experiment, but it is conjectural in a way that directly experi-
menting on the actual San Francisco Bay would not be.
a. Characterize the experimental features of this use of the Bay Model: the inde-
pendent variable, the dependent variable, how the independent variable was
intervened upon, and what the findings were.
b. Describe at least one way in which the findings are less certain in their implica-
tions for the effects of the Reber Plan than an actual experiment would have been.
c. Describe at least three advantages to using this model instead of directly inves-
tigating the effects of the Reber Plan. You might consider the desirable features
of models described earlier in formulating your response.
3.20 What are the three features that we have said all models share? How do these three
features relate to one another?
3.21 In a paragraph, describe how models represent their targets. You should reference all of
the following: similarities, differences, social conventions, abstractions, and idealizations.
3.22 Define abstraction and idealization in your own words. What is the difference
between them?
3.23 Choose one of the models we have discussed in this chapter. Say which model
you’ll focus on and what target system(s) it represents. Then, formulate a list of the
abstractions involved in using that model to represent this system and a separate
list of the idealizations involved in using that model to represent this system. You’ll
need to think beyond what’s actually said about the model, considering especially
the differences between the model and its target(s).
3.24 Describe in your own words all five of the desirable features of models character-
ized in the last part of this section. Then, compare the classic game theory math-
ematical model of the prisoner’s dilemma and the computer model of the iterated
prisoner’s dilemma on each feature. For each feature, write down whether you think
one model is better and which one, if you think the two models tie, or if you don’t
have enough information to decide. In all cases, explain your answer.
3.25 Consider your answer to 3.24. Describe a purpose you think the classic game
theory model of the prisoner’s dilemma would serve better than the computer model
of the iterated prisoner’s dilemma. Then, describe a purpose you think the computer
model of the iterated prisoner’s dilemma would serve better than the classic game
theory model of the prisoner’s dilemma.
3.26 Scientists have constructed models of atoms, genetic lineages, economies, rational
decisions, traffic, forest fires, and climate change. Locate and investigate a scientific
model we have not discussed in this chapter.
a. Identify the type of model it is and what target system(s) it’s used to represent.
b. Describe how the elements of the model represent features of the target system(s).
c. Describe what scientists have learned about the target system(s) from the model.
d. Why is this model a helpful way for scientists to investigate this phenomenon?
In answering this question, think back to the challenges of experimentation
discussed in Chapter 2, the advantages of modeling discussed in 3.1, and the
desirable features of models discussed in 3.3.


FURTHER READING

For more on the use of models in science, see Weisberg, M. (2013). Simulation and similar-
ity: Using models to understand the world. Oxford: Oxford University Press.
For more on mechanistic models, see Glennan, S. (2005). Modeling mechanisms. Studies
in History and Philosophy of Biology and the Biomedical Sciences, 36, 443–464.
For a discussion of computer modeling and attention to climate change models, see Wins-
berg, E. (2010). Science in the age of computer simulation. Chicago: University of Chi-
cago Press.
For a more general discussion of computational methods in science, see Humphreys,
P. (2004). Extending ourselves: Computational science, empiricism, and scientific method.
Oxford: Oxford University Press.
For a classic treatment of scientific modeling, and especially models’ relationship to analo-
gies, see Hesse, M. (1963). Models and analogies in science. London: Sheed & Ward.
For more on how models represent target systems, see Giere, R. (2004). How models are
used to represent reality. Philosophy of Science, 71(Suppl.), S742–S752.
For an account of idealization and how it influences science, see Potochnik, A. (2017).
Idealization and the aims of science. Chicago: University of Chicago Press.
CHAPTER 4

Patterns of Inference

4.1 DEDUCTIVE REASONING

After reading this section, you should be able to do the following:

• Summarize how the minimum age of the universe is inferred
• Describe reasoning, inference, and argument and explain how they are involved in science
• Define deductive inference, validity, and soundness
• Recognize and assess common patterns of deductive inference
• Analyze whether a criticism of a scientific inference provides logical grounds to question that inference

How Old Is the Universe?


How old is the universe? One possible answer is that the universe is eternal. The ancient
Greek philosopher Aristotle (384–322 BCE) developed several arguments in support
of this conclusion. A plausible assumption is that everything that comes into existence
requires some underlying matter from which it comes. Aristotle then reasoned as fol-
lows. If the universe came into existence and is not eternal, then it came into existence
from some pre-existing material substratum. Now, either this material substratum is itself
eternal or it is not eternal. If the material substratum from which the universe came into
existence is not eternal, then the substratum must have come into existence from some
other pre-existing material substratum. But then the same reasoning can be applied to
this  other pre-existing substratum, which is either eternal or came into existence from
some other pre-existing substratum. Hence, if the universe is not eternal, an infinite regress
arises; the sequence of reasoning never terminates. Each purported material substratum
itself requires another substratum from which it comes. Aristotle concluded that matter
must be eternal and that the universe did not have any beginning.
From the early Middle Ages (roughly the 7th century) to the end of the Renaissance
(roughly the 16th century), scholars and theologians continued to engage with ques-
tions about the age of the universe. The structure of Aristotle’s reasoning was largely
kept, but the eternality of the universe was replaced by the eternality of God in order
to fit with various creation stories. The universe itself was often estimated to have come
into existence around 4,000 BCE (that is, 6,000 years ago). The estimate was derived


from arithmetical calculations based on genealogical records in various religious texts,
and remained prevalent through the 18th century. By the 19th century and early 20th
century, most scientists believed that the universe is eternal and unchanging—that is, in
a steady state.
In the 1920s, the American astronomer Edwin Hubble (1889–1953) made two discov-
eries that were inconsistent with that belief. Using a telescope with a 2.5-meter aperture at
Mount Wilson Observatory in Southern California, Hubble discovered evidence that the
universe is much larger than people previously thought and that the universe is expanding.

FIGURE 4.1 Edwin Hubble at Mt. Wilson Observatory
Pointing the telescope toward the Andromeda Nebula, Hubble saw stars similar to
those nearer to Earth, only dimmer. One of those was a Cepheid variable, a star whose
brightness as seen from the Earth changes periodically. Hubble knew of the relationship
between the period of time it takes a Cepheid’s brightness to change and the luminosity
of the star, which is the total amount of energy it emits in one second. Thus, from the
period of the Cepheid, Hubble could calculate its luminosity, thereby determining how
much brighter it was than the Sun.
Light travels at a constant speed of about 300,000 kilometers per second. Over the
course of a year, light travels nearly 9.5 million gigameters (a gigameter is one billion
meters); this distance is one light-year. Furthermore, the apparent brightness of a star—
that is, how bright a star appears to be as seen from a distance—depends on both the star’s
luminosity and its distance. Once this relationship is known, it can be used, along with knowledge of
the speed of light, to determine the distances to stars and faraway galaxies. Hubble did
just that: he used his knowledge of the relationships between light’s speed of travel, the
apparent brightness of a star, and its distance to calculate the Cepheid’s distance from
Earth. Based on the distance of that Cepheid variable, Hubble reasoned that Andromeda
was in fact a different galaxy from our galaxy, the Milky Way. This discovery, announced
in 1925, demonstrated that the universe is much larger than had been thought.
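Hubble’s distance reasoning can be illustrated with a short calculation. The sketch below uses the inverse-square law relating luminosity, apparent brightness, and distance; the luminosity and brightness values are hypothetical placeholders, not Hubble’s actual measurements.

```python
# A sketch of distance-from-brightness reasoning. Apparent brightness falls
# off with the square of distance, b = L / (4*pi*d**2), so measuring b and
# knowing L (from the Cepheid's period) lets you solve for d.
import math

SPEED_OF_LIGHT_M_S = 3.0e8
SECONDS_PER_YEAR = 3.156e7
LIGHT_YEAR_M = SPEED_OF_LIGHT_M_S * SECONDS_PER_YEAR  # ~9.5e15 m, i.e. the
# 'nearly 9.5 million gigameters' cited above

def distance_from_brightness(luminosity, brightness):
    """Invert the inverse-square law; inputs in watts and watts per m**2."""
    return math.sqrt(luminosity / (4 * math.pi * brightness))

d = distance_from_brightness(luminosity=1.2e31,    # hypothetical Cepheid
                             brightness=1.8e-14)   # hypothetical measurement
print(f"distance ≈ {d / LIGHT_YEAR_M:,.0f} light-years")  # ≈ 770,000
```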
Hubble also demonstrated that the universe has not always been this large. It’s expand-
ing. His reasoning started from the claim that light, like sound, will change its frequency
depending on the relative movement of the object emitting it and the observer. An
example is the change in frequency of an ambulance siren as it moves toward, and then
away, from an observer. The siren sounds higher pitched as it approaches, and then lower
pitched once it has passed. This frequency change, called the Doppler effect, was discov-
ered in the mid-19th century by the Austrian physicist Christian Doppler (1803–1853).
It has proven useful in a number of scientific investigations. For Hubble’s purposes, the
important implication was that a star moving away from Earth appears redder, while a
star moving toward Earth appears bluer. The degree of redness of receding stars is called
redshift.
Using the technique of astronomical spectroscopy, Hubble discovered that the redshift
of starlight from any galaxy increased in proportion to the galaxy’s distance from Earth.
This indicates that galaxies are moving further and further away from Earth. In 1929,
Hubble announced these findings, which suggest that the universe is expanding. This is
now known as ‘Hubble’s Law’. According to recent estimates, the universe’s expansion
rate, known as ‘Hubble’s constant’, is about 70 kilometers per second per megaparsec
(km/sec/Mpc), where 1 megaparsec (Mpc) is approximately 3.26 million light-years—an
extremely long distance!
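One way to see how an expansion rate bears on the universe’s age is to take the reciprocal of Hubble’s constant, which gives a rough expansion timescale known as the Hubble time. The calculation below is a back-of-the-envelope sketch; a full cosmological estimate also accounts for how the expansion rate has changed over time.

```python
# Back-of-the-envelope: the reciprocal of Hubble's constant, converted to
# years, gives a rough timescale for the expansion.
H0 = 70.0                   # km/s per megaparsec (the value cited above)
KM_PER_MPC = 3.086e19       # kilometers in one megaparsec
SECONDS_PER_YEAR = 3.156e7

hubble_time_years = (KM_PER_MPC / H0) / SECONDS_PER_YEAR
print(f"Hubble time ≈ {hubble_time_years / 1e9:.1f} billion years")  # ≈ 14.0
```

It is no coincidence that this lands near the 13.8-billion-year figure that comes up shortly.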
So, Hubble showed that the universe is not only much larger than previously estimated
but also expanding. But how do these findings bear on the question of the age of the
universe? The answer again concerns the relationship between time and the movement
of starlight through space. The simple fact that astronomers like Hubble can observe stars
from very distant galaxies indicates something about the age of those stars and thus about
the age of the universe containing them too. No star can be older than the universe. So,
we can estimate the minimum age of the universe on the basis of the age of the most


distant stars we can observe. In this way, Hubble was able to show that the universe was
at least 10 billion years old. Currently, the furthest objects that deep space telescopes
have detected are approximately 13.8 billion light-years away. Therefore, the universe
must be at least 13.8 billion years old. This finding has also been supported by convergent
evidence from sciences like cosmological physics and geochemistry.
The previous three chapters have focused in part on the importance of empirical
evidence in science. And indeed, empirical evidence is essential for developing scientific
knowledge. But for observations to lead to knowledge, scientists must assess their sig-
nificance and implications, and the relationships among them. In other words, scientific
knowledge comes not from mere observation, but from reasoning about observations.
Aristotle sought to establish that the universe is eternal by showing that the denial of
this would lead to an absurd infinite regress. Hubble combined empirical observations
with calculations of light’s travel over distances and through time to support a precise
estimate of the universe’s age. Hubble appealed to empirical evidence in ways Aristotle
did not, but both reasoned their way to conclusions.

Reasoning, Inference, and Argument


In this chapter, we focus explicitly on patterns of inference in scientific reasoning. This will
help us better see how reasoning is involved in the development of scientific knowledge
from a basis in empirical evidence.
Reasoning is a psychological process, which cognitive psychologists divide into the
operations of two cognitive systems: System 1 and System 2. As the Israeli-American
psychologist and Nobel Prize winner Daniel Kahneman characterizes them, System 1
operates automatically and quickly, with no conscious mental effort needed. System 2
operates slowly, engaging working memory and allocating attention to effortful mental
activities, accompanied by a sense of voluntary control (Kahneman, 2011).
Scientists, like everyone else, reason with both of these systems. There are famous
cases of creative ‘Eureka!’ moments in which a scientist suddenly grasped some solu-
tion or conclusion. Recall from Chapter 1, for example, Friedrich August Kekulé’s
fever dream of a snake biting its tail that led to the discovery of the ring-structure of
the benzene molecule. Nonetheless, as we have seen, science is a collaborative social
process of giving and taking reasons. It thus mainly engages System 2 processes. This
means that most scientific reasoning is slow and deliberative. Hubble labored with other
astronomers and assistants to collect and interpret data over many years. The argument
from this data set to the conclusion that the universe is expanding was also developed
and refined over time, with not just Hubble but many other astronomers contributing
(Kragh & Smith, 2003).
Scientific reasoning involves the application of broad reasoning skills to the concerns
and content of science: to greenhouse gases, light-years, molecules, ecosystems, and, as
in Kahneman’s work, even to reasoning processes themselves. We have already encoun-
tered many examples of scientific reasoning. These include, to name a few, reasoning
from large-scale carbon release during the last two centuries to the dramatic increase in
the average global temperature (Chapter 1); reasoning from the temperature of colored
lenses to the hypothesis that light colors vary in temperature (Chapter 2); and reasoning
from the results of modeling the San Francisco Bay to the rejection of the Reber Plan


(Chapter 3). Chapter 4 began by describing how scientists reasoned from the speed of
light and observation of distant astral bodies to the conclusion that the universe must be
at least 13.8 billion years old.
Deliberative scientific reasoning involves making and evaluating inferences, and
inferences are the backbone of any argument. An inference is a logical transition from
one thought to another that obeys abstract rules. Whereas reasoning, as we’ve char-
acterized it, is a psychological process, the features of inference are instead logical.
An argument is a set of statements (stated propositions) with inferential structure.
You might think of an argument as a set of instructions for performing inferences
to reason your way to some conclusion. This differs from the everyday use of the
word argument to mean bickering—a quarrel one might have with friends or family.
An important part of scientific work is reasoning from empirical evidence in ways
that involve logical inferences, and assembling arguments reflecting the structure of
those inferences.
Making inferences and assembling arguments requires being able to distinguish the
roles of premise and conclusion. The premises of an argument are statements that provide
rational support, the basis for inference. The conclusion of an argument is the statement
that is supported by the premises, the endpoint of an inference. For example, recall
Aristotle’s reasons for thinking that the universe is eternal. These can be reconstructed
into an argument as follows:

1. If the universe is not eternal, then the universe came into existence.
2. Everything that comes into existence requires some pre-existing material
substratum.
∴ 3. If the universe is not eternal, then some material substratum existed before the
universe came into existence.
4. It cannot be the case that some material substratum existed before the universe
came into existence.
∴ 5. The universe is eternal.

The argument is written as an ordered list of statements. The first four statements are
the premises of the argument; the argument’s conclusion is the last statement in the list.
Statements inferred from one or more premises are marked with the symbol ‘∴’, which
is notation symbolizing words like therefore, so, or hence. As this example shows, an argu-
ment may involve more than one inference. The inference to the third statement is made
from the first two premises, and the inference to the fifth statement—the argument’s
conclusion—is made from the third and fourth premises.
Scientific reasoning involves three main patterns of inference: deductive, inductive,
and abductive. An argument is a deductive argument when the relationship of its prem-
ises to its conclusion is purportedly one of necessitation: the premises should together
guarantee, or make necessary, the conclusion. Inductive and abductive inferences are
non-deductive; the premises do not guarantee the conclusion, but they still give reason
to infer the conclusion. Inductive and abductive reasoning play a more central role in
scientific reasoning than deductive reasoning. We discuss these patterns of inference in
Section 4.3, and they also relate to the main topics of Chapters 5–7. But for now, let’s
concentrate on deductive inference.


Conditional Statements
Statements of the form ‘if …, then …’ are crucial elements of inferential reasoning. These
if/then statements are called conditional statements because one circumstance is given
as a condition for another circumstance. As an intuitive example, imagine the parents
of a young child asking her to eat her vegetables in order to get dessert: ‘If you eat your
broccoli, then you can have dessert’. The child then knows the ticket to dessert—shovel-
ing down that broccoli!
The first circumstance, following the ‘if’, is called the antecedent. This is the condition
upon which the other circumstance is introduced. The second circumstance, following the
‘then’, is called the consequent. This is the condition that arises from or hinges upon the
introduction of the antecedent. The latter term is closely related to the word consequence,
and in the previous example, it is just that: getting dessert is a consequence the parents
commit to on the basis of the antecedent condition, eating the broccoli.
Antecedent means existing prior to, coming first in time, and also being logically prior.
But for conditional claims, only the last meaning is relevant. Nothing guarantees that
an antecedent will come before its consequent. For example, consider the conditional
statement, ‘If Piet is a dog, then Piet is an animal’. This is a true conditional, because
being an animal is a guaranteed consequence of being a dog. But unlike broccoli and
dessert, being a dog doesn’t come before being an animal. Instead, in this example, if the
antecedent is true, the consequent is simultaneously true. Time-ordering of the anteced-
ent and consequent can also be reversed. For example, ‘If you are hungry now, then you
must not have eaten enough dinner’. In this case, the consequent (not eating enough
dinner) happened before the antecedent (being hungry now). But the antecedent is
still logically prior: being hungry is the condition placed on not eating enough dinner.
A good way to think about the logical relationship between antecedents and conse-
quents is in terms of requirements and guarantees, or, more formally, in terms of necessary
and sufficient conditions. For a conditional statement to be true, the antecedent occurring
guarantees that the consequent also occurs. The antecedent is thus a sufficient condition
for the consequent. Consider again the conditional statement, ‘If Piet is a dog, then Piet
is an animal’. Piet’s being a dog guarantees that Piet is also an animal; being a dog is suf-
ficient for being an animal.
This doesn’t work in reverse. For a true conditional statement, the consequent occur-
ring doesn’t guarantee the antecedent will occur. Piet might be an animal but not a
dog; the consequent might be true but the antecedent false. Instead, the consequent is
a requirement, or a necessary condition, for the antecedent. Piet’s being an animal is a
requirement placed on Piet being a dog but no guarantee that he is one.

TABLE 4.1 Conditional statements

Standard Form   Name         Concept                                Condition Type
If A, …         Antecedent   Basis for a guarantee or requirement   Sufficient condition for C
… then C        Consequent   What is guaranteed or required         Necessary condition for A

Let’s consider a conditional statement important in Hubble’s reasoning about the
age of the universe. Recall that Hubble calculated the age of distant stars to be greater
than 10 billion years old, and he reasoned from this that the universe was at least of
that age. Put in the form of a conditional statement, the idea was ‘If a star is more than
10 billion years old, then the universe must be more than 10 billion years old’. This is the
claim that a sufficient condition for the universe to be 10 billion years old is that some
star in the universe is 10 billion years old. Put in reverse, if the universe weren’t that
old, it couldn’t contain any objects that old. The universe being 10 billion years old is
thus a requirement for any star to be that old. Notice, in contrast, that finding out the
universe is a certain age would not guarantee that any star is that old. It is possible that
the universe is, say, 15 billion years old and all stars are younger. The universe having a
given age is a necessary condition for there to be a star of that age, but it is not a
sufficient condition.

Box 4.1 Conditionals

Scientific inquiry as well as everyday reasoning often involves the formulation
and evaluation of conditional statements about the relationships among objects,
states, and events. For example, if water is heated to 120° Celsius, it boils. One
complication is that conditionals can be, and often are, expressed in various
non-standard forms. Instead of if A then C, one might say, equivalently (where A
is still the antecedent and C the consequent):

C if A
A only if C
A guarantees C
Without C, A is not the case
Not A unless C

And there are still further ways to express this same conditional relationship.
One approach to navigating these non-standard forms is to understand the
meanings of their parts. Suppose somebody states that you have an identical twin
only if you have a sibling. What must be the case for this statement to be true?
What about for it to be false? Without a sibling, you can’t have an identical twin
sibling; having a sibling is necessary for having a twin. Consulting Table 4.1,
you’ll see that a necessary condition is a consequent. So, this statement was the
same as saying that if you have an identical twin, then you have a sibling. If it
were possible to have an identical twin without having siblings, then the state-
ment—either in its original form or the if/then formulation—would be false. But
every other circumstance—having a twin sibling, having a sibling but no twin, or
having no twin and also no sibling—is consistent with this statement being true.
This might aid in navigating conditional claims disguised in other formulations.
There’s also a point here about when conditional claims are true. Put abstractly,
conditional statements are only false when the antecedent A is true while
the consequent C is false. If parents told their child that if she eats her broccoli
then she’ll get dessert, and she eats her broccoli, but then they withhold dessert,
they were lying. (Which isn’t a very nice thing to do to a little kid who wants
dessert!) This kind of conditional, the only kind we’ve discussed, is called a mate-
rial conditional.
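The truth conditions of the material conditional can also be tabulated mechanically. The following minimal sketch is our own illustration; the function name implies is just a label.

```python
# The material conditional as a truth function: 'if A then C' is false
# only when A is true and C is false.
def implies(a: bool, c: bool) -> bool:
    return (not a) or c

for a in (True, False):
    for c in (True, False):
        print(f"A={a!s:5} C={c!s:5}  if A then C: {implies(a, c)}")
# Only the row A=True, C=False comes out False -- exactly the one way a
# conditional can be false, as Box 4.1 explains.
```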

Evaluating Inferences
Scientific reasoning can be evaluated as good or bad based on the abstract rules and formal
properties of the inferences involved. The study of the rules and patterns of good and bad
inference is called logic. Logic is a subject that can, and does, fill many textbooks. We’ll
keep our discussion here as brief as possible, but some basic ideas of logic are important
for understanding successful scientific reasoning.
The evaluation of both deductive and non-deductive inferences focuses on two main
questions. First, are the premises sufficient to rationally support the conclusion? And sec-
ond, are those premises true? The first question assesses the logical relationship between
premises and conclusion, the grounds for inference. The second question assesses the
status of the inference’s premises themselves. Good inferences answer both questions
affirmatively: there is good reason to believe that all premises are true, and together,
those premises provide sufficiently good reason to infer that the conclusion is true. The
premises of a good inference should together provide a logically compelling reason for
thinking the conclusion either must be true (in deductive inference) or is likely to be
true (in inductive and abductive inference).
When the truth of the premises of a deductive inference guarantees the truth of the
conclusion, the inference has the property of being valid. This term has several differ-
ent meanings. In one non-technical use, it simply indicates something is reasonable or
understandable. In Chapter 2, we discussed the external and internal validity of experi-
ments; this is another meaning of validity. Here, in the context of deduction, validity
has a technical definition different from these other meanings. A deductive inference is
valid just when the truth of the premises logically guarantees, or necessitates, the truth
of the conclusion. In a deductively valid inference, it is impossible for the conclusion to
be false provided that the premises are true. To assess whether a deductive inference is
valid, first suppose all of its premises are true. You should imagine those premises are
the only things you know about the world. Then, ask yourself whether there is any pos-
sible way the conclusion could be false. If there is any way for the conclusion to be false
while the premises are true, say, by imagining strange things about the world, then the
inference is invalid. If not, if the truth of the premises alone guarantees the truth of the
conclusion, the inference is valid.
Any deductive inference is either valid or invalid. A valid deductive argument can-
not be made more valid, or rendered invalid, by adding more premises. This property
of deductive reasoning is called monotonicity. Reasoning is monotonic if the addi-
tion of new information never invalidates an inference or forces the conclusion to be
retracted. For this reason, deductive arguments are rock-solid; you might be wrong
about a starting point—one or more of your premises might be false—but if you have


a valid inference, you can be absolutely certain that your premises (if true) guarantee
your conclusion.
Some patterns of deductive inference are common enough to have been given names.
For example, one of the most basic patterns of deduction is affirming the antecedent of
a conditional statement (also known by its Latin name modus ponens). This is when a
conditional statement and its antecedent are used as premises for concluding the conse-
quent must be true. For example,

1. If a star is more than 10 billion years old, then the universe must be more than
10 billion years old.
2. This star is more than 10 billion years old.
∴ 3. The universe must be more than 10 billion years old.

Another elementary form of reasoning is called denying the consequent of a conditional
(also known by its Latin name modus tollens). This is when a conditional statement and
the negation of its consequent are used as premises for concluding the antecedent must
be false. For example,

1. If the universe is in a steady state, then astral bodies remain the same distance
from one another.
2. It is not the case that astral bodies remain the same distance from one another.
∴ 3. It is not the case that the universe is in a steady state.

Each of the previous two arguments is deductively valid. The premises may not be true.
But if they were true, they would logically guarantee that the conclusion must also be
true. This holds for every other instance of these general patterns of inference. No matter
how long and deep you think, you will not be able to find an instance of either pattern
that is invalid.
Affirming the antecedent and denying the consequent, as general patterns of deductive
inference, can be expressed as follows. (‘It is not the case that’ can be indicated with the
negation sign ‘¬’.)

Affirming the antecedent:      Denying the consequent:

1. If A, then C                1. If A, then C
2. A                           2. ¬C
∴ 3. C                         ∴ 3. ¬A
Keep in mind that to have a valid argument, it is not enough to start with all true premises
and to have a true conclusion. Rather, the truth of the premises must force the conclu-
sion to be true; there must be no way around having a true conclusion (if the premises
are true). Consider another example:

1. Cats are mammals.
2. Tigers are mammals.
∴ 3. Tigers are cats.


Both premises are true: cats and tigers are kinds of mammals. The conclusion is true as
well: tigers are one kind of cat. But this is an invalid inference. Even though every state-
ment comprising it is true, the truth of the conclusion isn’t guaranteed by the truth of the
premises. To see this, substitute in ‘dogs’ for ‘cats’ in the argument. (Remember you can
do whatever you want, other than making a premise untrue, to try to make the conclusion
come out false. If you can accomplish this, the argument is invalid.) With this substitu-
tion, the two premises are still true, but the conclusion is not. The inference is invalid.
Here’s one more argument:

1. The Earth is 6,000 years old.
2. Buenos Aires is in South America.
∴ 3. The Earth is 6,000 years old and Buenos Aires is in South America.

This argument is valid. If both premises were true, then the conclusion must also be
true. There is no possible way for both premises to be true but the conclusion false. Of
course, the premises aren’t both true. Buenos Aires is in fact in South America; but the
age of the Earth is approximately 4.54 billion years. So, even though this is a valid argu-
ment, we don’t have good reason to believe the conclusion.
The previous two examples illustrate that valid arguments can have false premises and
conclusions and invalid arguments can have true premises and conclusions. The best deductive
inferences are those that combine both validity and truth. These inferences are sound. A sound
inference is a valid deductive inference with all true premises. Being valid rules out inferences
like the cats and tigers example, where the conclusion is only accidentally true. Having all
true premises rules out inferences like the Earth and Buenos Aires example, where the infer-
ence is valid but the conclusion is nonetheless false because one or more premises are false.
A sound deductive inference takes all the guesswork out of establishing proof for a claim.
If you know both that all the premises are true and that the inference is valid, then you
know that the conclusion must be true. No additional evidence or reasoning can change
that. If it does, then either you didn’t actually have a valid deductive inference, or you
didn’t actually know that all the premises are true. Thus, if scientists know some inference
is sound, they can be certain that the conclusion is true beyond a shadow of a doubt.

Uncovering Bad Arguments


Whether or not someone is persuaded by another’s reasoning is mainly a matter of human
psychology. People can fall for bad arguments, or they may not be persuaded by good ones.
But whether a deductive inference is good or bad is simply a matter of logic and truth. The
two main criticisms that can be made of a deductive argument are that (i) its premises are
false and that (ii) the conclusion isn’t validly inferred from the premises. When evaluating
a deductive argument, one should determine whether either or both of these criticisms
apply. And it is here that psychological reasoning and logical inference intersect. If you
think an argument is faulty on one or both of these grounds, you should consider whether
it can be repaired by replacing any false premises with true ones or whether additional
premises could be supplied such that there is a valid argument for the conclusion.
The valid inference patterns involving conditional statements discussed earlier—affirming
the antecedent and denying the consequent—have related invalid inference patterns


that result from confusing the roles of necessary and sufficient conditions in conditional
statements. Denying the antecedent occurs when a conditional statement and the negation
of its antecedent are used as premises for concluding that the consequent must be false
as well. Here is an argument that commits the error of denying the antecedent:

1. If a star is more than 15 billion years old, then the universe is more than 15 billion
years old.
2. No star is more than 15 billion years old.
∴ 3. It’s not the case that the universe is more than 15 billion years old.

This is an invalid argument. Even if the first two premises are true, that doesn’t guarantee
the conclusion is also true. As we have seen, the age of the oldest star is just a minimum
age for the universe. The conditional statement in the first premise reflects this, as the
consequent (the age of the universe) is a requirement for the antecedent (the age of the
oldest star). The antecedent guarantees the consequent but not the other way around. So,
denying the antecedent, as the second premise does, provides no good reason to believe
that the consequent is the case, but it doesn’t demonstrate that the consequent is not
the case either.
Affirming the consequent occurs when a conditional statement and its consequent are
used as premises for concluding that the antecedent must also be true. Here is an argu-
ment that commits the error of affirming the consequent:

1. If the Andromeda Nebula is 13.8 billion light-years away, then the universe is at
least 13.8 billion years old.
2. The universe is at least 13.8 billion years old.
∴ 3. The Andromeda Nebula is 13.8 billion light-years away.

This is also an invalid argument. Both premises are true, but they don’t guarantee the
truth of the conclusion. Some specific astral body that we can view from Earth being 13.8
billion light-years away does guarantee the universe is at least 13.8 billion years old, but
this is not required for the universe to be that old. The conclusion here is in fact false,
since Andromeda is only about 2.5 million light-years away.
Situations that you can describe, whether real or imagined, in which the premises of
an argument are true but the conclusion is false are called counterexamples to the argu-
ment. Counterexamples demonstrate that an argument or inference is invalid.
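
For simple propositional patterns like these, the search for counterexamples can even be
automated: enumerate every assignment of truth values to A and C and check whether all
the premises can be true while the conclusion is false. The following minimal Python
sketch does exactly that (the helper implies and the pattern encodings are our own,
purely for illustration):

from itertools import product

def implies(a, c):
    # Material conditional: 'if A then C' is false only when A is true and C is false.
    return (not a) or c

# Each pattern pairs its premises with its conclusion, as functions of A and C.
patterns = {
    "affirming the antecedent": (lambda a, c: [implies(a, c), a],     lambda a, c: c),
    "denying the antecedent":   (lambda a, c: [implies(a, c), not a], lambda a, c: not c),
    "affirming the consequent": (lambda a, c: [implies(a, c), c],     lambda a, c: a),
}

for name, (premises, conclusion) in patterns.items():
    counterexamples = [(a, c) for a, c in product([True, False], repeat=2)
                       if all(premises(a, c)) and not conclusion(a, c)]
    # A counterexample is an assignment making every premise true and the conclusion false.
    print(name, "->", counterexamples or "valid (no counterexample)")

Run on these three patterns, the script finds no counterexample for affirming the
antecedent, but it reports the assignment A false and C true for both invalid patterns,
which is exactly the situation in the two astronomy examples above.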

So far, the defects in reasoning we have seen are with the form of the inference.
But sometimes the problem with an inference is an empirical one, not a logical one.
Sometimes, even when an argument is valid, the world doesn’t cooperate with the state-
ments made about it. This is one place where the detective work of science often comes
in. Consider, for example, the following argument about atoms (recall also that the word
atom means indivisible, from the Greek a- + temnein, meaning not + to cut).

1. The word atom means indivisible.
2. If the word atom means indivisible, then atoms are indivisible.
3. If atoms are indivisible, then atoms are the smallest type of matter.
∴ 4. Atoms are the smallest type of matter.

This is a valid argument, which involves affirming the antecedent—a valid inference
pattern. Given premises 1 and 2, it follows that atoms are indivisible; and from the
conjunction of that claim with premise 3, it follows that atoms are the smallest type of
matter. The problem, of course, is that scientists discovered particles smaller than atoms
over a century ago. Electrons were discovered in 1897, followed by the subsequent dis-
coveries of protons, neutrons, neutrinos, positrons, muons, bosons, and hadrons, which are
all smaller than atoms. These discoveries show the conclusion to be false: atoms are not
the smallest type of matter. So, the argument is not sound. Because the argument is valid,
learning that the conclusion is false also tells us something about the premises: at least
one of the three premises is also false. Can you figure out which is to blame?
We have seen that arguments can falter because of a defect in the form of the inference
or because they accidentally contain a false premise. In other cases, the defect in
reasoning owes to an informal fallacy, a faulty inference pattern where the defect lies
with the inference’s content rather than its form, and which goes beyond merely having
false premises. Unfortunately, there is no fully unified theory of informal fallacies,
nor any universally agreed-upon definition (Walton, 1989/2008); and there are hundreds
of such fallacious patterns. Here are a few that are unfortunately common in debates
about science.
The strawman fallacy involves caricaturing someone’s thoughts in order to criticize
the caricature rather than the actual thoughts. Here is an example:

1. Evolutionary theory claims that humans recently evolved from monkeys.
2. The idea that humans recently evolved from monkeys is clearly wrong.
∴ 3. Evolutionary theory is clearly wrong.

This argument seems to be an instance of affirming the antecedent, which is a valid
inference pattern. But the argument misrepresents evolutionary theory, so premise
1 is false. (Evolutionary theory instead posits, among other things, that humans and
apes share a common ancestor several million years ago.) This is an instance of the
strawman fallacy because evolutionary theory is misrepresented in order to claim it
is clearly wrong. The complexity of many scientific theories makes them easy targets
for the strawman fallacy.
Another common error in reasoning about science is called appeal to irrelevant
authority. For example, the pseudoscientific pronouncements of scientologists—a waning
religious cult from the 1950s—often appeal to L. R. Hubbard’s book Dianetics. Hubbard,
however, had no expertise in any academic subject whatsoever. Appeals to his book are
poor grounds for scientific conclusions about well-being, mind, or the cosmos. It’s some-
times difficult to assess whether some authority is legitimate. For example, sometimes
genuine experts in one scientific field make pronouncements about other fields in which
they have no authority. Uncovering appeals to irrelevant authority thus can require careful
analysis of credibility. This relates to Chapter 1’s discussion of how politicians should not
be viewed as experts on climate change science and the broader issues about expertise
introduced there.
Finally, appeal to ignorance is another informal fallacy. Arguments that commit this
fallacy conclude that a certain statement is true because there is no evidence proving
that it is not true. For example,

1. There is no compelling evidence that the pyramids were not built by extrater-
restrial creatures.
∴ 2. The pyramids were built by extraterrestrial creatures.

Plainly, this is a bad inference. Indeed, there’s a slogan that ‘absence of evidence is not evi-
dence of absence’. In other words, not having evidence that something is true isn’t necessarily
reason to think it isn’t true. For this example, we can imagine things that might provide
evidence that the pyramids were built by extraterrestrial creatures, but it’s hard to even
imagine how we could provide evidence that they weren’t. More generally, a lack of empirical
evidence in support of some scientific claim is usually reason not (yet) to believe the claim
is true. But this is generally not grounds for declaring the claim false, for the lack of evidence
may say more about the limits of our scientific knowledge than how the world really is.
The fallacy of appealing to ignorance highlights three interesting features of reasoning.
First, it is generally easier to prove that something is the case than that it is not the case.
Perhaps it would be better to examine evidence for who did in fact build the pyramids
than to simply look for evidence that it wasn’t aliens. Second, the burden of proof, or
the obligation to provide evidence in support of a belief, generally lies with the person
who makes an assertion. So, if you assert that the pyramids were built by aliens or that
genetically modified foods are risky for human health, then you should be able to provide
evidence in support of your assertion when asked to do so. Third, the more extraordinary
a statement is, the more evidence it requires. When a chemist asserts that a solution must
be acidic because the litmus paper turned bright red, there is usually little need to ask
her how she knows that the color was red. Extraordinary claims, however, such as that
all life on Earth has evolved from a single common ancestor, require a lot of evidence.
The English naturalist, geologist, and biologist Charles Darwin (1809–1882) spent years
assembling evidence for his theory of evolution and common ancestry, and many scientists
following Darwin have added and improved upon that store of evidence.

Bad Reasons to Reject Inferences


Keeping a lookout for invalid inference patterns, false premises, and informal fallacies
can help uncover bad arguments. These, as well as the valid inference patterns we have
discussed, are summarized in Table 4.2. But these logical and empirical reasons to chal-
lenge some arguments should be carefully distinguished from the negative psychological
reactions some arguments can evoke. These reactions are usually not reasons to reject an
inference, but they may inhibit the recognition of sound reasoning.


Some scientific findings and inferences can be counterintuitive or difficult to under-
stand. But this, by itself, is not grounds for rejecting the finding or inference. A person with
limited background in evolutionary theory may find it difficult to imagine how humans
could have evolved ultimately from single-celled organisms. Similarly, without training in
physics and cosmology, it can be difficult to wrap your head around the universe being
over 13.8 billion years old and expanding out from an initial Big Bang. But evolutionary
theory and cosmological research, including the Big Bang theory, provide solid grounds
for accepting the truth of both of these bewildering claims. Just as a claim’s intuitiveness
is not a guide to whether it is true, an argument’s difficulty or complexity is irrelevant
to whether the inferential structure of the argument is any good.

Recipes for Science : An Introduction to Scientific Methods and Reasoning, Taylor & Francis Group, 2018. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/purdue/detail.action?docID=5584122.
Created from purdue on 2021-08-24 00:11:18.
138 Patterns of Inference

Likewise, whether someone finds the conclusion of an argument distasteful, offensive,
or disagreeable is irrelevant to whether that conclusion is true. The conclusion that global
warming is caused by human activity is politically inconvenient for friends of the fossil
fuel industry, including many politicians. This has motivated some of those individuals
to cast doubt on the finding, and they’ve been incredibly successful at creating public
doubt about climate change. These disinformation campaigns often have pointed to the
mere occurrence of disagreement as a reason for doubting climate change, as well as other
unwelcome scientific findings. But the evidence and structure of inferences supporting
anthropogenic climate change are incredibly strong.
Some people object to the idea that the universe is billions of years old, sometimes
suggesting this is ‘just an opinion’. Similarly, skeptics of evolutionary theory love to point
out that it is ‘just a theory’ that biological species have evolved from a common ances-
tor. These are also bad objections. Natural phenomena, and natural explanations of those
phenomena, are not simply a matter of opinion. And scientific theories are developed on
the basis of a tremendous amount of confirming evidence and careful inference. These
criticisms are not based on disagreements about evidence or the logic of arguments, but
instead appeal to the trivial fact that people have different ideas about some things. Ideas
that are supported by evidence and sound inference should be taken seriously.

EXERCISES

4.1 Define reasoning, inference, and argument, and describe how they are involved in
science (even though science is based on empirical evidence).
4.2 The following statements concern necessary and sufficient conditions. For each state-
ment, rephrase it in the form of a standard if/then conditional statement and say
whether it’s true or false.
1. Being a mammal is a sufficient condition for being human.
2. Being human is a sufficient condition for being an animal.
3. Being alive is a necessary condition for having a right to life.
4. Being alive is a sufficient condition for having a right to life.
5. Having a PhD is necessary if you want to be a scientist.
6. It’s sufficient for being awarded the Nobel Prize in immunology that one gener-
ates the cure for cancer.

4.3 Rephrase each of the following statements into standard conditional statements, and
then say whether they’re true or false.
1. P is a sufficient condition for Q if it is true that if P then Q.
2. It is true that if P then Q, but only if Q is a necessary condition for P.
3. It is true that P only if Q, but only if P is a sufficient condition for Q.
4. Not Q is a sufficient condition for P if it is true that P unless Q.
5. Something is a brother if and only if it is a male sibling. So, being a male sibling
is necessary for being a brother.
6. Something is a brother if and only if it is a male sibling. So, being a male sibling
is sufficient for being a brother.

4.4 Define deductive inference, validity, and soundness, and then answer the following
questions. Explain each answer.
a. Is every deductive argument valid?
b. Is every deductive argument sound?
c. Is every valid argument sound?
d. Is every sound argument valid?
4.5 Rewrite each of the following arguments in standard form, with numbered premises
and a conclusion. For each argument, say whether it is valid and whether it is
sound. Give reasons to justify each of your answers.
1. LeBron James must be mortal. After all, all humans are mortal, and LeBron
James is a human.
2. God is often characterized as the most perfect being. A perfect being must
have every trait or property that it would be better to have than not to have.
Since one of those properties is existence—that is, it is better to exist than not to
exist—then God exists.
3. The number 1 is a prime number, and 3 is a prime number. So too are 5 and
7. Therefore, all odd integers between 0 and 8 are prime numbers.
4. Real Madrid has won more than 17 games every year for the past 30 years.
So, you can safely bet Real Madrid will win more than 17 games this year.
5. The universe cannot be younger than 11 billion years old because the age of
the oldest known stars is 11 billion years old.
6. The term tachyon refers to a particle that travels faster than light. Therefore, it’s
not the case that nothing travels faster than light.
4.6 Come up with an example argument employing the inference pattern of affirming
the antecedent. Do the same for denying the antecedent, affirming the consequent,
and denying the consequent. For each argument, say whether it’s valid. For each
invalid argument, provide a counterexample. For each valid argument, say whether
it’s sound.
4.7 Describe the three informal fallacies outlined in this section. Give a new example of
each. Try to think of a real instance you’ve encountered, but if you can’t, it’s fine to
make up an example.
4.8 Review the passage about Hubble’s discoveries in the first part of this section. Sum-
marize the inferences that led Hubble to conclude that the universe is over 10 billion
years old.

4.9 Review the passage about Hubble’s discoveries in the first part of this section. Iden-
tify three conditional statements involved in Hubble’s inference that the universe is
over 10 billion years old. (These might not be written in the text in if-then form, and
some of the conditional claims involved in Hubble’s inference process might not
even be explicitly written out.) Write out the three statements in standard if-then form.
4.10 Review the passage about Hubble’s discoveries in the first part of this section. Sum-
marize the inferences Hubble made that led to the conclusion that the universe is
expanding. Then, put that argument into standard form, with numbered premises and
a conclusion. Are any premises needed for a valid deductive argument missing? If so,
add them, even if they weren’t explicitly stated in the description of Hubble’s reasoning.

4.11 Read the following passage, and try to understand the argument it makes.

Anybody who wants to repeat an experiment in modern subatomic physics has to
undergo many years of training. Only then will he or she be able to ask nature a
specific question through the experiment and to understand the answer. Similarly,
a deep mystical experience requires, generally, many years of training under an
experienced master and, as in the scientific training, the dedicated time does not
guarantee success. If the student is successful, however, he or she will be able to
‘repeat the experiment’. A mystical experience, therefore, is not any more unique
than a modern experiment in physics. On the other hand, it is not less sophisticated
either, although its sophistication is of a very different kind. The complexity and effi-
ciency of the physicist’s technical apparatus is matched, if not surpassed, by that of
the mystic’s consciousness—both physical and spiritual—in deep meditation. The
scientists and the mystics, then, have developed highly sophisticated methods of
observing nature which are inaccessible to the layperson. A page from a journal
of modern experimental physics will be as mysterious to the uninitiated as a Tibetan
mandala. Both are records of inquiries into the nature of the universe.
(Capra, 1975, pp. 35–36)

a. What’s the conclusion of the argument developed in this passage?
b. The passage draws an analogy between science and mysticism. What purpose
does the analogy play in the argument?
c. Assess the author’s reasoning. What are good points or inferences? What
weaknesses are there in the author’s reasoning?
d. Assess the author’s conclusion. Do you think the conclusion is right? Has the
author given adequate grounds for believing the conclusion?
4.12 Read the following passage, and try to understand the argument it makes.

An electron is no more (and no less) hypothetical than a star. Nowadays we count
electrons one by one in a Geiger counter, as we count the stars one by one on a
photographic plate. In what sense can an electron be called more unobservable
than a star? I am not sure whether I ought to say that I have seen an electron; but I
have just the same doubt whether I have seen a star. If I have seen one, I have seen
the other. I have seen a small disc of light surrounded by diffraction rings which has
not the least resemblance to what a star is supposed to be; but the name ‘star’ is
given to the object in the physical world which some hundreds of years ago started
a chain of causation which has resulted in this particular light-pattern. Similarly
in a Wilson expansion chamber I have seen a trail not in the least resembling what
an electron is supposed to be; but the name ‘electron’ is given to the object in the
physical world which has caused this trail to appear. How can it possibly be main-
tained that a hypothesis is introduced in one case and not in the other?
(Eddington, 1935/2012, p. 21)

a. What’s the conclusion of the argument developed in this passage?
b. The passage draws an analogy between electrons and stars. What purpose
does the analogy play in the argument?

c. Assess the author’s reasoning. What are good points or inferences? What
weaknesses are there in the author’s reasoning?
d. Assess the author’s conclusion. Do you think the conclusion is right? Has the
author given adequate grounds for believing the conclusion?

4.2 DEDUCTIVE REASONING IN HYPOTHESIS-TESTING

After reading this section, you should be able to do the following:

• Define hypothetico-deductive method
• Describe how an example of hypothesis-testing might be construed as an application
of the H-D method
• Describe how auxiliary assumptions complicate the H-D method
• Characterize the axiomatic method and indicate how it’s been used in science

The Hypothetico-Deductive Method


In Chapter 2, we learned that hypothesis-testing is a central part of experimental
research. Testing hypotheses requires at least two ingredients: empirical evidence and
rational inference. Empirical evidence is the primary source of justification for scien-
tists’ hypotheses about the world, but rational inference is needed to evaluate hypoth-
eses on the basis of the available evidence. One form that evaluation can take makes
key use of deductive inference. This has been described as the hypothetico-deductive
(H-D) method. The two parts of that name will be familiar at this point: hypothesis
and deduction.
In general, we have said that hypothesis-testing involves establishing expectations from
a hypothesis, and then comparing those expectations with observations. On the H-D
method, the expectations formulated on the basis of a hypothesis should be logically
implied by that hypothesis using deductive inference. Hence, if the hypothesis is true, the
expectation derived from it also must be true. Sound familiar? This is what’s required to
have a valid deductive argument.
That the truth of the hypothesis guarantees the truth of the expectations also means
there is a conditional statement with the hypothesis as the antecedent and the expecta-
tion as the consequent: ‘If H, then E’. If we’ve formulated the expectations properly, this
conditional statement will be true. We don’t yet know whether the hypothesis is true,
but we do know that if the hypothesis is true, then the expectation will be true. This
conditional statement can be thought of as an answer to the question: ‘If this hypothesis
is true, what must be the case about the world?’
After deductively inferring expectations from the hypothesis, scientists make observa-
tions, perhaps by conducting an experiment. Those observations are then compared with
the expectations. Here too the H-D method sees a role for deductive inference. If the
observation does not match the expectation, that is, if the expectation is not observed,
then this enables a deductive argument for the conclusion that the hypothesis is false.
The inference pattern is denying the consequent, which we’ve learned is always a valid
form of deductive inference:

Refutation

1. If H, then E
2. ¬E
∴ 3. ¬H

In this case, from the observations, we can deductively infer that the hypothesis is false.
In other words, the observations refute the hypothesis.
If instead the observations and expectations match, this enables the inference pattern
of affirming the consequent. Careful—that was an invalid form of deductive inference!
In this case, no deductive argument for or against the hypothesis is possible. A match
between expectations and observations is consistent with the truth of the hypothesis, but
it does not guarantee the truth of the hypothesis. If the evidence matches expectations,
the hypothesis is confirmed, but if not, it is refuted.

Confirmation

1. If H, then E
2. E
∴ 3. Probably or possibly H

Let’s work through a really simple example. Imagine the hypothesis is that all swans are
white. If it is true that all swans are white, then the swan you next observe will be white.
This is a true conditional claim: the antecedent guarantees the consequent. So, you go
out looking for swans, with the expectation that, if your hypothesis is true, you will see
a white one. Let’s say you instead encounter a black swan. This observation violates your
expectation; by denying the consequent, you’ve shown the antecedent (the hypothesis)
is false. Breaking news: it’s not the case that all swans are white! However, if the next
swan you see is white, then your observation matches the expectation. You haven’t proven
anything, but you do have a bit more evidence in favor of the hypothesis.
There is, then, a crucial difference between refutation and confirmation. Refutation is a
valid deductive argument that demonstrates the hypothesis is false. In contrast, confirma-
tion is not a deductively valid argument. The truth of the premises does not guarantee
the conclusion is true. The argument scheme for confirmation shown here reflects this by
concluding not H but ‘probably or possibly H’. An observation matching what a hypoth-
esis leads us to expect generally is taken to provide some evidence for the hypothesis.
But this isn’t always so, and it’s surprisingly tricky to articulate how this works. We will
return to this difficulty later in the chapter.
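
The asymmetry between refutation and confirmation can be made vivid by recasting the
swan test procedurally. Here is a toy Python sketch (the function and the observation
lists are our own, purely illustrative): a single failed expectation ends the test with
a deductively warranted refutation, while matching observations merely accumulate as
confirming evidence.

def test_all_swans_white(observed_colors):
    confirmations = 0
    for color in observed_colors:
        if color != "white":
            # Denying the consequent: one violated expectation refutes the hypothesis.
            return f"refuted by observation {confirmations + 1}"
        # A match only confirms; affirming the consequent would be invalid.
        confirmations += 1
    return f"not refuted; confirmed by {confirmations} observation(s)"

print(test_all_swans_white(["white", "white", "black"]))  # refuted by observation 3
print(test_all_swans_white(["white"] * 5))                # not refuted; confirmed by 5 observation(s)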

The Case of Puerperal Fever


A real instance of scientific reasoning famously used by the philosopher Carl Hempel
(1905–1997) to illustrate the H-D method is the story of Dr. Ignaz Semmelweis (Hempel,
1966). Semmelweis was a scientifically trained doctor working in the 1st Maternity
Division of the Vienna General Hospital in the 1840s, when many women delivering
babies there were contracting a serious and often fatal illness. The illness was known as
puerperal or childbed fever. (Puerperium refers to the postpartum period following labor
and delivery.) A puzzling observation was that the mortality rate in the 1st Maternity
Division was about three times higher than in the adjacent 2nd Maternity Division. These
rates are shown in Table 4.3.
Why was the rate of puerperal fever so much higher in the first clinic? An answer to
this question might provide some insight into how to decrease the incidence of puerperal
fever overall.
Semmelweis (1861) made several observations that seemed potentially relevant.
Women with dilation periods longer than 24 hours during delivery died of puerperal
fever much more often. He also observed that patients in the first clinic fell ill in a
sequential manner, one after another. The health of patients and the skill and care provided
by their caretakers did not seem related to the incidence of puerperal fever. Finally, not
only was the illness rate in the 2nd Maternity Division lower, but women who instead

TABLE 4.2 Valid inference patterns, invalid inference patterns, and informal fallacies

Some Valid Inference Patterns

Affirming the antecedent:          Denying the consequent:
1. If A, then C                    1. If A, then C
2. A                               2. ¬ C
∴ 3. C                             ∴ 3. ¬ A

Some Invalid Inference Patterns

Denying the antecedent:            Affirming the consequent:
1. If A, then C                    1. If A, then C
2. ¬ A                             2. C
∴ 3. ¬ C                           ∴ 3. A

Some Informal Fallacies

Strawman fallacy: caricaturing a position or argument in order to criticize the caricature
rather than the actual position

Appeal to irrelevant authority: appealing to the views of an individual who has no expertise
in a field as evidence for some view

Appeal to ignorance: concluding that a certain statement is true because there is no
evidence proving that it is not true

TABLE 4.3 Annual births, deaths, and mortality rates for all patients at the two clinics of
the Vienna maternity hospital, 1841–1846

                 First Clinic                     Second Clinic
Year        Births   Deaths   Rate (%)       Births   Deaths   Rate (%)
1841         3,036      237       7.70        2,442       86       3.50
1842         3,287      518      15.80        2,659      202       7.50
1843         3,060      274       8.90        2,739      164       5.90
1844         3,157      260       8.20        2,956       68       2.30
1845         3,492      241       6.80        3,241       66       2.00
1846         4,010      459      11.40        3,754      105       2.70
Total/Avg.  20,042    1,989       9.92       17,791      691       3.88

gave birth at home or elsewhere outside the clinic—even unattended on the street—were
unaffected by puerperal fever.
Semmelweis used these observations to rule out a number of proposed sources of the
illness. Puerperal fever wasn’t a city-wide epidemic. If it were, women who gave birth
outside the hospital would also suffer from the illness, but they didn’t. Nor was puerperal
fever triggered by psychological traumas during childbirth, like intense modesty from
being medically examined by male doctors (as had been proposed). If it were, surely
some women who gave birth in the streets would also experience puerperal fever, but
they didn’t. Most crucially, all proposed sources of the illness led to the expectation of
equal rates of the illness in the 1st and 2nd Maternity Wards. That expectation did not
match observations. So, reasoning in a way that is captured well by the H-D method of
refutation, Semmelweis rejected all these hypotheses about the cause of puerperal fever.
Semmelweis tried to develop hypotheses that were consistent with the observed differ-
ence in puerperal fever rates between the two maternity wards. One difference between
the wards was that the 1st Ward was staffed by male doctors and medical students, while
the 2nd Ward was staffed by female midwives. Women in the former gave birth on their
backs, women in the latter on their sides. Semmelweis changed procedures in the 1st
Ward so that all women there also gave birth on their sides. From the hypothesis that
giving birth on one’s back increases incidence of the illness, one can deductively infer the
expectation that changed birth position will decrease the incidence of the illness. Alas, this
expectation did not match Semmelweis’s observation: changing birth position in the 1st
Ward made no difference. Other hypotheses were similarly tested and similarly ruled out.
Then, at the end of March 1847, Semmelweis learned that his colleague Dr. Jakob
Kolletschka had died. Kolletschka was a professor of forensic medicine. He had been
performing an autopsy on a woman who had died from puerperal fever when a scalpel
had lacerated his finger. Kolletschka subsequently exhibited the same symptoms as the
mothers and infants who had died of puerperal fever. Semmelweis was distraught by
his friend’s death, but he also saw the value of this information for the investigation of
puerperal fever. He hypothesized that the scalpel had contaminated Kolletschka’s blood
with ‘cadaverous particles’, and this caused the puerperal fever that led to his death.
Semmelweis also realized that this was supported by the observation of the difference in
illness rates between the two wards: doctors and medical students performed autopsies,
whereas midwives did not.

FIGURE 4.2 Frieze at the Social Hygiene Museum in Budapest, honoring Ignaz Semmelweis

Semmelweis reasoned that if the hypothesis that cadaverous particles caused puerperal
fever were true, then the illness could be prevented by eliminating the cadaverous par-
ticles. To test this hypothesis, he required all students and midwives to thoroughly wash
their hands in a solution of chlorinated lime prior to examining patients. If this made no
difference, then cadaverous particles weren’t to blame, and this new hypothesis would also
be refuted. But, instead, the mortality from puerperal fever began to decrease, and the
incidence in the 1st Ward dropped to a similar level as in the 2nd Ward. Semmelweis’s
hypothesis was confirmed.
This is a good illustration of the H-D method and in particular the difference between
refutation and confirmation. Recall that, on the H-D account, refutation is decisive, as
it is the result of a valid deductive inference, whereas confirmation is weaker. It turns
out that Semmelweis’s confirmed hypothesis was wrong. Cadaverous material wasn’t
responsible for puerperal fever; it was a bacterial infection of the uterus. Luckily, chlo-
rinated lime is an antibacterial agent. Semmelweis thought the prescribed handwash-
ing worked because it removed cadaverous material, but instead, it worked because it
removed bacteria.
Some other important instances of hypothesis-testing are also well described by the
H-D method. Another example, which we encountered in Chapter 2, is the case of Arthur
Eddington’s confirmation of Einstein’s theory of relativity from the 1919 solar eclipse.
This was also a refutation of Newton’s cosmological theory. Einstein’s theory of general
relativity, as you may recall, implies that light will bend around a massive object like
the Sun. Newton’s theory also predicts light will bend because of gravity. However, the

Recipes for Science : An Introduction to Scientific Methods and Reasoning, Taylor & Francis Group, 2018. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/purdue/detail.action?docID=5584122.
Created from purdue on 2021-08-24 00:11:18.
146 Patterns of Inference

theory of general relativity implies that light will bend twice as much as the value pre-
dicted by Newtonian physics. Measuring how much light bends around the Sun allowed
Eddington to refute Newtonian physics and provide some confirmation of Einstein’s
theory of general relativity.

Auxiliary Assumptions
The H-D method seems to accurately capture something important about hypothesis-
testing in science, namely the distinctive power of refutation. Data that fit our expecta-
tions are well and good, but we can really learn something from data that contradict our
expectations. This also accords with the importance of hypotheses that are falsifiable, as
outlined in Chapter 1. The power of refutation is also what makes the idea of crucial
experiments compelling, as we discussed in Chapter 2 with the case of Newton’s prism
experiments. Yet, the H-D method also has its limitations. We’ll close this discussion by
describing one challenge to this account of hypothesis-testing; then, later in the chapter,
we will survey two powerful alternatives based on non-deductive patterns of inference.
The challenge to the H-D method is that the inference from a hypothesis to some
expectation is never truly deductive. Or, more precisely, additional claims are needed
in order to make a deductive inference from hypothesis to expectation valid. These
additional claims include background assumptions about how the world works, what in
Chapter 2 we called auxiliary assumptions. Lurking in the background of Semmelweis’s
inference about handwashing, for example, was the assumption that handwashing would
remove cadaverous material. Beyond Eddington’s refutation of Newtonian physics were
a number of assumptions about the behavior of instruments, the properties of light, the
location of certain astral bodies, and so on.
Such auxiliary assumptions often go unnoticed, either because they are assumed to
be true or, in some cases, simply because no one has noticed them. But because valid
deductive inference requires the premises to guarantee the conclusion, these auxiliary
assumptions are essential premises for the deductive inference from a hypothesis to some
empirical expectation, a key component of the H-D method. So, the schemes we identified
earlier for refutation and confirmation on the H-D account need to be adapted as follows:

Refutation                         Confirmation

1. If H and A, then E              1. If H and A, then E
2. ¬ E                             2. E
∴ 3. ¬ H                           ∴ 3. Probably or possibly H

In this new formulation, the letter A stands for statements of whatever auxiliary assump-
tions are required as additional premises to validly deduce E from H. Required auxiliary
assumptions may include background ideas about the phenomenon under investigation,
as well as assumptions about the reliability of experimental instruments and measure-
ment procedures.
Taking into account auxiliary assumptions in the H-D schemes more realistically cap-
tures the type of reasoning that underlies hypothesis-testing. But this also introduces a
new problem. The refutation scheme with ¬ H (that is, it’s not the case that H) as its
conclusion is no longer a valid deductive argument, because it is no longer an instance
of denying the consequent. To fit that pattern, the argument’s conclusion must instead be:

∴ 3′. ¬ (H and A)

This amounts to the statement that it’s not the case that both H and A are true. In
other words, taking into account auxiliary assumptions, all you can deductively con-
clude from observations not matching expectations is that either the hypothesis is
wrong or one or more auxiliary assumptions is wrong (or both). Because of the need
for auxiliary hypotheses, the H-D method can’t provide a deductive argument that
the hypothesis is false.
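
How little the refutation now pins down can be checked by brute force. The following
Python sketch (our own, purely illustrative) enumerates the truth-value combinations for
H and A that remain consistent with the premise ‘if H and A, then E’ once E has been
observed to be false:

from itertools import product

E = False  # the deduced expectation was not observed
for H, A in product([True, False], repeat=2):
    # Material conditional: 'if (H and A) then E'.
    premise_holds = (not (H and A)) or E
    if premise_holds:
        print(f"H={H}, A={A} remains logically possible")

Three of the four combinations survive; only H and A jointly true is excluded. The logic
alone cannot say whether the hypothesis or an auxiliary assumption is to blame.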
This problem is known as the Duhem-Quine problem, named after the French physi-
cist, mathematician, and philosopher of science Pierre Duhem (1861–1916) and the
American philosopher and logician Willard van Orman Quine (1908–2000). One upshot
of the Duhem-Quine problem is that deductive logic alone is insufficient for successful
hypothesis-testing. In the face of refutation, scientists need to decide whether to give up
on a hypothesis or to question one or more of their auxiliary assumptions. It seems there’s
an element of choice. A scientist may well want to hold on to a hypothesis she likes and
look for another explanation for why the observations didn’t turn out as expected.
The hope of reasonably deciding whether to reject a hypothesis or an auxiliary assump-
tion isn’t entirely destroyed by the Duhem-Quine problem. Scientists typically have inde-
pendent evidence for many of their auxiliary assumptions. Instruments and measurement
procedures have been tested and employed in other circumstances, and background beliefs
about a phenomenon are often based on evidence. These considerations can be used to
help scientists decide whether, and when, to reject the hypothesis under investigation. Yet
the need for auxiliary assumptions limits the power of the H-D method of hypothesis-
testing. The Duhem-Quine problem makes clear that, just like confirmation, refutation
is messier than simple deductive inference.

Axiomatic Methods
Deductive inference plays a different kind of role in some fields of science. Progress in
scientific reasoning is sometimes achieved through formal axiomatization, a constructive
procedure by which statements are derived from foundational principles. The founda-
tional principles, called axioms, are accepted as self-evident truths about some domain.
The axioms are then used to deductively infer other truths about the domain, called
theorems.
The most venerable example of axiomatization comes from the Greek mathematician
Euclid, who lived between the 4th and 3rd centuries BCE. Book I of Euclid’s Elements of
Geometry begins with 23 definitions and five axioms. The five axioms are the following:

1. A straight line may be drawn between any two points.
2. Any terminated straight line may be extended indefinitely.
3. A circle may be drawn with any given point as center and any given radius.
4. All right angles are equal.
5. If two straight lines in a plane are met by another line, and if the sum of the internal
   angles on one side is less than two right angles, then the straight lines will meet if
   extended sufficiently on the side on which the sum of the angles is less than two
   right angles.

Together, these five axioms form the premises of Euclidean geometry. From these prem-
ises, one can validly deduce theorems about the congruency of figures, parallel lines, and
other results of Euclidean geometry. In turn, these theorems can be treated as premises
in new arguments aimed at validly deducing new theorems.
Euclid’s axiomatization of geometry was accepted as decisive for almost two millennia. It
was a clear example of rigorous scientific reasoning grounded in first principles, with the power
to systematize all existing knowledge of geometry. It deeply influenced Ibn al-Haytham’s
work in optics and Newton’s physical theory of mechanics. Since the 19th century, however,
non-classical geometries have been developed that diverge from Euclid’s axiomatization. Just
as Euclid’s geometry was central to earlier physics and astronomy, these non-Euclidean geom-
etries paved the way for Einstein’s radical new theories of the relativity of space and time.
Einstein’s theory implies that the geometry of physical space itself is not, in general, Euclidean.
Another example of an important use of the axiomatic method concerns the founda-
tions of arithmetic. Concerned with questions about the exact nature of numbers, the Italian
mathematician Giuseppe Peano (1858–1932) employed axiomatic reasoning to give a rigor-
ous foundation for the natural numbers (0, 1, 2, 3, 4, …). Peano’s axiomatization of natural
numbers began with three primitive concepts, that is, concepts that were not defined in terms
of other concepts. Peano thought these primitive concepts were self-evident: the set of natural
numbers, N; the number zero, a member of the set N; and the successor function S. This
successor function can be applied to any natural number, and it will yield the next number
after it. For example, S(6) = 7. Likewise, S(0) = 1. From here, Peano laid down several axioms:

1. Zero is a number.
2. If n is a number, then S(n) is a number.
3. Zero is not the successor of a number.
4. Distinct natural numbers have distinct successors.
5. If 0 is an element in a set of numbers and the successor of every number is in that set,
then every number is in that set.

Given these axioms, the basic properties of natural numbers could be described and
theorems about them, including the arithmetic operations like addition and subtraction,
could be deduced. To take a simple example, the supposition that there is a number
preceding zero (S(k) = 0) would contradict axiom 3. Accordingly, the theorem that zero
has no predecessor in N can be derived from axiom 3.
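
Peano’s first two axioms translate almost directly into an inductive data structure, and
operations such as addition can then be defined by recursion on it. Here is a minimal
Python sketch of the idea (the class and function names are our own, purely illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class Zero:
    pass  # axiom 1: zero is a number

@dataclass(frozen=True)
class Succ:
    pred: object  # axiom 2: if n is a number, then S(n) is a number

def add(m, n):
    # Defined by recursion: m + 0 = m, and m + S(n) = S(m + n).
    if isinstance(n, Zero):
        return m
    return Succ(add(m, n.pred))

one = Succ(Zero())
two = Succ(one)
three = Succ(two)
print(add(two, one) == three)  # True: 2 + 1 = 3

Axiom 3 is mirrored by construction: no value built with Succ can equal Zero, so zero
has no predecessor in this representation.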

EXERCISES

4.13 Summarize the H-D method. How does this method relate to hypotheses? How does
it relate to deductive reasoning? What’s the crucial difference between refutation
and confirmation?

4.14 There are supposed to be two applications of deductive inference in each H-D refuta-
tion. (a) What are those two deductive inferences, and how is each related to how we
have characterized hypothesis-testing in general? (b) Define the Duhem-Quine prob-
lem. Which application(s) of deductive inference does this problem interfere with?
4.15 Return to the description of Semmelweis’s investigation of puerperal fever. Identify
three inferences that can be described as uses of the H-D method (either refutation
or confirmation). For each, write out the inference as an argument in standard form
with premises and conclusion.
4.16 After reading the passage below, identify the hypothesis under investigation. What would
the researchers expect to find if the hypothesis is true? Finally, list five important auxiliary
assumptions required for a deductive inference from the hypothesis to the expectations.

In an influential study published in 1979, the primatologist and psychologist
Woodruff & Premack examined whether chimpanzees can learn to deceive;
specifically, to act with the intention of causing a person to hold a false belief
about the location of food. In one of Woodruff & Premack’s tests, a chimpanzee
could see the location of food, which was placed in one of two containers.
However, because both containers were out of reach, the animal could only
obtain the food from a human. The human didn’t know where the food was, but
was instructed by the experimenters to search the container that the chimpanzee
seemed to be indicating through its orientation or by pointing. In some trials
of this experiment, the human was dressed in green and was cooperative: if
they found the food, they gave it to the chimps. In other trials, the human was
dressed in white and was competitive: if they found the food, they kept it for
themselves. Thus, write Woodruff and Premack, ‘the chimpanzee’s success in
procuring the goal depended upon his ability to convey accurate locational
information to a cooperative partner on the one hand, and suppress or convey
misleading information to a competitive individual on the other’.
(Woodruff & Premack, 1979, p. 335)

4.17 Read the passage from exercise 4.16 above. Woodruff and Premack (1979) found that
after 120 trials, each of the four chimpanzees they tested showed a reliable tendency to
indicate the container with food in the presence of a cooperative human and an empty
container in the presence of the competitive human. Say whether the hypothesis under
investigation is confirmed, refuted, or neither by this evidence. Justify your claim.

4.18 Imagine you want to estimate a rock’s age using the technique of radiometric dat-
ing. This technique allows scientists to estimate age from the known decay rate of
radioactive materials, given that traces of radioactive materials were incorporated
when the rock was originally formed. (a) What are some auxiliary assumptions
that you think are involved in a test like this? (b) You hypothesize that the rock is
3.8 billion years old, but the test results do not match your expectations. What
are some possible reasons that you could have gotten this result, even if the rock
is actually 3.8 billion years old? List at least three.
4.19 Describe the axiomatic method in your own words. How has this method been used
in science?

4.3 INDUCTIVE AND ABDUCTIVE REASONING

After reading this section, you should be able to do the following:

• Define inductive inference, indicating how it differs from deductive inference
• Characterize the problem of induction
• Define abductive inference, indicating how it differs from deductive and inductive
inference
• List the strengths and weaknesses of each of the three forms of inference: abductive,
deductive, and inductive

Flint’s Water Crisis


It is sometimes said, even if only metaphorically, that the conclusions of deductive argu-
ments are already contained in the premises. What this saying means is that the conclusion
doesn’t add any new content beyond what the premises provided. This is a consequence
of the premises’ truth guaranteeing the conclusion’s truth, the requirement for any valid
deductive argument. Deductive inferences are thus non-ampliative; the conclusion cannot
augment the content of the premises. The non-ampliative nature of deductive reasoning
limits the usefulness of deductive inference patterns. Scientific and everyday reasoning
are often not like this.
Consider the water contamination crisis in Flint, which is a city of approximately
100,000 people in Michigan (USA). About 40% of the population of Flint fall below the
poverty threshold, due in part to the downsizing of an automobile factory located there.
Flint had received its water supply from neighboring Detroit, but in 2014, major budget
deficits led the city council to change the city’s water supplier. The new supplier would
provide water from the Flint River. The problem was that this river had been badly pol-
luted for decades, and the environmental cleanup of its contaminants and toxic waste
was improperly performed.
Soon after the change in water source was made, bacteria and other contaminants were
detected in the water. Residents were instructed to boil any water before drinking it, and
water treatment changes were made. Levels of disinfectant by-products in violation of
the Safe Drinking Water Act were found, followed by buildup of what is now known to
be total trihalomethane (TTHM). Both of these have negative health effects, including
increasing the risk of cancer. But this wasn’t the worst of it.
Officials ignored federal environmental regulations that required treatment of the
water supply system with anticorrosive chemicals. As a result, lead pipes in the system
corroded, causing high levels of lead to leach into the water and soil. Lead is extremely
toxic and can have massively adverse health effects in everyone but especially in children.
Lead concentrations in children’s blood as low as five parts per billion (ppb) can result
in decreased intelligence and behavioral and learning deficits.
After anecdotal reports of pets and children becoming sick from the water supply,
frustrated residents carried jugs of brown tap water to community meetings in early
2015. City officials affirmed that the water was safe. When concerned residents began
demanding scientific testing of the water, this claim was shown to be false, and the
magnitude of the problem was revealed.

[Figure omitted. The original chart compares lead levels in parts per billion (ppb): a safe
level of 0, the WHO actionable standard of 10, the EPA actionable standard of 15, the
5,000 ppb threshold at which material is considered toxic waste, and the lowest (300),
average (2,000), and highest (13,000+) levels from the April 2015 sampling at Walters’s
home in Flint, Michigan, USA.]

FIGURE 4.3(a) Flint, Michigan water crisis, with numbers indicating parts per billion
(ppb); (b) Lee Anne Walters, the Flint citizen-scientist who initially requested water-testing

An initial test conducted by the Environmental Protection Agency (EPA) in March
2015 showed lead levels at 104 ppb in the water of one home—roughly seven times the legal
limit in water. Levels of some other chemicals, such as iron, couldn’t be specified because
they exceeded the measurement capabilities of the instruments. The next month, the lab
team of Virginia Tech Professor Marc Edwards arrived to conduct more extensive testing.
Lead levels at the same home at which the previous test had been conducted ranged,
depending on water flow rate, from 217 ppb to a staggering 13,200 ppb. (Anything over
15 ppb is over the legal limit for water. According to the EPA, anything higher than 5,000
ppb is hazardous waste.) Across the 269 homes that were eventually sampled, 40% had
elevated lead levels above the EPA’s recommended guidelines. The evidence suggested
that Flint’s water supply was toxic.
The Flint water crisis raises a number of interesting and difficult issues,
including questions about how science relates to the public. For now, notice the general
form of the inferences that were made in determining the problem, and the extent
of the problem, with Flint’s water. The aim was to know something about the water
quality across all houses in Flint, and the path to getting there started with testing
water quality in one family’s house. A conclusion about the water quality in all
houses in Flint couldn’t be truly deductive without exhaustive testing of every house,
which isn’t the best way to proceed in the face of a public health emergency. As in
much of scientific reasoning, the scientists investigating Flint’s water instead
proceeded inductively. Their conclusions were ampliative; they went beyond the
sampling results.

Inductive Inference
Imagine you go to the grocery store, hankering for some grapefruit. The grocer takes one
grapefruit from the top of one box, cuts it open, and offers you a slice to taste. It tastes
good! What you may not notice is that the grocer is tacitly expecting you to make the
following inference:

1. One grapefruit from this box is good.
∴ 2. All grapefruits from this box are good.

You draw three other grapefruits from the box at random. The three grapefruits look good,
like the grapefruit the grocer showed you. So, you buy that box of a dozen grapefruits.
What’s the inference you’re tacitly making?

1. The three grapefruits picked at random from this box are good.
∴ 2. The next nine grapefruits drawn from this box will also be good.

Neither of these is a valid deductive inference; the truth of the premises does not guarantee
the truth of the conclusions. Assuming the premises are true, the conclusions are at best
likely or probable. Accordingly, both inferences are inductive. An inference is inductive
when the inferential relationship from premises to conclusion purports to be one of
probability, not necessity. Even if the premise in each inference is true, the conclusion
may nonetheless be false. Perhaps not all grapefruits from the box are good, even if the
grapefruit the grocer showed you was good and even if you checked three other randomly
picked grapefruits from the box. For all you know, the rest of the grapefruits in the box
could be rotten.
Because the truth of the premises in inductive arguments does not guarantee the truth
of the conclusion, inductive inferences are always logically invalid. Nonetheless, reasoning
inductively is a primary form of making inferences in science and everyday life. Two com-
mon forms of inductive inference are generalizations and projections. Inductive generaliza-
tions are inferences to a general conclusion about the properties of a class of objects based
on the observation of some number of the objects in the class. If the conclusion applies
to all members of the class, the generalization is a universal inductive generalization. The
form of inductive generalizations is something like this:

Inductive generalization

1. O1, O2, O3, …, and On each has property P.
∴ 2. All Os have property P.

The grocer’s inference was like this; it went from the premise that one grapefruit from the
box is good to the conclusion that all grapefruits in the box are good. In contrast, inductive projections
(sometimes called next-case induction) are inferences to a conclusion about the feature of
some object that has not been observed based on the observation that some objects of
the same kind have that feature. The form of inductive projections is something like this:

Inductive projection

1. O1, O2, O3, …, and On each has been observed to have property P.
∴ 2. The next observed O will have property P.

Your argument at the market was like this; it went from the premise that each of the three
grapefruits you observed is good to the conclusion that the next nine grapefruits will be
good. These two patterns of inference are similar. The difference is that a generalization
makes a prediction about an entire class of entities, whereas a projection makes a prediction
about particular entities that have not yet been encountered.
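
To see how the two patterns differ in what they claim, it helps to put toy numbers on the grapefruit case. Here is a minimal sketch in Python, assuming (purely for illustration) that each grapefruit in the box is independently good with probability 0.9:

```python
# A toy contrast between projection and generalization. The independence
# assumption and the 0.9 figure are invented purely for illustration.
p = 0.9                 # assumed chance that any one grapefruit is good

next_case = p           # projection: P(the next grapefruit is good)
all_remaining = p ** 9  # generalization: P(all nine unsampled grapefruits are good)

print(next_case)                # 0.9
print(round(all_remaining, 3))  # 0.387
```

Under the very same assumptions, the generalization claims strictly more than the projection: the chance that all nine remaining grapefruits are good is far lower than the chance that the next one is.
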


Characteristics of Inductive Arguments


Inductive arguments have three characteristics that distinguish them from deductive
arguments: they are ampliative, they are non-monotonic, and they vary in strength. All forms
of non-deductive inference have these features, so they’re also possessed by abductive
arguments, introduced later in this section.
We’ve already encountered the first of these characteristics. Ampliative inferences have
conclusions containing information that in some sense goes beyond, that is, amplifies,
the content present in the premises. Ampliative inferences enable us to extend beyond
that which we already know. The conclusion that all grapefruits in the box are good, for
example, goes beyond the information contained in the premise that three randomly
picked grapefruits from the box are good. Likewise, the conclusion that Flint’s water sup-
ply in general is toxic goes beyond the evidence provided by the Virginia Tech samples.
The consequence of ampliative inference, as we have seen, is that the conclusions of
inductive inferences are not necessitated by the premises. An inductive inference may
preserve truth, but it does not necessarily do so.
Second, inductive arguments are also non-monotonic, which means that whether an
inductive inference’s premises adequately support the conclusion may change with
new information. Adding new premises to the existing premises of a non-monotonic
inference can undermine the support those premises give to the conclusion. For example, adding
to either of the previous grapefruit arguments the new premise that one grapefruit
from the box is rotten undermines the reasoning for the conclusion that all grapefruits
from the box (or all remaining grapefruits) are good. In contrast, recall that deduc-
tive arguments are monotonic; no additional information can possibly render a valid
deductive inference invalid.
Scientists often face surprising findings that force them to adjust and update their
ideas. This feature is captured well by inductive inference patterns because they are
non-monotonic. The addition of new information can reveal how a good inference from
true premises may nevertheless be wrong. For example, you may know that the smallpox
(variola) virus was completely eradicated in 1979. That smallpox was eradicated doesn’t
entail that the virus no longer exists, however. The World Health Organization (WHO)
permits and oversees two vaults containing variola specimens: one at the Centers for
Disease Control and Prevention in Atlanta, Georgia (USA), and the other at the State
Research Centre of Virology and Biotechnology in Novosibirsk, Russia. It was thus rea-
sonable to infer that there are no remaining variola specimens out in nature or otherwise
outside of the control of WHO—at least until 2014, that is. Then, scientists stumbled
upon some 60-year-old unsecured vials of smallpox while cleaning out a storage closet
at the Bethesda campus of the National Institutes of Health in the USA. This discovery
undermined the reasonable inductive inference from the eradication of smallpox and
WHO’s strict control of remaining specimens to the conclusion that there were no other
smallpox specimens unaccounted for. It was a good inductive argument—until a premise
was added that directly contradicted its conclusion.
Third, and last, inductive inferences are of different strengths. The conclusion that
the grapefruits in the box are good would be stronger if the grocer had let you eat two
grapefruits from the box and both tasted good. Similarly, the conclusion that all of Flint’s
water is toxic was strengthened when the Virginia Tech team sampled water from nearly
300 homes, compared to the earlier inference based only on problematic water samples
from a single home.
Good inductive inferences are strong, that is, likely to preserve truth. This means that
true premises are grounds for inferring the conclusion is probably true as well. Deductive
arguments are either valid or not, but this is not so for the strength of inductive argu-
ments. Strength comes in degrees: two arguments might be strong, but one might be even
stronger than the other. Further, any inductive argument, no matter how strong, can be
additionally strengthened. The degree of strength of an inductive inference may be mea-
sured by the probability that the conclusion is true given that all the premises are true.
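
To make this concrete, here is a hedged sketch of how such a probability might be computed for the grapefruit case. Every number in it is an assumption invented for illustration: boxes are either ‘good batches’ (each grapefruit good with probability 0.95) or ‘mixed batches’ (each good with probability 0.5), and the two kinds are assumed equally common.

```python
# Measuring inductive strength as P(conclusion | premises), under the
# made-up model described above. Bayes' rule gives the probability that
# the box is a good batch after n sampled grapefruits all tasted good;
# from that we get the probability that all remaining grapefruits are good.

def posterior_good_batch(n, p_good=0.95, p_mixed=0.5, prior=0.5):
    like_good = p_good ** n
    like_mixed = p_mixed ** n
    return prior * like_good / (prior * like_good + (1 - prior) * like_mixed)

def strength(n_sampled, box_size=12):
    n_left = box_size - n_sampled
    post = posterior_good_batch(n_sampled)
    return post * 0.95 ** n_left + (1 - post) * 0.5 ** n_left

for n in (1, 3, 6):
    print(n, round(strength(n), 2))  # 1: 0.37, 3: 0.55, 6: 0.72
```

The exact figures depend entirely on the assumed model, but the qualitative lesson matches the text: sampling more grapefruits strengthens the very same generalization.
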
Strong inductive inferences may nonetheless have false conclusions, as the smallpox
example shows. To take a more famous example, until the 17th century, Europeans
believed that all swans were white. Their belief was supported by strong evidence: no
European had ever observed a black swan, and no one they’d ever consulted had either.
However, in 1697, the Dutch explorer Willem de Vlamingh returned to Europe with two
black swans he had captured on Australia’s Swan River. The strong inductive argument in
favor of all swans being white was undermined by this development, and the conclusion
was shown to be false.
We discussed the hypothesis that all swans are white with the H-D model of hypothe-
sis-testing. The point there was to show the deductive force of refutation, or falsification, in
contrast to confirmation. The discussion of inductive inference here sheds additional light
on the process of confirming hypotheses. Earlier, we merely pointed out that confirmation
does not involve a valid deductive inference. What it does involve is inductive inference.
From this perspective, inferring that a hypothesis is true from some observation(s) can
be judged according to the inductive strength of the inference.

FIGURE 4.4 The black swan of the family (Black Australian swan surrounded by Bewick’s swans)
© Copyright Colin Smith and licensed for reuse under this Creative Commons License.

The Problem of Induction


Why do you look to the east if you want to see the morning sunrise at the horizon? Why
do physicians continue to prescribe Tylenol to reduce fevers? At first glance, the answers
to these questions might seem obvious. The Sun has always risen in the east, every
morning, in the experience of every human being who has watched for it.
And Tylenol has reduced fevers almost without fail so far.
These are very strong inductive inferences. And that’s fine, as far as it goes. But what
justifies inductive inference in general? Well, you might say, inductive inference works
pretty well! (Aside from that whole business of not all swans being white, that is.)
More seriously, we’ve said any inductive inference, no matter how strong, may be shown
to be incorrect with the addition of new information. Nonetheless, inductive inference
has led us to buy good grapefruit and other foods based on samples, to wear a coat when
we leave the house in the winter, to look for the morning sun in the east, and to take
Tylenol (or similar) as needed. It occasionally leads us astray, but by and large, inductive
inference has a very good track record.
The problem, however, is that this good track record can’t justify inductive inference.
This reasoning is itself an instance of inductive inference: because inductive reasoning
has guided us so well up to this point, we conclude that it will continue to do so. But
this just prompts the same question: what justifies the inductive inference that inductive
reasoning is justified?
This is the problem of induction. This problem was set out in the 18th century by
the Scottish philosopher David Hume (1711–1776). Hume (1748/1999) argued that
the problem of induction cannot be solved. The argument goes as follows. Consider how
we might justify inductive inference. There are two possibilities: either use deductive
reasoning or use non-deductive reasoning. Because a strong inductive inference with true
premises may still have a false conclusion, inductive inferences are invalid deductive infer-
ences, so they cannot be justified using deduction. The only other option, then, is to justify
inductive inference with non-deductive reasoning. But the claim that inductive inference
is justified requires showing that it is generally reliable, and the only way to show that
is by inductive inference itself. So, looking to a non-deductive justification for inductive
inference leads to circular reasoning: we would need to use inductive inference in order
to justify inductive inference. In other words, we would have to assume the reliability of
the method whose reliability we need to establish. Consequently, inductive inferences
cannot be justified using non-deductive reasoning, either. Given that deductive and non-
deductive reasoning exhaust the possibilities, Hume concluded that inductive reasoning
cannot be rationally justified.
Hume also noted that the justification for induction appears to depend on what he
called the uniformity of nature assumption. This is the idea that the natural world is suf-
ficiently uniform, or unchanging, so that we are justified in thinking our future experiences
will be consonant with our past experiences. The uniformity of nature assumption can’t
justify induction either, though, since this assumption is merely based in our past expe-
rience. We think nature is uniform because, so far, it has been. But what do we know
about tomorrow?
Philosophers of science have proposed several solutions to the problem of induction.
One possible solution begins from the observation that inductive inferences are intended
to warrant probable conclusions—not guarantees. And there are rational grounds for infer-
ring claims about the probability of something being the case on the basis of empirical
evidence. Perhaps, then, tools of statistical reasoning, which we focus on in Chapters 5
and 6, can justify some varieties of inductive inference. And statistical reasoning does have
a rational basis, provided by probability theory (an axiomatic theory).
A different approach to solving the problem of induction is simply to show that
inductive inference is the best we can reasonably hope for when it comes to making
reliable predictions (Reichenbach, 1938). Either nature is uniform, or it isn’t. If nature is
uniform and we want to make reliable predictions, then a non-inductive method like, say,
fortune-telling may or may not work. In contrast, inductive inference will clearly work.
(Remember the uniformity of nature assumption.) So, if nature is uniform, induction
will be more reliable than non-inductive methods. Now suppose nature is not uniform.
In that case, inductive inference will be unreliable, but so will any alternative methods.
Why is that? Well, suppose that fortune-telling were better than induction, that is,
that fortune-tellers were able to reliably predict the future. This success would imply
some kind of uniformity. But any uniformity in nature can be exploited by inductive
inference. You could, for example, inductively infer the future success of fortune-tellers
from their past successes. Consequently, whether or not nature is uniform, the best
approach one can take to making reliable inferences about the future or the unobserved
is inductive inference.
So, while the Duhem-Quine problem shows that deductive inference isn’t the full story
for hypothesis-testing, the problem of induction indicates inductive inference probably
isn’t the full story of scientific inference either. At the very least, both problems challenge
us to think more deeply about our grounds for inference. In the case of induction, we’ll
see in Chapter 6 how statistical inference may be able to support inductive reasoning
and make more precise its nature.

Abductive Inference

In 1915, the German scientist Alfred Wegener advanced a systematic proposal about
the geologic history of Earth. He proposed that a single landmass, named Pangaea, had
fragmented into the continents that we recognize today. Initially, Wegener’s hypothesis
was not widely accepted. At the time, most scientists accepted that the Earth’s molten
surface cooled billions of years ago and that the remnants of this cooling process are the
major landmasses that we recognize today. There were good reasons to accept the hypoth-
esis that, once encrusted, the Earth’s surface was relatively fixed and stable. But some
surprising geological features were left unaccounted for. For instance, if the continents
are fixed and stable and do not move, then the rough congruence of the shapes of some
continents (think of Africa and South America) is a puzzling coincidence; see Figure 4.5.
Further, some rocks that are now several thousands of kilometers apart have a variety of
characteristics in common. And fossils of some early types of plants and animals were
distributed across continents.

FIGURE 4.5 (a) The Earth’s landmasses fit together a bit like puzzle pieces; (b) Marie Tharp
and Bruce Heezen
Wegener hypothesized that the continents are not fixed on the surface of the Earth but
are instead very slowly drifting in relation to one another. If true, that hypothesis would
account for the puzzling observations that lacked an explanation if the Earth’s landmasses
are unchanging (see Wegener, 1929). In the 1950s, a few decades after Wegener’s initial
proposal, the American geologists Marie Tharp and Bruce Heezen were working to map
the ocean floor, when they made a fascinating discovery about the Mid-Atlantic Ridge,
an extensive mountain range running the whole length (north to south) of the Atlantic
Ocean, almost entirely underwater. They learned that at the top of that ridge, running its
full length, was a valley, and many earthquakes originated in this valley. This, too, fit with
the hypothesis of continental drift. They had, it seemed, discovered that the seafloor was
spreading, further separating the landmasses on either side of the Atlantic Ocean, the
edges of which were roughly congruent, a bit like puzzle pieces.
Continental drift, if true, would account for all of this evidence. Like the shape of the
continents on Earth, everything—all the observations—would then seem to fit together.
Various other kinds of evidence came to light in investigations carried out in a diversity
of fields of science, all supporting continental drift. Today, continental drift is part of the
accepted theory of plate tectonics.
What kind of inference pattern was used when scientists eventually reasoned, from a
variety of evidence, that the hypothesis of continental drift was true? This is clearly an
ampliative inference, in that it goes beyond what’s contained in the evidence. So, it’s not
deductive reasoning. But this doesn’t correspond very well to the pattern we’ve seen of
inductive inference either; it’s not a generalization or projection from an observation of
a certain kind, like the quality of some grapefruit or water, to the expectation of more
observations of that kind. There’s a bigger leap involved in the inference from premises
about geologic features to the conclusion that landmasses have separated and moved apart
over the course of Earth’s history.
This is an abductive inference, a type of non-deductive inference that attributes special
status to explanatory considerations. Abductive inference is also called inference to the best
explanation. The conclusion is not validly deducible from the premises, nor is it a gener-
alization or projection on the basis of the premises. Instead, in reasoning abductively to
some conclusion, one considers whether or not the conclusion, if true, would best explain
the premises. Suppose, for example, that you know your roommate Theresa had a seri-
ous accident yesterday while preparing dinner. This morning, you see her walking down
the hallway with stitches in her hand. The best explanation for the stitches seems to be
that Theresa was cut with a kitchen knife, and that, because of the severity of the cut,
she sought medical attention. So, you hypothesize that Theresa accidentally cut herself
and got the stitches from having gone to the local hospital. This conclusion might not be
true. But if it were, it would account for the available evidence. Thus, abductive infer-
ence is characterized by an appeal to explanatory considerations to conclude that some
hypothesis is true.
Reasoning that corresponds to the form of abductive inference is quite common in both
everyday reasoning and scientific contexts. Recall from earlier in this chapter Hubble’s
reasoning from empirical data to the conclusion that the universe is more than 10 billion
years old (now recognized to be at least 13.8 billion). Support for this hypothesis included
that the universe being at least 10 billion years old best explained a rich body of other
data. Among these data were the pattern Hubble observed in the redshift in the spectral
lines of distant galaxies, observations about the life cycle of stars, and the observation of
the cosmic microwave background radiation. The hypothesis also coheres with fundamental
theories of physics, like the theory of general relativity, as well as various dating methods
in geochemistry. That agreement with other theories is also best explained by the truth
of this hypothesis about the age of the universe. Abductive inference from observations
and other scientific theories thus confirmed Hubble’s hypothesis.
The Dutch mathematician and scientist Christiaan Huygens (1629–1695) said of abduc-
tive reasoning,

One finds in this subject a kind of demonstration which does not carry with it so
high a degree of certainty as that employed in geometry; and which differs distinctly
from the method employed by geometers in that they prove their propositions by
well-established and incontrovertible principles, while here principles are tested by
the inferences which are derivable from them. The nature of the subject permits no
other treatment. It is possible, however, in this way to establish a probability which
is little short of certainty. This is the case when the consequences of the assumed
principles are in perfect accord with the observed phenomena, and especially when
these verifications are very numerous; but above all when one employs the hypoth-
esis to predict new phenomena and finds his expectations realized.
(1690/1989, p. 126)

The principles Huygens discussed are hypotheses about the nature of light that could
explain experimental results in optics. One way to interpret Huygens’s suggestion is that
hypotheses that provide good explanations of these results are probably true. The rule of
inference he is suggesting is something like the following: from a given set of observations,
infer the best explanation of those observations.

What’s Distinctive about Abductive Inference


Abductive arguments have a distinctive logical form. Abduction is, in a sense, reasoning
backwards. The American philosopher Charles Sanders Peirce (1839–1914) characterized
abduction in this way:

I call all such inference by the peculiar name, abduction, because its legitimacy
depends upon altogether different principles from those of other kinds of infer-
ence. The form of inference is this: the surprising fact, C, is observed; but if A
were true, C would be a matter of course, [and hence], there is reason to suspect
that A is true.
(1903/1940, p. 151)

The logical form of abduction Peirce described is this:

1. The surprising fact C is observed.
2. If the hypothesis A were true, then C would be unsurprising.
∴ 3. There is reason to believe that A is true.

Characterized in this way, abduction is similar to the deductively invalid inference of
affirming the consequent. Recall that the conclusion A is not a valid deductive inference
from the premises if A then C, and C. But notice this is the same pattern of inference in
the preceding argument scheme. In fact, this argument scheme corresponds to the H-D
method’s scheme for confirmation. Recall that the confirmation scheme was not deduc-
tively valid; the role for deductive inference was in the refutation scheme.
There’s an extra element present in the abductive inference scheme characterized
here though, beyond what’s contained in the pattern of affirming the consequent or
H-D confirmation. This extra element is the reference to a level of surprise regarding
the observed fact. An abductive argument can’t be used to infer that any antecedent is
true simply from the fact that its consequent is. Instead, the idea is that if the anteced-
ent accounts for a consequent that would otherwise be left unexplained, then this is
grounds for believing the antecedent is true. The power of a hypothesis to explain what
is otherwise unexplainable is a reason to infer it is (probably) true.
Abductive inference thus differs from both deductive and inductive inference.
Abduction looks like a kind of deductive inference, but it is deductively invalid. Like
inductive inference, abductive reasoning is instead a form of non-deductive inference. It is
thus ampliative and non-monotonic, and the quality of abductive arguments is a matter of degree.
But unlike induction, abductive inference does not generalize or project from what has
been observed. The special weight abductive inference accords to explanatory consider-
ations means that its conclusions are harder to predict from existing observations.
It’s not clear how to characterize the idea of some hypothesis best explaining some set
of observations. How should a hypothesis relate to the observations in order to explain
them? Abductive inferences seem to rely on an inferential ‘leap’—a leap in the reasoning
of one or more scientists having an ‘aha!’ moment, of seeing how some new idea about
the world might explain otherwise puzzling observations. Scientists employing abductive
inference in favor of a hypothesis need to hope that their audience grasps the connection,
that their audience sees how the hypothesis accounts for the observations. It’s not clear
whether there is anything definitive that can be said about what it takes for a hypothesis
to accomplish that task.
One suggestion is that a hypothesis best explains a set of observations if it predicts
the observations, that is, if it shows why the observations were to be expected. By
itself, this isn’t enough to make for a good explanation. Just saying that the observa-
tions in fact occurred is a way to make those observations unsurprising, but it doesn’t
explain anything. Explanations must also have some other qualities. Perhaps explana-
tions should also be simple, fit with other explanations we already accept, and generate
new expectations for what we will observe. These qualities seem to make an abduc-
tively inferred hypothesis—a best explanation—enlightening, as well as a ‘bold and
risky conjecture’. We have emphasized the value of the latter periodically throughout
this book. Indeed, qualities like simplicity, coherence with other explanations, and
fecundity of new ideas have been shown to play central roles in people’s assessment
of explanatory goodness.
Like inductive inferences, the goodness of abductive inferences comes in degrees. Given
the difficulty of pinning down the definition of a best explanation, it’s worth consider-
ing what features of abductive inferences contribute to their strength. First, it seems the
number and variety of surprising observations that a hypothesis explains contributes to its
strength. The abductive inference to continental drift became stronger over the decades,
as geological observations accumulated that would be expected if continental drift had
occurred and that would be surprising otherwise. Second, the degree of an observa-
tion’s surprisingness and the degree to which the hypothesis dispels the surprisingness
contributes to the strength of abductive inference. The finding of a rift down the center
of the Mid-Atlantic Ridge with significant seismic activity is pretty shocking unless
different parts of the Earth’s crust are in (very slow) motion. Third, if appealing
to features of the hypothesis like simplicity, coherence, and fecundity is the right way
to characterize its value as an explanation, then the degree to which those features are
possessed by the hypothesis contributes to the strength of the grounds for the inferential
leap to the truth of this explanation.
An amazing discovery in Morocco illustrates how scientists can appeal to a hypothesis’s
explanatory virtues as evidence in support of the hypothesis. Before this discovery, fossils
from Ethiopia were commonly regarded as the first anatomically modern humans, early
representatives of our species Homo sapiens. These fossils indicated that humans evolved
relatively quickly in a specific region of Africa about 200,000 years ago. The discovery
of new fossils from an archeological site in Morocco, named Jebel Irhoud, challenged
this conclusion (Hublin et al., 2017). In Jebel Irhoud, archeologists and evolutionary
anthropologists found several specimens of stone tools and human bones, including a
remarkably complete jaw and skull fragments. The researchers used dating techniques to
determine that the remains were about 315,000 years old. If these were Homo sapiens,
this would push back the origin of our species by about 100,000 years. This would also
suggest that humans did not evolve only in eastern sub-Saharan Africa (modern Ethiopia)
but in multiple locations across the African continent.
The previously favored hypothesis that Homo sapiens evolved in eastern sub-Saharan
Africa around 200,000 years ago could explain the findings at Jebel Irhoud as remains
from some hominid species that lived prior to Homo sapiens, perhaps the Neanderthals.
The Jebel Irhoud findings also prompted a new hypothesis, though: that the Homo sapi-
ens species’ evolution was a pan-African process that occurred about 300,000 years ago.
This new pan-Africa hypothesis was simpler than the previously favored hypothesis, as
it doesn’t require positing an archaic hominid species in North Africa, later replaced by
Homo sapiens. The pan-Africa hypothesis also cohered with archeological and anatomi-
cal observations about Neanderthals and Homo sapiens. For example, the teeth found in
Jebel Irhoud better matched what would be expected for Homo sapiens than what would
be expected for Neanderthals. The morphology of the skull was almost indistinguishable
from that of anatomically modern humans. And the pan-Africa hypothesis is consistent
with geographical and ecological evidence that the Sahara was green, filled with rivers,
and hospitable around 300,000 years ago. Animals like gazelles and lions inhabiting the
East African savanna then also populated the Saharan region and migrated to northwest
Africa. In fact, remains of plants and animals indicate biological and environmental con-
tinuity between those regions.
Finally, the pan-Africa hypothesis explained a greater number of diverse observations
about human origins than the East Africa hypothesis, including the mix of anatomical
features seen in the Jebel Irhoud remains and in other Homo sapiens–like fossils from
elsewhere in Africa. It also better fits with genomic evidence collected in South Africa
that seems to indicate that the lineage split between archaic hominid species and anatomi-
cally modern humans occurred more than 260,000 years ago. Explanatory considerations,
including simplicity, coherence, and fecundity, thus favored the pan-Africa hypothesis. The
researchers involved in the Jebel Irhoud discovery concluded that ‘the Garden of Eden
in Africa is probably Africa—and it’s a big, big garden’ (Callaway, 2017).

FIGURE 4.6 The pan-African dawn of Homo sapiens

Testimony
The testimony of others plays a central role in reasoning. Many of your beliefs actually
originate from what other people think is true, and the same holds in science. Belief
in others’ testimony is a key component involved in the system of trust and skepticism
we’ve said is crucial for science. Suppose that you are a resident of Flint, Michigan. You
attend a community meeting, where the governor, Rick Snyder, reports that the city water
is safe to drink. To demonstrate this, he himself drinks some tap water. On the basis of
this testimony, you infer that the water isn’t toxic. Later, you learn about the results of
the water quality testing by the EPA and Virginia Tech scientists. This new information
undermines your earlier inference on the basis of the governor’s testimony; you no lon-
ger believe Flint’s water is safe to drink. Later still, Virginia Tech scientist Marc Edwards
reports that Flint’s water is getting better and is far less risky to drink if one uses a high-
grade water filter. You are willing, again, to update your beliefs based on testimony. You
probably wouldn’t take the governor’s word for it at this stage, but given Edwards’s role
as an outside scientific investigator, you take his word for it.
Because science is so collaborative, several scientists—sometimes even thousands of
scientists, like at CERN or NASA—typically conduct research together. In these cases,
they rely on the specific expertise and the honesty of collaborators. This trust in the testi-
mony of collaborators—believing that collaborators are also operating, like yourself, under
norms of sincerity and accuracy—is essential for many scientific projects. Reliance on the

testimony of subjects is also essential in some disciplines, such as ethnography, where
researchers doing fieldwork must rely on the word of locals to understand certain social or
cultural practices. Trustworthy sources are thus essential to the validity of their research
findings.
Because people can lie, make up fantasies, or simply be wrong, reasoning from tes-
timony is risky. This is certainly not deductive inference. But it does seem plausibly
described as a form of abductive inference. When we believe a statement is true based on
someone’s testimony, we do so because the truth of the statement is the best explanation
for why the person would say it is so. This accounts for why you are inclined to believe
the Virginia Tech scientist’s testimony about Flint’s water quality but not the governor’s.
The best explanation for the governor’s claim of safety is that he wants to reassure the
public and, perhaps, protect himself from any culpability.
Thinking about inference from testimony as a kind of abductive inference might help
distinguish the circumstances in which testimony provides sufficient grounds for belief
from when it does not. Expertise about the topic of the claim means that someone is less
likely to be wrong about the claim. The motivations of the person providing testimony
can be taken into consideration to determine the likelihood of intentional deception.
More generally, an assessment of a source’s credibility and the credibility of that person’s
claim is essential in determining when testimony provides reason for belief—and when
it should instead be regarded with skepticism.

EXERCISES

4.20 Decide whether each of the following inferences is deductive, inductive, or abduc-
tive. Provide a justification for each of your answers.
1. Disorder in a system will increase, unless energy is expended. Your home is a
system. So, disorder will increase in your home unless energy is expended.
2. The president says that human activities are not a cause of global warming.
Therefore, human activities are not a cause of global warming.
3. There is no such thing as drought in Australia. The town of Darwin is in Australia.
Therefore, the town of Darwin needn’t ever make plans to deal with drought.
4. Bread appears to grow mold more quickly in the bread bin than the fridge.
Therefore, temperature determines the rate of mold growth.
5. Over two million people on Twitter say that aliens are coming to Earth, which is
more than the number of people on Twitter who are not saying it. So, aliens are
coming to Earth.
6. All mathematicians like math. Jun is a mathematician. Therefore, Jun loves math.
7. Gravity has always operated in the universe. So, gravity will continue to oper-
ate in the universe.
8. The weather forecast indicates that tomorrow will be sunny. So, tomorrow will
be sunny.
9. My brother has black hair, as does my father. Therefore, everyone related to me
has black hair.
10. The library has millions of books. I have a book in my hand, and I just left the
library. Therefore, the book was borrowed from the library.


4.21 Assess the quality of each of the arguments in 4.20, using the proper standard for
its form (deductive, inductive, or abductive). Explain your reasoning. For any bad
arguments, assess whether they would be better arguments of a different inferential
pattern (inductive instead of deductive, for example). If so, reclassify those argu-
ments to be of the pattern they are better at achieving.
4.22 Decide whether each of the following inferences is deductive, inductive, or abductive. If you
aren’t 100% sure of your answer, you should also provide a justification for your decision.
1. Whenever it rains, the streets get wet. The streets are wet now. Therefore, it must
have rained.
2. Of the students interviewed, 65% say that they prefer Italian to French wine.
Therefore, all students prefer Italian wine.
3. A medical technology ought to be funded if it has been used successfully to treat
patients. Adult stem cells have been used to treat patients successfully. There-
fore, adult stem cell research and technology ought to be funded.
4. The murder weapon has Pat’s fingerprints on it. Therefore, Pat is the murderer.
5. Sociologists agree that global inequality has decreased because of economic
liberalization in China and India. Therefore, it must be true that global inequal-
ity has decreased.
6. Studies found a strong correlation between IQ scores and language competence.
Therefore, if a person has a high IQ score, that person has high linguistic competence.
7. The witness testified that a paisley yellow car caused the accident. Given how unmis-
takable paisley is, it’s very likely that a paisley yellow car did cause the accident.
8. These beans have been randomly selected from this 25-pound bag, and they
are black. So, it is likely that all the beans from this bag are black.
9. The best explanation of the acquisition of language is that we possess an innate
universal grammar. So we must possess an innate universal grammar.
10. Leaded gasoline and lead pipes were both used for a while but eventually dis-
continued. So, all lead products are toxic.
4.23 Assess the quality of each of the arguments in 4.22, using the proper metric for its
form (deductive, inductive, or abductive). Explain your reasoning. For any bad argu-
ments, assess whether they would be better arguments of a different inferential pat-
tern (inductive instead of deductive, for example). If so, reclassify those arguments
to be of the pattern they are better at achieving.
4.24 Define deductive inference, inductive inference, and abductive inference in your
own words, and give an example of each.
4.25 Consider each of the patterns of inference, deductive, inductive, and abductive,
as an account of hypothesis-testing. For each account, describe what features of
hypothesis-testing it captures well and at least one drawback or limitation it faces.
4.26 The conclusion that Flint’s water supply is toxic is based on substantial evidence.
Nonetheless, the inference to that conclusion is non-monotonic. What kinds of new
information could you learn that would undermine the inference to that conclusion?
Give three examples.
4.27 We have said that good inductive arguments are strong, but we haven’t said much
about what it takes for an inductive argument to count as strong. Consider what
we’ve learned about inductive inference and examples of inductive inference we’ve
encountered, as well as the features of experiments and other studies from Chapter 2.
List at least three features an inductive argument could have that contribute positively
to its strength.
4.28 Consider this passage from Darwin’s Origin of Species (1872: 421):
It can hardly be supposed that a false theory would explain, in so satisfactory a
manner as does the theory of natural selection, the several large classes of facts
above specified. It has recently been objected that this is an unsafe method of
arguing; but it is a method used in judging of the common events of life, and
has often been used by the greatest natural philosophers. The undulatory theory
of light has thus been arrived at; and the belief in the revolution of the earth
on its own axis was until lately supported by hardly any direct evidence. It is
no valid objection that science as yet throws no light on the far higher problem
of the essence or origin of life. Who can explain what is the essence of the
attraction of gravity? No one now objects to following out the results consequent
on this unknown element of attraction; notwithstanding that Leibnitz formerly
accused Newton of introducing ‘occult qualities and miracles into philosophy’.

What ‘method of arguing’ do you think Darwin had in mind? What objections to this
method of arguing does he consider, and how does he dispute those objections?
4.29 Thinking about whatever examples of science you want, from elsewhere in this book
or other sources, come up with a clear instance of inductive inference. (This should
be a more realistic example than grapefruit choosing.) Put the inference in standard
argument form with numbered premises and a conclusion (as best you can), and
then assess its strength. If you were a scientist focused on this inference, what kinds
of steps could you carry out to additionally support the conclusion?
4.30 Thinking about whatever examples of science you want, from elsewhere in this book
or other sources, come up with a clear instance of abductive inference. Why should
this be viewed as an abductive inference? Assess the explanatory strength of the
inference. If you were a scientist focused on this inference, what kinds of steps could
you carry out to additionally support the conclusion?
4.31 Describe one instance in which you would, or in fact have, taken someone’s word
for something, that is, used testimony as grounds for belief. Then, try to characterize
this as an abductive inference. Is abductive inference a good way to think about this
use of testimony? Why or why not?
4.32 Describe an instance in which you would not, or in fact have not, taken someone’s
word for something, that is, used testimony as grounds for belief. What was differ-
ent between this situation and the situation you described in your answer to 4.31?
Does consideration of the features of good abductive inferences account for the dif-
ference? Why or why not?

FURTHER READING

For a philosophical treatment of reasoning in general, see Harman, G. (2008). Change in
view: Principles of reasoning. Cambridge: Cambridge University Press.


For more on conditional reasoning, see Nickerson, R. (2015). Conditional reasoning: The
unruly syntactics, semantics, thematics, and pragmatics of “if”. Oxford: Oxford Univer-
sity Press.
For an in-depth summary of the Flint water crisis, see www.cnn.com/2016/03/04/us/flint-
water-crisis-fast-facts/index.html
For Hume’s problem of induction, see Hume, D. (1748/1999). An enquiry concerning
human understanding, ed. T. L. Beauchamp. Oxford/New York: Oxford University
Press. Sections 4–6.
For a helpful guide to Hume’s problem, see Salmon, W. (1975). An encounter with David
Hume. In J. Feinberg (Ed.), Reason and responsibility (3rd ed., pp. 245–263). Encino:
Dickenson Publishing Co.
For a different version of the problem of induction, see ‘The new riddle of induction’,
Chapter 3 of Goodman, N. (1983). Fact, fiction and forecast. Cambridge: Harvard
University Press.
For more on abductive reasoning, see Lipton, P. (2003). Inference to the best explanation.
New York: Routledge.
CHAPTER 5

Statistics and Probability

5.1 THE ROLES OF STATISTICS AND PROBABILITY

After reading this section, you should be able to do the following:

• Give three new examples of situations involving statistical reasoning and describe
how statistical reasoning is involved in each
• Characterize the difference between descriptive and inferential statistics
• Define probability theory and say how it relates to statistics

Statistical Thinking as a Pillar of Health, Wealth, and Happiness
‘Statistical thinking will one day be as necessary for efficient citizenship as the ability
to read and write!’, the British writer H. G. Wells has been paraphrased as saying. Wells
was right. Whether you realize it or not, statistics rules your life. When you are making a
prediction about how long it takes to drive from your home to your weekend destination
and whether the weekend trip will be rained out, when you are arguing with a friend about
basketball teams or wondering how your grade compares to your classmates’ grades—in
all these situations, you are using statistical tools and relying on statistical information.
Statistical reasoning is an important part of making good decisions in everyday life.
Statistical reasoning is also a staple of scientific inquiry. Statistics is related to several
topics we’ve already discussed. Recall our description of one central recipe for scientific
reasoning from Chapter 1: using a hypothesis to generate expected observations, testing
the expectations against actual observations, and using actual observations (or data) to
help decide whether the hypothesis is a good one. We’ve seen that determining exactly
what expectations follow from a hypothesis can be tricky. This is especially the case
whenever there is variation in how events unfold. Variation means that we get different
results when we repeat measurements. As the study of variation is central to statistical
reasoning, statistics provides tools to help determine what a hypothesis should lead us to
expect. It thus contributes to hypothesis-testing in science. Because statistical reasoning
can be construed as a kind of inductive reasoning, it also helps us extend from what we
think we know about the world to make predictions that we’re less certain of—as when
we predict the weather or driving time based on traffic conditions.
All of this will become clearer as we dive into statistical reasoning in this and the next
chapter. For now, simply notice that, if all of this is true, then it’s no exaggeration to say

that our health, wealth, and happiness all hinge upon understanding and communicating
statistics.
Up until the 18th century, the word statistics (from Latin for ‘state’) meant any data
relevant to running a nation or country. These data included demographic and economic
information relevant to the condition of the country—for example, about birth and death
rates, individual and national wealth, and level of employment. Today, statistical reasoning
is applied to virtually any kind of data—from data concerning the performance of bas-
ketball players to data about casinos, medical diagnoses, and issues of global importance
like anthropogenic climate change.
Consider these three scenarios:

1. Arguing about basketball: Your friend says that LeBron James is a better basketball
player than Michael Jordan. You disagree. You remind her that Jordan won the NBA
Championship for the last time in 1997–1998, playing 82 games. Over that season,
he scored a total of 2,357 points, rebounded 475 missed shots, and made 283 assists.
His free throw percentage was .784. No way LeBron is better than MJ! Your friend
responds that the first time LeBron won the NBA Championship in 2011–2012, he
played 79 games, scoring a total of 2,111 points, rebounding 590 missed shots, and
making 554 assists. LeBron’s free throw percentage was .759. And he’s a monster for
stealing balls and blocking shots.

As this example shows, people often appeal to statistical evidence in sports, perhaps to
support the claim that some sports player is best or to argue that some team is likely to
win the next game.
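
As a quick check of the arithmetic behind such arguments, here is a minimal sketch that converts the season totals quoted in the scenario into per-game averages. The totals are the ones given above; everything else is just division:

```python
# Per-game averages computed from the quoted championship-season totals.
jordan = {"games": 82, "points": 2357, "rebounds": 475, "assists": 283}
lebron = {"games": 79, "points": 2111, "rebounds": 590, "assists": 554}

for name, s in (("Jordan 1997-98", jordan), ("LeBron 2011-12", lebron)):
    per_game = {k: round(v / s["games"], 1) for k, v in s.items() if k != "games"}
    print(name, per_game)

# Jordan 1997-98 {'points': 28.7, 'rebounds': 5.8, 'assists': 3.5}
# LeBron 2011-12 {'points': 26.7, 'rebounds': 7.5, 'assists': 7.0}
```
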

2. Playing roulette: Imagine that you’re at a casino in Monte Carlo, eager to play roulette.
The wheel includes 37 colored and numbered pockets, of which 18 are black, 18 are
red, and one is green. If you bet €10 on red, and the winning color is red, then you will
win €20—and likewise if you bet €10 on black and black wins. Now, imagine that the
winning color has been black for 26 times in a row. You might bet on red, reasoning
that red should come up very soon since there have been so many black wins.

You are making a prediction based on past occurrences, and your prediction is based on
statistical reasoning. (This is also flawed statistical reasoning, as we’ll see.)
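
A small simulation makes the flaw visible. This is a hedged sketch: the run length of five blacks (rather than 26) is chosen only so that the event occurs often enough to count in a feasible number of simulated spins.

```python
# Simulate a fair 37-pocket wheel and check the chance of red
# immediately after a run of five blacks.
import random

POCKETS = ["red"] * 18 + ["black"] * 18 + ["green"]

last_five = []
runs = reds_after_run = 0
for _ in range(1_000_000):
    outcome = random.choice(POCKETS)
    if last_five == ["black"] * 5:        # the previous five spins were all black
        runs += 1
        reds_after_run += (outcome == "red")
    last_five = (last_five + [outcome])[-5:]

print(round(reds_after_run / runs, 3))  # close to 0.486
print(round(18 / 37, 3))                # 0.486 (the wheel has no memory)
```
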

3. A medical test: You have a sore throat, so you go to the doctor. The doctor examines your
throat and calls for a ‘rapid strep test’. While you wait for her to return with the results,
you ponder how you should react. What if she tells you the test was negative? Does this
mean you don’t have strep throat? Not necessarily. It means there’s an approximately 95%
chance that you don’t have strep throat. If you have all the symptoms of this illness though,
your doctor may want to follow up with another test—a ‘strep culture’—to verify the neg-
ative result. There might be strep bacteria lurking there, undetected. What if your doctor
tells you the test was positive? Then you can be pretty certain you do have strep bacteria
in your throat. However, about 20% of people are carriers for strep. This means that even
if strep bacteria are present, there’s a chance this isn’t the cause of your sore throat.


The rapid strep test—like most medical tests—gives you statistical data. You and your
doctor then need to decide how to interpret the data, whether they are evidence for a
particular conclusion, and what steps to take next.
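
To see what that interpretation can look like, here is a hedged sketch using Bayes’ rule. All three numbers are assumptions chosen for illustration only; they are broadly in line with the ~95% figure quoted in the scenario, but an actual test’s characteristics may differ.

```python
# Interpreting a strep test result with Bayes' rule (assumed numbers).
sensitivity = 0.86   # assumed P(test positive | strep)
specificity = 0.95   # assumed P(test negative | no strep)
prior = 0.20         # assumed P(strep) for a patient with a sore throat

# P(strep | negative test)
p_neg = (1 - sensitivity) * prior + specificity * (1 - prior)
p_strep_given_neg = (1 - sensitivity) * prior / p_neg
print(round(p_strep_given_neg, 3))  # 0.036: a negative test leaves a small residual chance

# P(strep | positive test)
p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
p_strep_given_pos = sensitivity * prior / p_pos
print(round(p_strep_given_pos, 3))  # 0.811: strong evidence, but not certainty
```
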
Each of these three cases involves the collection, presentation, analysis, and interpre-
tation of statistical information. Reasoning with statistical information is everywhere!
Learning to reason better with statistics can thus help you make good decisions about
questions concerning your health, wealth, and happiness—and basketball too.

Populations and Samples


We have said that statistical reasoning is used in science to determine expectations from
hypotheses and is a form of inductive reasoning. These and other uses of statistics stem
from one core feature: statistics is exceedingly useful in managing and understanding
variation. There is variation across most types of things: butterfly coloration, the severity
of oil spills, how often people smoke cigarettes, and the grades of students, to name a
few. How can one know what to expect, given all this variation? How much and in what
ways will things vary, and in what ways will they stay the same? What are the meaning-
ful patterns and regularities, and what is meaningless variation? These are all questions that
statistics helps scientists address.
When scientists interpret their experimental results, they regularly need to distinguish
between what we might call background variation and the variation between experimen-
tal and control groups due to an intervention. Did the students who attended the study
session really perform better on the test, or was that just background variation due to
chance differences between the students?
Scientists also regularly need to generalize from the groups they’re familiar with to
another group, a population. The term population is often used by statisticians, scien-
tists, and others to refer to the target systems of experiments and studies. In statistics, a
population is a large collection of things that share some characteristic. For example, the
population of Indian people shares a common geographic origin, literature, genetic heritage,
and linguistic history. Some populations consist of people, but others consist of bacteria,
stars, or more abstract objects like companies, households, homicides, and free-throws.
As previously discussed, data are public recordings of observations, such as measure-
ments, that are elicited from the real world and used to evaluate hypotheses about a
target system. In most cases, it is impossible to collect data about each individual in a
population of any of these kinds—think about surveying all the people in India on some
question or how difficult it would be to collect data about all stars of the Milky Way
galaxy. For this reason, scientists regularly obtain data about a subset of the population
they are interested in. This subset is a sample of the population, and the data concerning
individuals in this subset are sample data.

Descriptive and Inferential Statistics


There are two main kinds of statistical reasoning. The first kind, which will be a main
focus of this chapter, involves the use of statistics to describe features of data sets. This
is descriptive statistics: summarizing, describing, and displaying data in a meaningful
way. For example, finding a class’s average score on an exam or quiz is a common use of
descriptive statistics. Finding patterns in sets of data—averages, extent of variation, and
trends over time, for example—and visually representing them are all forms of descrip-
tive statistics.
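
As a concrete illustration, here is a minimal sketch of descriptive statistics applied to a made-up set of quiz scores, using only Python’s standard library:

```python
# Summarizing a (made-up) set of quiz scores with descriptive statistics.
import statistics

scores = [72, 85, 91, 68, 77, 85, 94, 60, 81, 88]

print("mean:  ", statistics.mean(scores))              # 80.1, the average score
print("median:", statistics.median(scores))            # 83.0, the middle score
print("mode:  ", statistics.mode(scores))              # 85, the most common score
print("stdev: ", round(statistics.stdev(scores), 1))   # 10.8, spread around the mean
```

Nothing here goes beyond the data set itself; the numbers simply summarize and display what was observed.
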
The second kind of statistical reasoning, which will be a main focus of the next chapter,
involves the use of statistics to make inferences based on data. This is inferential statistics:
using statistical reasoning to draw broader conclusions from data, such as the generaliza-
tions and projections discussed in Chapter 4. For example, from data about a sample
of American citizens, inferential statistics can be used to estimate what proportion of
Americans who smoke marijuana every day will develop one or more mental health prob-
lems. Likewise, test scores from a subset of students can be used to infer what distribution
of scores we should expect in the class as a whole. Inferential statistics uses patterns in
existing data sets to inform our expectations when we do not have data about a population
or a system we want to learn about. With the help of inferential statistics, existing data
sets can be used to make predictions about larger groups, different groups, and the future.
The idea of inferring conclusions that go beyond what is already known calls to mind
inductive reasoning. Recall from Chapter 4 that an inference is inductive when the nature
of the relationship between the premises and conclusion is one of probability rather than
necessity. Inferential statistics can be understood as a specific type of inductive inference,
which is especially useful in the face of variation. An inductive inference counts as sta-
tistical when it uses the tools and follows the rules of statistics.
There are two main uses of inferential statistics: (1) either inferring properties of a
population based on sample data or inferring properties of a sample based on informa-
tion about a population and (2) testing hypotheses about a population by performing an
experiment or observational study on a sample. In this latter use of statistics, the sample
often includes experimental and control groups. The role of these groups in experimenta-
tion was discussed in Chapter 2.

Statistics and Probability Theory


Statistics is an important use of mathematics in science, and the mathematical power
of statistics comes from probability theory. Probability theory is a mathematical theory
that has been developed to deal with randomness—that is, with outcomes that are indi-
vidually unpredictable but that behave in predictable ways over many occurrences. We
have already said that statistics excels at dealing with variation. The trick behind this is
thinking about variation as a kind of randomness. In the context of statistical reasoning,
randomness does not mean haphazard or lacking aim or purpose. Instead, randomness is
a measure of uncertainty of an outcome and applies to concepts of chance, probability,
and information.
The simplest examples of randomness are things like coin tosses and dice throws. In
a normal roll of a standard die, you can’t possibly know whether you’ll roll a one, two,
three, four, five, or six. But you do know that if you roll that die 500 times, or roll 500
dice, you probably won’t roll a six every time. The word probably is important there.
Probability theory actually enables us to calculate what that probability is; it can tell us
exactly how unlikely it is to roll a six 500 times in a row.
These kinds of probability calculations are put to work in statistical reasoning. For
example, suppose you use probability theory to work out the chance of all possible
different outcomes for 500 dice rolls. In working out these probabilities, you assume the
die is fair; that is, you assume each possible outcome—one, two, three, four, five, six—is
equally likely on each roll. If you then roll a die 500 times, and six comes up 200 times,
you can use those probabilities to infer that this is a somewhat improbable, or unlikely,
outcome. You can use statistics to decide, based on the level of improbability, whether
something’s fishy—whether, perhaps, your die isn’t fair after all.
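To make this concrete, here is a minimal Python sketch—our own illustration, not part of the text—that simulates 500 rolls of a fair die and reports how improbable a run of 500 consecutive sixes would be. The seed value is arbitrary and only makes the run reproducible.

```python
import math
import random

random.seed(1)  # arbitrary seed, so the example is reproducible

# Simulate 500 rolls of a fair six-sided die and count the sixes.
rolls = [random.randint(1, 6) for _ in range(500)]
print("sixes in 500 rolls:", rolls.count(6))  # typically near 500/6, about 83

# The probability of rolling a six 500 times in a row is (1/6)**500.
# That number is too small for ordinary floats, so report its order of magnitude.
print("log10 of Pr(500 sixes in a row):", -500 * math.log10(6))  # about -389
```

A count far from the expected 83 sixes, like the 200 imagined above, is the kind of outcome that would make you suspect the die is not fair.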
So, statistical reasoning relies on mathematics and in particular on probability theory.
But it doesn’t just boil down to running calculations. It is much more important to
understand the meaning of the numbers, probabilities, and equations behind the statistics.
Acquiring this understanding will help make you a stats-savvy person, someone who can
critically examine claims based on statistical reasoning in science and in everyday life and
who can better handle the barrage of statistical information that fills our lives. This will
be our focus in this chapter and in Chapter 6 as well. In this chapter, we’ll work through
some basic concepts of probability theory, and then discuss descriptive statistics. Then, in
Chapter 6, we will turn our focus to inferential statistics.

EXERCISES

5.1 First, describe the difference between a sample and a population. Second, state
whether the following statements refer to a sample or to a population:
a. Researchers found that 2% of the Americans they interviewed believed they had
seen a UFO.
b. Based on their survey data, the researchers concluded that one in three of all car
crashes in the country are linked to alcohol impairment.
c. Two-thirds of the butterflies we observed were pink.
d. After reading four essays, the teacher expects that 85% of the class will pass the
exam.
e. Twenty-five percent of the planets in the Solar System have no moons.
f. More than one billion people in the world live on less than one dollar a day.
5.2 What is the difference between descriptive statistics and inferential statistics? Indicate
whether each of the following statements is based on descriptive or inferential statis-
tics, and explain why.
a. As of 2017, the director Quentin Tarantino has received a total of two Academy
Awards.
b. Students with an undergraduate GPA of 3.00 are expected to have a starting
salary of $30,000.
c. In 2016, the population of São Paulo, Brazil, was 12,038,175.
d. The mean grade in the class was B+.
e. A study stated that British adults are nearly 12 kilograms (26 pounds) heavier
now than they were in 1960.
f. Economists say that mortgage rates may soon drop.
g. The gross national income per capita in South Sudan in 2013 was $2.
h. According to World Health Organization data published in 2015, life expec-
tancy in Bangladesh is 71.8 years.


5.3 Describe what probability theory is in your own words. Then, looking back at the defi-
nitions of descriptive statistics and inferential statistics, describe how you think statistics
relies upon probability theory. Name three everyday situations where probability
theory is used.
5.4 Find a news article or opinion column published in the past month that uses statistical rea-
soning of some kind. After citing the source, write a paragraph describing the following:
a. The main point of the article or column
b. What statistics are provided
c. How the author makes use of statistics in his or her reasoning
d. How good this use of statistical reasoning seems to be and why (or why not)
5.5 Statistical reasoning pervades our lives, often in ways we don’t realize. After reflect-
ing on your daily routine, write out a list of 10 ways in which variation, statistical
reasoning, and probability are part of that routine, either explicitly or implicitly.

5.2 BASIC PROBABILITY THEORY

After reading this section, you should be able to do the following:

• Define these seven terms: random variable, outcome space, mutually exclusive, collec-
tively exhaustive, total probability, statistical independence, and conditional probability
• Calculate the probability of multiple outcomes occurring (together or individually)
based on the probabilities of individual outcomes
• Calculate conditional probabilities

Random Variables
The number rolled on a die and whether a coin lands on heads or tails are both random
variables. Random variables have different values that are individually unpredictable
but predictable in the aggregate. You can’t predict whether a coin will land on heads or
tails, but you can predict that lots of coin tosses will give you roughly equal numbers
of heads and tails. The set of all values a random variable can have is called its outcome
space, or sample space.
Let’s work through these ideas using the simple coin-toss example. The random vari-
able involved in a coin toss is the figure shown on the top of the coin. We can refer to
this variable with a capital letter, say, X. The set of possible values of X—its outcome
space—is heads and tails: these are all the values our random variable can possibly take.
To distinguish the variable from its possible values, we will refer to the values of a ran-
dom variable with small letters, in this case, say, h and t. We can now define the outcome
space of a coin toss as follows:

X = {h; t}

(The symbols ‘{’ and ‘}’ are curly braces, which is the conventional notation used to indi-
cate a set, that is, any abstract grouping of items.)
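This notation translates directly into code. The following Python fragment is our own sketch of the same idea, not something from the text: the outcome space is a set, and a probability assignment maps each outcome to a number between 0 and 1.

```python
# Outcome space of the coin-toss variable X: the set {h, t}.
X_outcomes = {"h", "t"}

# A probability assignment for a fair coin: each outcome gets a number
# between 0 and 1, and the whole outcome space gets total probability 1.
Pr = {"h": 0.5, "t": 0.5}

assert sum(Pr[x] for x in X_outcomes) == 1.0
```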


Random variables are the building blocks of probability theory and, in turn, of statisti-
cal reasoning. Probabilistic reasoning begins with the observation of how probable it is
for a random variable to take on any given value. For our coin-toss example, there’s a
100% chance that the coin lands on either heads or tails (since this is the whole outcome
space). Probabilities vary between 0 (maximally improbable) and 1 (maximally probable),
so we write this as follows:

Pr(X=h or t) = 1

No matter how many values a random variable can have, that whole set of values—its
whole outcome space—has a probability of 1. This means it’s guaranteed that the variable
will take on one of those values. The total probability of an outcome space is always 1.
The outcomes in any outcome space have two important properties: they are mutually
exclusive and collectively exhaustive. Mutually exclusive outcomes occur when no more
than one of the outcomes can occur at any given time. On a single coin toss, you might
get heads or tails, but you will never get both. Heads and tails are mutually exclusive
outcomes. Collectively exhaustive outcomes occur when at least one of the outcomes
must occur at any given time. For a successful coin toss, the coin must land heads up
or tails up—there is no third option. This means that heads and tails are collectively
exhaustive outcomes.
Now, if the coin is fair, then the probability of the coin landing on heads will equal
the probability of it landing on tails. That is, for a fair coin, Pr(X = h) = Pr(X = t). Since
we already know the probability of the whole outcome space together is 1, and there are
two equally probable outcomes in that outcome space, we can calculate that:

Pr(X=h) = Pr(X=t) = 1⁄2, or .5, or 50%

Because there are two equally probable outcomes, each outcome has a probability of ½
(.5 or 50%). That’s just the total probability for the outcome space (which is always 1)
divided by the number of possible outcomes (which is two, in this case). To generalize,
for any random variable with equally probable outcomes, the probability of one of those
outcomes is one divided by the number of possible outcomes. So, for a fair, six-sided die,
the probability of rolling any one number is one divided by six, or 1⁄6.
A random variable that is not fair is biased in favor of one or more outcomes. This
means one or more outcomes are more likely—have a higher probability of occurrence—
than other outcomes. French roulette is fair, but American roulette is not—at least not
in the statistical sense of fairness. This is because a French roulette wheel has 37 pockets,
numbered zero through 36, whereas an American roulette wheel has 38, two of which
are zeroes. In the latter case, the roulette is biased toward zero, because zero will occur
more often than any other number on the wheel if we spin the roulette over and over
again. More precisely, in American roulette, the probability of getting any number from
one to 36 is 1⁄38, while the probability of getting zero is 2⁄38, or 1⁄19.
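As an illustration (ours, not the book's), the bias of an American wheel can be checked by writing down the probability of each outcome and confirming that the probabilities still sum to 1. Exact fractions avoid rounding error.

```python
from fractions import Fraction

# Treat the two zero pockets as a single outcome 'zero', as in the text.
pr = {n: Fraction(1, 38) for n in range(1, 37)}
pr["zero"] = Fraction(2, 38)  # automatically reduced to 1/19

assert sum(pr.values()) == 1  # total probability of the outcome space
print(pr["zero"] > pr[17])    # True: the wheel is biased toward zero
```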
There is another way in which a roulette, or any series of outcomes, might be unfair.
A series of outcomes might have ‘memory’, in the sense that previous outcomes might
influence future outcomes. In one of the scenarios described at the beginning of the
chapter, we imagined a person who thought roulette wheels work in this way. This person
reasoned, ‘Red should come up very soon since there’s been so many black wins. So, I’ll
bet on red!’ If the roulette were unfair because it had memory, this might be good rea-
soning: the roulette might change to red because there had been lots of black wins. But
fair roulette spins have independent outcomes: the probability of each outcome is not
influenced by past outcomes. So, in order to be fair, roulette spins and any other random
variables must be independent of one another.
To summarize, a random variable must be unbiased and its outcomes must be inde-
pendent for the random variable to be fair. Coin tosses, dice throws, and French roulettes
are all examples of fair random variables.
Lots of random variables are unfair. For example, LeBron James’s free throw success
is a random variable. Let’s call this variable Y. There are only two possible outcomes:
LeBron either misses the free throw or scores. So, this random variable has an outcome
space of Y = {miss; score}. So far, this is simple. The problem is that the chance of LeBron
scoring versus missing is probably not 50⁄50. There is a bias in favor of the outcome of
scoring; for LeBron James, this is more likely than missing. The outcomes might also fail
to be independent: missing a shot might make LeBron more, or less, likely to score on
the next free throw.
It’s much more difficult to calculate probabilities for unfair variables like free throw
success. So, for now, we’ll stick with fair random variables, like coin tosses and dice throws.

The Addition Rule


The larger the probability associated with an event, the more likely it is to occur. We
assign probability 1 to events that are guaranteed to happen or to statements that we are
entirely certain are true. And we assign probability 0 to events that are guaranteed not
to happen or to claims that are certainly false. For instance, the probability that you roll
a seven on a single, regular die is 0. This cannot happen, since no side of the die shows
seven dots or marks. Assuming you successfully roll the die, the probability that you roll
some number between one and six is 1. Between 0 and 1, a higher number—a larger
probability—means an outcome is more likely.
For a fair die, rolling any number between one and six is equally probable. In our offi-
cial notation, we could write this as Pr(D = 1) = Pr(D = 2) = . . . = 1⁄6. But we might also
wonder about the probability of other possible outcomes. For example, on a single die
roll, how probable is rolling an even number? How about an odd number? How about
any number greater than one?
These probabilities can be found using simple addition. Consider the example of roll-
ing an even number. This can be expressed as: Pr(D = 2 or D = 4 or D = 6). We already
know that each of those three outcomes has a probability of 1⁄6. The probability that
any of those outcomes occurs on a given roll is just the probability of each outcome, all
added up together as follows:

Pr(D=2 or D=4 or D=6) = Pr(D=2) + Pr(D=4) + Pr(D=6) = 1⁄6 + 1⁄6 + 1⁄6 = 3⁄6 = 1⁄2

Beware! Adding probabilities in this way only works for mutually exclusive outcomes.
If we wanted to ask about the probability of rolling an even number or a five, we could
just add in another 1⁄6, yielding 4⁄6 or 2⁄3 as the probability. But this doesn’t work if someone
asked us about the probability of rolling an even number or a six. Because six is one of
the even numbers on the die, the outcomes of rolling a six and rolling an even number
are not mutually exclusive. You can’t simply add up the different probabilities to find
the answer. In the case of rolling an even number or a six, the probability is the same as
it was for rolling an even number (since rolling a six is one way to roll an even number).
The probability is still ½.
This way of calculating probabilities is called the addition rule. This rule says that
the probability that any of a series of outcomes will occur is the sum of their individual
probabilities. It’s very important to ensure that the requirement of mutually exclusive
outcomes is met. If not, addition will lead you astray.
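A short Python sketch (ours, not the book's) makes the warning vivid: adding works for mutually exclusive outcomes, but for overlapping outcomes like 'even or six' you must count the distinct qualifying outcomes instead.

```python
from fractions import Fraction

pr = {face: Fraction(1, 6) for face in range(1, 7)}  # a fair six-sided die

# 2, 4, and 6 are mutually exclusive, so the addition rule applies.
print(pr[2] + pr[4] + pr[6])          # 1/2

# 'Even or six' overlaps, so naive addition overcounts:
print(pr[2] + pr[4] + pr[6] + pr[6])  # 2/3 -- wrong!

# Counting the distinct qualifying faces gives the right answer.
qualifying = [f for f in pr if f % 2 == 0 or f == 6]
print(sum(pr[f] for f in qualifying))  # 1/2
```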

The Multiplication Rule


A different rule of probability uses multiplication to calculate the probability of all of
a series of outcomes occurring. For example, what is the probability of rolling two sixes
when you roll two dice? Put another way, this question is asking for the probability of
rolling a six on one die and also rolling a six on a second die. The probability we are
looking for is thus Pr(D1=6 and D2=6), where D1 and D2 are the two dice. Of course,
there’s a 1⁄6 probability of a six for any given die roll. So:

Pr(D1=6 and D2=6) = Pr(D1=6) × Pr(D2=6) = 1⁄6 × 1⁄6 = 1⁄36

The probability of 1⁄36 is a lot closer to zero than to 1⁄6. That’s why rolling two sixes or
two ones—‘snake eyes’—is exciting. It seldom happens!
Beware though! There’s an important condition for multiplying probabilities as well.
They must satisfy the independence condition. This means that the probability of each
outcome must be independent from one another. Each outcome must not influence the
probability of the other outcomes. Think of it this way. If, instead of calculating the prob-
ability of rolling two sixes on two dice, we wanted to calculate the probability of rolling a six
on one die roll but also a one on the very same die roll—Pr(D1 = 6 and D1 = 1)—we can’t just
multiply 1⁄6 × 1⁄6. These outcomes aren’t independent. In fact, they are mutually exclusive:
if one occurs, the other is guaranteed not to occur. This means the probability in question is
maximally improbable: it’s zero. So, we can only use multiplication to find the probabilities
of a series of outcomes all occurring if the outcomes in question are independent.
According to the multiplication rule, the probability that all of a series of outcomes
occurs is the result of multiplying their individual probabilities. Again, if the requirement
of independent outcomes is not met, then multiplication will lead you astray. When two
events are not independent, the probability that both happen depends on the nature of the
connection between the events. Simple multiplication won’t work.
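Here is a sketch of the multiplication rule in Python (again, our illustration rather than anything from the text): multiply the probabilities for independent dice, then confirm the answer by brute-force enumeration of all 36 equally likely pairs.

```python
from fractions import Fraction
from itertools import product

# Independent outcomes: multiply.
print(Fraction(1, 6) * Fraction(1, 6))  # 1/36

# Check by enumerating every equally likely (die 1, die 2) pair.
pairs = list(product(range(1, 7), repeat=2))
double_sixes = [p for p in pairs if p == (6, 6)]
print(Fraction(len(double_sixes), len(pairs)))  # 1/36 again
```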
Let’s take a moment to compare the multiplication rule with the addition rule. We
saw that the addition rule is used to calculate the probability of any of a series of mutu-
ally exclusive outcomes occurring. You could ask about the probability of getting a six
or a one on a given roll. (They have to be on the same roll to be mutually exclusive
outcomes.) To calculate this, we would add 1⁄6 and 1⁄6 to get 2⁄6, or 1⁄3. The multiplication
rule is instead used to calculate the probability of all of a series of independent outcomes
occurring. You could ask about the probability of getting a six on one roll and a one on
a different roll. (They have to be different rolls or different dice to be independent
outcomes.) To calculate this, we would multiply 1⁄6 and 1⁄6 to get 1⁄36.

TABLE 5.1 Addition, multiplication, and subtraction rules and their conditions

Rule                  Language   Function            Condition                  Result
Addition rule         Any        Disjunction (or)    Mutually exclusive         Probability always increases
Multiplication rule   All        Conjunction (and)   Independent                Probability always decreases
Subtraction rule      Not        Negation (not)      Collectively exhaustive    Probability can be large or small
In these two examples, notice that the addition rule led to a larger probability (closer
to 1) and the multiplication rule led to a smaller probability (closer to 0). This will always
happen. Addition will always increase probability, and multiplication will always decrease
probability. This is because probabilities are always positive numbers between 0 and 1,
and multiplying two numbers in that range (such as two fractions) always yields a smaller
number while adding two positive numbers of any kind always yields a larger number.
This can provide a quick way to remember when to add and when to multiply. Do
you expect the probability to get larger or smaller for the occurrence you’re calculating,
compared to the outcomes that generate it? It’s easier (more probable) to get any of a
one, two, or three on a die roll than each one of these numbers individually: use addi-
tion. Any, or, addition, and larger probabilities go together. And the outcomes linked with
the word or need to be mutually exclusive. It’s harder (less probable) to get a six on all
the first roll, second roll, and third roll than on a single roll: use multiplication. All, and,
multiplication, and smaller probabilities go together. And the outcomes linked with the
word and need to be independent. This is all summarized in Table 5.1.

The Subtraction Rule


Here’s one more mathematical relationship among probabilities. Recall that the total
outcome space—all of the available possibilities—always has a probability of 1. The
subtraction rule makes use of this fact: you can calculate the probability of some outcome
by subtracting the probability of all other outcomes in the outcome space from 1 (the
total probability).
For example, what is the probability of getting anything but a two on a single die roll?
The total probability is 1, and the probability of rolling a two is 1⁄6 (as it is for any other
number from one to six). So, the probability of getting anything but two, or Pr(D = not 2), is:

Pr(D=not 2) = 1 − Pr(D=2) = 1 − 1⁄6 = 5⁄6


The subtraction rule is only for collectively exhaustive outcomes. Just as with the require-
ments placed on the addition and multiplication rules, the requirement of collectively
exhaustive outcomes is crucial. This is what makes probabilities sum to 1. This require-
ment is most easily satisfied with the use of the word not—rolling a two and not rolling
a two, rolling an even number and not rolling an even number, rolling two sixes in a row
but not rolling two sixes in a row. Each of these pairs is collectively exhaustive; any pos-
sible outcome would fall in one or the other category.
So, the main word to prompt you to use the subtraction rule is not, which is
one way of guaranteeing collectively exhaustive outcomes. This is summarized in
Table 5.1.
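In code, the subtraction rule is a one-liner; the sketch below (ours) reuses the die example from above.

```python
from fractions import Fraction

# 'Two' and 'not two' are collectively exhaustive, so they sum to 1.
pr_two = Fraction(1, 6)
print(1 - pr_two)           # 5/6: the probability of anything but a two

# The same trick works for compound outcomes, e.g. NOT rolling two sixes:
print(1 - Fraction(1, 36))  # 35/36
```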

Conditional Probability
There’s one final probability concept we need to discuss: conditional probabilities.
Sometimes it can be useful to know how the probability of some event changes in light
of other events occurring. The conditional probability of an event is the probability of
its occurrence given that some other event has occurred. In the notation we’ve been
developing, we can write the conditional probability of a random variable Y taking the
value y, given that a variable X takes the value x as Pr(Y=y | X=x). The symbol ‘|’ can be
read as given that.
Notice that, for two independent events, the conditional probability of one event given
the other’s occurrence will be the same as the original probability of the event. Indeed,
the concept of conditional probability enables us to more exactly articulate what inde-
pendence amounts to. Two random variables X and Y are statistically independent when
Pr(Y=y | X=x) = Pr(Y=y) and Pr(X=x | Y=y) = Pr(X=x). This means that the outcome x
occurring doesn’t make the outcome y any more or less likely, and the outcome y occur-
ring doesn’t make the outcome x any more or less likely.
If an event y is not statistically independent from an event x, then the probability of
y occurring goes up or down if x occurs. In extreme cases, one event can result in the
probability of another event becoming 1 or 0. For example, the probability of a die roll
resulting in an even number is ½. But the probability of an even number given that you
roll a two is 1, since rolling a two is one way of rolling an even number. The probability
of an even number given that you roll a three is 0, since three is odd. (In both cases,
we’re assuming there’s only one roll of the die.) That is:
Pr(D1=2 or 4 or 6 | D1=2) = 1
Pr(D1=2 or 4 or 6 | D1=3) = 0

In other cases, the statistical dependence is subtler. The probability of an event might
be raised or lowered by the occurrence of another event, but not all the way to 0 or 1.
Consider again the probability of getting two sixes when two dice are rolled, which we
previously calculated to be 1⁄36. We can also ask what the probability of getting two sixes
on two rolls is, given that the first roll yielded a six. The chance of getting two sixes has
gone up if one roll is a six, but it’s still not guaranteed.


Figuring out the conditional probability in cases like this one requires calculation. For
x and y, the values of two random variables, the probability of y occurring given that the
other event x occurs can be calculated using the following conditional probability formula:

Pr(Y=y|X=x) = Pr(Y=y&X=x) / Pr(X=x)

This calculation only works when the probability of x is greater than 0. Think of this for-
mula as a two-step procedure for finding the probability of y given x. First, you limit your
attention only to cases when x occurs. This is the role of Pr(X = x) as the denominator
(the bottom) of the equation. Second, you look within those cases of x occurrences for
occurrences of y. This is the role of Pr(Y = y & X = x) as the numerator (the top) of the
equation. The basic idea is that if the outcomes are restricted to only those cases when
x occurs, this becomes the new outcome space for the variable Y.
Let’s try this out to find the probability of getting two sixes in two dice rolls, given
that the first roll is a six. (To make the scenario more intuitive, perhaps imagine that you
decide to roll the dice one at a time and you’ve rolled the first but not yet the second.)
Plugging this example into the formula gives us:

Pr(D1=6 & D2=6|D1=6) = Pr((D1=6 & D2=6) & D1=6) / Pr(D1=6)

Before moving on, take a moment to figure out why this equation is the right version of
the formula for calculating conditional probabilities.
We can solve this equation by plugging in the probabilities we already know and doing
some simple math. Notice that Pr(D1 = 6 & D2 = 6) and Pr((D1 = 6 & D2 = 6) & D1 = 6)
will be the same probability; in the second, D1 = 6 is just listed twice. The reason why it
shows up twice is because the first roll had to be six (D1 = 6) in order for it to be pos-
sible for both rolls to be sixes. So, plugging in the probabilities:

Pr(D1=6 & D2=6|D1=6) = (1⁄36) / (1⁄6) = 6⁄36 = 1⁄6

One nice thing about starting with this simple example is that we can check the answer.
What is the probability of rolling two sixes given that you’ve already rolled one six? This
is the same as the probability of getting a six on one roll, since that’s exactly what needs
to happen if you are to get two sixes, given that you already have one six. And we know
the probability of getting a six—or any other number, one through six—in a single die
roll is 1⁄6. So, our calculation of the conditional probability gave us the right answer.
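The same answer can be checked mechanically. The sketch below is our illustration, not the book's own code: it implements the conditional probability formula by enumerating the 36 equally likely outcomes of two dice, using a hypothetical helper pr that takes an event expressed as a predicate.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all (die 1, die 2) pairs

def pr(event):
    """Probability of an event, given as a predicate on (d1, d2) pairs."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def both_six(o):
    return o == (6, 6)

def first_six(o):
    return o[0] == 6

# Pr(Y | X) = Pr(Y and X) / Pr(X)
print(pr(lambda o: both_six(o) and first_six(o)) / pr(first_six))  # 1/6
```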
Let’s try our hand at finding a slightly more difficult conditional probability for dice
throws. What’s the probability that you roll a number that is less than four on a die throw,
given that you roll an odd number on that throw? This is the same as asking about the prob-
ability of rolling a one, two, or three (the outcomes less than four) given that you roll a one,
three, or five (the odd outcomes). Applying our conditional probability formula, this yields:

Pr(D=1 or 2 or 3 | D=1 or 3 or 5) = Pr((D=1 or 2 or 3) & (D=1 or 3 or 5)) / Pr(D=1 or 3 or 5)

Notice that the probability of rolling a one, two, or three and rolling a one, three,
or five is the same as the probability of rolling a one or three. Why? Because those

are the only two ways of rolling both an odd number and a number less than four;
see Figure 5.1. Using the addition rule, the probability of rolling a one or three is
2⁄6. We can also use the addition rule to get the probability of rolling a one, three,
or five; this is 3⁄6. Plugging these into the formula above yields (2⁄6)/(3⁄6), which is
equivalent to 2⁄3. The probability of rolling a number less than four given that you’ve
rolled an odd number is 2⁄3; consulting Figure 5.1 should help you convince yourself
that this is the right answer.

FIGURE 5.1 Visualization of the conditional probability of rolling a number less than four given that you roll an odd number
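The same enumeration idea verifies this answer too. In the sketch below (ours), we restrict attention to the odd faces, which form the new outcome space, and count how many of them are less than four.

```python
from fractions import Fraction

odd_faces = [d for d in range(1, 7) if d % 2 == 1]  # {1, 3, 5}
small_odd = [d for d in odd_faces if d < 4]         # {1, 3}

print(Fraction(len(small_odd), len(odd_faces)))     # 2/3, as computed above
```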
Conditional probabilities are a core part of statistical reasoning. Recall the medical test
example at the beginning of this chapter; this involved conditional probabilities. We talked
about the chance that you have strep throat given that the rapid strep test was positive
and also the chance given that the test was negative. We stressed that a positive result
doesn’t guarantee strep throat, and a negative result doesn’t guarantee its absence. So:

Pr(strep = yes | test = positive) < 1


Pr(strep = yes | test = negative) > 0

In particular, data suggest that:

Pr(strep = yes | test = positive) = .95


Pr(strep = yes | test = negative) = .2
We don’t know these probabilities exactly, so we must estimate them from data. This
is because the variables of test (rapid strep test result) and strep (having strep or not)
are like LeBron James’s free throws: the outcomes aren’t equally likely, so the variables
aren’t fair. For this reason, we can’t know the probabilities of the outcomes for rapid
strep tests, having strep throat or not, and free throw success without crunching the data.
But conditional probabilities still give us a way of thinking about your chance of having
strep throat given your test result.
Conditional probabilities also give us a way of formulating our earlier question about
whether LeBron’s free-throw outcomes are independent. We can ask whether this is so
by determining whether the following equality holds:

Pr(shot2 = score | shot1 = score) = Pr(shot2 = score)


This equation states that the probability of LeBron scoring on a second free throw is the
same whether or not he scored on his first free throw. This may not be true; the success
of earlier free throws may well influence the success of later free throws.

Box 5.1 The Gambler’s Fallacy


Recall the case of playing a fair roulette described at the beginning of this chapter.
The reasoning went like this: ‘Since this is a fair roulette wheel, black and red
come up equally often. There has been a streak of black. Therefore, red will come
up very soon’. This reasoning is called the gambler’s fallacy because it is typical
among aficionados of casinos. Unfortunately for gamblers, this reasoning involves
the mistaken idea that the outcomes of games of chance are not statistically
independent. Put another way, those seduced by the gambler’s fallacy believe
that the conditional probability of some later outcome (in this case, landing on
red) increases given that there recently has been a statistically unusual streak of
a different outcome (in this case, landing on black). In fact, each spin of the fair
roulette is statistically independent, like dice throws and coin tosses; there is no
way for the outcome of one spin to influence the outcome of the spin thereafter.
So, the gambler is wrong that red is more likely than black given the recent black
streak; red and black are still equally likely on the next spin.
The gambler’s fallacy applies to much more than games of chance. A family
with three girls still has a roughly 50% chance of a fourth child being a boy, all
else being equal. You are just as likely to get heads as tails when tossing a fair
coin, even if your past 99 coin tosses have all landed heads (although this might
be evidence that your coin isn’t actually fair).
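A quick simulation (our sketch, not part of the box) shows why the gambler is wrong: even after a streak of three heads, the next toss of a fair coin still lands heads about half the time.

```python
import random

random.seed(42)  # arbitrary seed for reproducibility
tosses = [random.choice("HT") for _ in range(200_000)]

# Look at every toss that follows a run of three heads.
next_after_streak = [tosses[i + 3] for i in range(len(tosses) - 3)
                     if tosses[i:i + 3] == ["H", "H", "H"]]

# Independence means the streak carries no 'memory': this prints about 0.5.
print(next_after_streak.count("H") / len(next_after_streak))
```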

We are now set up with some basic concepts and tools for thinking about probabilities.
This is pretty much all you need to know about probability theory in order to understand
the basics of how statistical reasoning works.

EXERCISES
5.6 Define the following seven terms in your own words: random variable, outcome
space, mutually exclusive, collectively exhaustive, total probability, statistical inde-
pendence, conditional probability.
5.7 The outcome space for a standard die, D = {1, 2, 3, 4, 5, 6}, is both collectively
exhaustive and mutually exclusive. For each of the following sets of outcomes, indi-
cate whether the set is mutually exclusive, collectively exhaustive, both, or neither
(for a single die roll).
a. The outcomes 1 and 6
b. The outcomes even and not 6
c. The outcomes odd and even
d. The outcomes not odd and 4
5.8 Define statistical bias in your own words. Then, decide whether each of the follow-
ing situations involves statistical bias, and explain your answers.
a. Throwing a loaded die more likely to roll six than the other numbers
b. Tossing an old, rusted copper penny
c. Publishing research carried out only by Australian scientists
d. A lottery drawing
e. LeBron James’s scoring his next free throw
f. A poll about the outcome of the next elections in the Philippines
g. Surveying how students feel about a new dress code that only business students
must adhere to
5.9 For a fair die with just five sides, state the following in proper notation. (You may
need to use the addition, multiplication, and subtraction rules and the conditional
probability equation; show any calculations where possible.)
a. The outcome space
b. The total probability of the outcome space
c. The probability of getting a four on a single roll
d. The probability of getting anything but a four on a single roll
e. The probability of getting a four on at least one roll if you have three rolls
f. The probability of getting a four on all three rolls (when you have only three rolls)
g. The probability of getting a four given that you rolled an even number on that roll
5.10 Consider an ordinary deck of 52 playing cards, which has the usual sequence of 13
cards in each of four suits (hearts, diamonds, spades, and clubs). You draw a card
at random from the deck. Answer the following questions. For each, explain your
reasoning or show your calculation.
a. What is the probability that the card is the queen of diamonds?
b. What is the probability that the card is any queen?
c. What is the probability that the card is either a queen or a red ace?
d. What is the probability that the card is not a queen or a red ace?
e. Given that you have drawn a queen, what is the probability that the card is a
diamond?
f. If you draw four cards, what is the probability that all will be black?
5.11 You have three bags, each containing two marbles. Bag #1 contains two white
marbles, bag #2 contains two black marbles, and bag #3 contains one white mar-
ble and one black marble. You pick one of the bags at random and take out one
marble. It is white. Given this, what is the probability that the remaining marble from
the same bag is also white? Explain your reasoning.
5.12 Consult the definition of statistical independence. For each of the following pairs of
events, indicate whether you expect them to be statistically independent, and say why.
a. One roulette spin and the roulette spin that comes next
b. Rolling a six and, on that same roll, getting an even number
c. What you ate for breakfast today and what Kate Middleton ate for breakfast
today
d. What you ate for dinner last night and what you will eat for dinner tonight
e. A coin toss and another coin toss
f. Smoking and lung cancer
g. Getting an A on your math exam and getting an A on your history exam
h. You having your birthday in September and your friend having her birthday in
December
i. Purchasing a Ferrari and purchasing a Rolex watch
j. Owning a dog and growing a beard
5.13 Suppose the probabilities of having a boy or a girl child are the same—both are
50%. Answer the following questions about a couple that has two children, showing
any calculations.
a. What is the probability that the couple has two girls?
b. The eldest of the two children is a girl. Given this, what is the probability that
the couple has two girls?
c. At least one of the children is a girl. What is the probability that the couple has
two girls?
Your answers to a, b, and c should all be different numbers. Explain why each is
different from the others. (Hint: (b) specifies that the older child is a girl, while (c)
simply states that one of the two children is a girl without indicating whether it’s the
older or younger child.)

5.3 DESCRIPTIVE STATISTICS

After reading this section, you should be able to do the following:

• Define and calculate for a data set three measures of central tendency: mode, median,
and mean
• Define and calculate for a data set three measures of variation: range, variance, and
standard deviation
• Describe and draw a visualization of a data set based on raw data or from information
about its shape, central tendency, and variation
• Evaluate the direction and strength of correlation using a scatterplot, regression
analysis, or correlation coefficient
Variables and Their Values


Imagine a world in which people are equivalent in all respects: they wear the same clothes,
listen to the same music, eat the same food; they all have the same height and weight
from birth until death; they all go to the same school and have the same grades; their
genes do not differ; and they all have the exact same personality. In this world, nothing
varies. Everything you may want to know about people in general can be easily found
out from learning about a single person, and anything you might want to know about an
individual you can know just by considering the way people are in general.
Fortunately, our world is not like that. The gender of people varies, as do their heights,
weights, personalities, and interests. Variability is an essential feature not only of people
but of everything else that surrounds us too. Rocks and spoons come in different sizes and
shapes, weather changes, and biological populations vary in countless ways. Such varia-
tion is, ultimately, what makes statistics vitally important for understanding our world.
In Chapter 2, we defined a variable as anything that can vary, change, or occur in
different states and that can be measured. By measuring a variable, we obtain the value
of the variable. A variable can have different values in different instances. For example,
different people can have different weights, and one person can have different weights
over time.
Many variables have values that are numbers, amounts, or magnitudes; these are called
quantitative variables. The values of quantitative variables like height, temperature, and
points scored in a season can be expressed with numbers, or ‘scores’. For example, the
number of points scored by a basketball player in an NBA season might be 0, 99, or
4,029. Quantitative variables can be discrete or continuous. The values of discrete vari-
ables, like points scored in a season, can be counted. One can score 99 points or 100
points in a basketball season but not 99.5 points. Variables like temperature or height are
continuous, because the set of possible values they can take consists of all real numbers
in some interval. The temperature might be somewhere between 36 and 37º Fahrenheit;
the only limitation on the number of decimal places is how precise the readings of your
thermometer are.
Not all variables have values that can be captured numerically. Instead, some must be
described with labels and categories. The values of such qualitative variables, such as
gender, blood type, and sport, are descriptive categories. For example, some values of the
variable blood type include A-positive, O-positive, and AB-negative. What are some values
of the variable sport or dietary restriction?
Collecting data is the process of eliciting information from the target phenomenon—
usually by observing or experimentally intervening—and then recording the values of
variables. In our variable world, data can be collected by measuring, counting, interview-
ing, and in many other ways. The proper method of data collection depends on whether
a variable is quantitative or qualitative, along with other considerations. Data about dif-
ferent individuals (people, firms, countries, bacteria, and so on) can give us insight into
differences among those individuals. In contrast, data collected at different times enable
researchers to study changes in the value of a variable over time. Both are important.
We might study how point-scoring varies among NBA players or how LeBron James’s
performance changed over the years.
Notice that it’s possible to get quantitative data about all variables, even qualitative
variables. This is because you can count how many individuals or how often over time a
variable takes on some value or falls within some range of values. Blood type is a qualita-
tive variable, but it’s a quantitative observation that 38% of people are O-positive (the
most common blood type). Philosophy major is another qualitative variable, the value
of which can be either yes or no. But we can go on to consider what percentage of all
students are philosophy majors.
Quantitative data about how often a variable takes on different values enable it to be
treated as a random variable. Recall from earlier in this chapter that we can use prob-
abilities to reason about random variables. The different values a variable can take on
constitutes its outcome space, and we’ve already learned how to reason about the prob-
abilities of different outcomes within that outcome space. Further, recall that the values
of random variables are individually unpredictable but predictable in the aggregate. Such
a variable might be actually random, in the sense that it’s purely a matter of chance
what value it has in a given instance; perhaps coin tosses are like this. Or a variable may
instead be random just in the sense that we don’t know why it has the value it does in
a given instance. No one thinks that LeBron’s scoring so many points is pure chance, for
example, but we can still treat the number of points scored by an NBA player in a season
as a random variable.

Visual Representation of the Values of Variables


Random variables are individually unpredictable but predictable in the aggregate, and
descriptive statistics is all about finding patterns in the values that variables take on. As
mentioned in Section 5.1, descriptive statistics involves summarizing, describing, and
displaying data in a meaningful way. The main question of descriptive statistics is this:
how should data sets—that is, data about the values of some variable—be described?
Numbers and visual representation can both be useful approaches, and we will discuss
both. We’ll start with visual representation.
Charts, tables, and graphs provide visual statistical representations. This is a common
form of descriptive statistics in scientific work, as well as in newspapers and magazines.
With the graphical representation of statistics, the key to success is a simple, clear, and
appropriate presentation of the data. Different kinds of visual representations are helpful
in different circumstances, depending in part on the kind of data and the needs of the
audience.
Pie charts are circles divided into different-sized slices, like a pie, to show how much
of the outcome space falls into different categories. The area inside each slice represents
the percentage associated with some value of the variable: the bigger the percentage, the
bigger the slice of the pie. The categories in a pie chart should be mutually exclusive and
collectively exhaustive so the pie slices don’t overlap and so they add up to a whole pie
(100%). A pie chart is a useful way to represent, for example, the percentage of a coffee
shop’s sales for different beverages—say, americanos, cortados, cappuccinos, chai tea lattes,
espresso macchiatos, and smoothies. See Figure 5.2a.
Pie charts are most effective when representing a variable that can take a small num-
ber of distinct values. The values must be distinct so you know where to draw the
lines between them, and too many narrow slices make the chart difficult to understand.
Besides coffee shop beverages, pie charts can also be used to display the votes received
by each candidate in some election or the distribution of quiz grades earned by students.
Determining the proper variable can sometimes be tricky; in the latter example, the vari-
able is not student but grade (outcome space = {A, B, C, D, F}).


For larger numbers of values—for instance, annual beer consumption in different
countries, as shown in Figure 5.2b—it is better to use a bar chart. Bar charts use
bars of different heights to show amounts for different values of some variable. The
values are typically placed horizontally and equally spaced; then vertical bars are used
to represent the size of each value. This size can correspond to an absolute number,
like the number of students who got an A, B, C, D, or F on the most recent test, or
a percentage, like the percentage of the class who got each grade. Like pie charts,
bar charts are great for categorical variables that can take discrete values. The values
should still be mutually exclusive, but they don’t need to be collectively exhaustive.

FIGURE 5.2 (a) Pie chart of a coffeeshop’s sales; (b) Bar chart of per capita national beer consumption (litres, 2014)


(For a pie chart, the categories need to add up to a whole pie, but bar charts are not
limited in this way.)
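For readers who want to reproduce a chart like Figure 5.2a, here is a sketch using the matplotlib plotting library. This is our illustration, not the book's; the percentages are the slice labels from that figure, and matplotlib is assumed to be installed.

```python
import matplotlib.pyplot as plt

drinks = ["americanos", "cappuccinos", "smoothies",
          "chai tea lattes", "caffè macchiatos", "cortados"]
percentages = [36, 24, 19, 11, 7, 3]  # mutually exclusive; sums to 100

plt.pie(percentages, labels=drinks, autopct="%d%%")
plt.title("A coffee shop's sales by beverage")
plt.show()

# For data like Figure 5.2b, plt.bar(countries, litres) draws a bar chart;
# its categories need not be collectively exhaustive.
```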
When dealing with a quantitative variable with values that don’t obviously fall into
discrete categories, a histogram is better. Like bar charts, histograms are graphical displays
of data that use bars of different heights. Unlike bar charts, the bars of a histogram are not
distinct categories. Instead, the values are grouped into numeric intervals. One example
of a variable for which this works well is the height of students in your class. It is up to
you to decide what numeric intervals to use—whether to group together 5 foot (152.4
centimeters) to 6 foot (182.88 centimeters) tall or to consider 20 centimeter groupings,
like 150–170 centimeters tall. Your decision will partly depend on the range of values, that
is, the difference between the largest and smallest values you consider. If everyone in your
class is between 5 and 6 feet tall, grouping that range together will result in an uninforma-
tive histogram. As with bar charts, the height of each bar may reflect the total number in
each interval or the percentage it represents.
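A histogram sketch in the same spirit (ours; the height data are made up for illustration) shows how the bins argument fixes the numeric intervals. Try 5-centimeter versus 20-centimeter bins to see how the choice changes the picture.

```python
import matplotlib.pyplot as plt

# Hypothetical heights of students in a class, in centimeters.
heights = [152, 155, 158, 160, 161, 163, 165, 165,
           168, 170, 171, 173, 175, 178, 180, 183]

plt.hist(heights, bins=range(150, 190, 5), edgecolor="black")
plt.xlabel("Height (cm)")
plt.ylabel("Number of students")
plt.show()
```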
Histograms and bar charts display visually some important features a data set can
have. If a histogram has a single peak, this shows that one value is the most common.
The most common value is called the mode, and a data set with just one such peak is
called a unimodal distribution. If there are two different peaks of similar heights,
corresponding to the two most common values of the variable, then the histogram displays
a bimodal distribution.
difference between these distributions. A common distribution of grades is unimodal:
it has one peak where the most common grades occur—often somewhere in the range
of B to C. Math and logic courses tend to have bimodal distributions instead: they
have two peaks, one near the top of the grading scale and the other in the middle or
lower part of the scale. See Figure 5.3 for example histograms of unimodal and bimodal
distributions.

FIGURE 5.3 (a) Histogram of a unimodal grade distribution; (b) Histogram of a bimodal grade distribution

Finally, if the height of the different bars in the histogram is the same
for all values, then it shows a uniform distribution where all values are equally likely.
Grading distributions are rarely uniform. In contrast, 1,000 dice throws should have an
approximately uniform distribution across the individual outcomes of one, two, three,
four, five, and six.
Apart from examining the number of peaks in a histogram or bar chart, it can be
useful to determine whether the graph is symmetric—that is, whether its right and
left portions are the same. Symmetric graphs can have (1) a uniform distribution, (2) a
ᑌ-shape, or (3) a ᑎ-shape. A ᑌ-shape is a bimodal distribution, where large and small
values are the most common. A ᑎ-shape is a unimodal distribution with the most com-
mon values clustered around the middle, with decreasingly common outcomes as the
values get higher and lower. This ᑎ-shaped distribution, called a bell curve or normal
distribution (also known as Gaussian distribution), is especially important in statistics.
We’ll discuss normal distributions further in Chapter 6. For examples of these three
shapes of symmetric distributions, as well as three examples of asymmetric distribu-
tions, see Figure 5.4.

Measures of Central Tendency: Mean, Median, and Mode
Let’s move on to quantitative characterization of data sets. Data sets usually exhibit
two important features: central tendencies and variability. Consider a ᑎ-shaped
histogram, like in Figure 5.4(c). There is one peak at the center. This peak shows
the central tendency of the data set—the most common group of values of the
variable. There are also two ‘tails’, the values that are less and less common the further
away they are from the most common group in the middle. The distribution of
occurrences across all these values, even the uncommon ones, is the variability of the
data set.

FIGURE 5.4 Examples of (a) uniform, (b) ᑌ-symmetric, and (c) ᑎ-symmetric distributions; (d) Examples of asymmetric distributions

The central tendency and variability of a data set can both be measured. This enables
us to summarize the main features of a data set with just a few simple numbers. Let’s
consider measures of central tendency first.
Imagine this situation. Your instructor has just returned the class’s first quiz. You
see that your grade is 6⁄10; your percentage of correct answers is 60%. How should you react? Perhaps you judge that you performed poorly—a 60% is quite low, isn’t it? Yet another reaction might be to withhold judgment until you have additional information. You might want to compare scores with your classmates or inquire how the class performed as a whole or ‘on average’. This additional information about the distribution of scores would help you know whether you did poorly on the quiz and, if so, how poorly.
Imagine the students’ grades are as shown in Table 5.2. Your instructor can provide you with three different answers to the question of how the class on average did on the quiz. These correspond to three different measures of the central tendency of a distribution (of grades, or anything else). These measures are: the mode, median, and mean of the distribution.

TABLE 5.2 Imagined data set and central tendencies for 17 student scores on 10-point Quiz 1

Student   Score     Student   Score
#1        0.0       #10        8.0
#2        5.0       #11        8.0
#3        5.0       #12        8.5
#4        6.0       #13        8.5
#5        7.0       #14        9.0
#6        7.0       #15        9.0
#7        7.0       #16        9.0
#8        7.0       #17       10.0
#9        7.5

The mode is the most frequent or most numerous value in the data set. As you can
see from the data in Table 5.2, the mode of the class’s scores is 7⁄10. Four students scored
a 7, which makes this more common than any other score.
The mode can be informative, and for some qualitative variables, it may be the only
measure of central tendency that can be employed. However, even for a unimodal bell
curve, the mode may not reflect the central tendency of a distribution well. Notice
from the list of ordered scores that a 7, although the most frequent score, is lower
than more than half of the students’ grades. It is also possible for there to be more
than one mode in a distribution, which also limits the ability of a mode to capture
the central tendency. What if four students also had scored a 9? Finally, if all values
were different, then none would be more common than any other; such a distribution
would have no mode at all.

The median is the middle value in a distribution when the values are arranged from
the lowest to the highest (or from highest to lowest). The median value splits the dis-
tribution exactly in half: half of the values are on one side, the other half on the other
side. In our example, the median will be whatever score was earned by the student who
did the ninth best/worst. There were 17 students total, and so eight scored above that
ninth student while eight scored below. Student #9 earned a 7.5 on the quiz; this is the
median quiz score. When the distribution has an even number of values, the median is
halfway between the two middle values.
The median is often the preferred measure of central tendency when the distribution
is not symmetrical. This is because the median is not strongly affected by outliers,
that is, by data values that are remarkably different from the rest (like student #1,
who scored a 0 on the quiz). But that strength is also a weakness, depending on the
nature of the information you want to capture. You might want the central tendency
measure to be different when some students bombed the quiz instead of all having
scores grouped around the middle value. Unlike the mode, the median cannot be
identified for qualitative variables, since the values of these types of variables cannot
be ordered from lowest to highest. (It makes little sense to ask whether cappuccino is
lower or higher than cortado.)
The mean, also called the average, is the sum of all values in the data set divided by
the number of outcomes. All together, the students’ scores sum to 121.5; dividing that
sum by 17, the number of students, gives us a mean grade of 7.1 on the quiz. Like the
median, the mean cannot be calculated for categorical or qualitative data, as such values
cannot be used in addition. Unlike the median, the mean is affected by outliers; the
mean is pulled in the direction of the distribution’s longer tail. The student who scored
a 0 on the quiz pulled down the mean score of 7.1 by nearly half a point, compared to
the median score of 7.5.
When a distribution is unimodal and perfectly symmetrical, the mode, median, and
mean coincide, and they are all exactly in the middle of the distribution. Asymmetric
and multimodal distributions can lead these measures of central tendency to be radically
different from one another.
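
Each of these measures is easy to compute directly. Here is a minimal Python sketch (our own illustration, using the standard library) that checks the mode, median, and mean reported above for the Quiz 1 scores in Table 5.2:

import statistics

# The 17 Quiz 1 scores from Table 5.2.
quiz1 = [0.0, 5.0, 5.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.5,
         8.0, 8.0, 8.5, 8.5, 9.0, 9.0, 9.0, 10.0]

print(statistics.mode(quiz1))            # 7.0: the most frequent score
print(statistics.median(quiz1))          # 7.5: the middle (9th) of the 17 ordered scores
print(round(statistics.mean(quiz1), 1))  # 7.1: the sum 121.5 divided by 17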

Measures of Variability: Range, Variance, and Standard Deviation
We’ve said that data sets exhibit variation; this is at the heart of what it is to be a random variable. Like central tendency, variation can be measured. Measures of variation
provide us with a summary of the ‘spread’ of the values in a data set—that is, the degree
to which they vary.
Information about variation can differentiate data sets that have the same mean, median,
and mode. Let’s return to our simple imaginary example of quiz grades. Suppose the next
quiz has the exact same mode, median, and mean as the scores shown in Table 5.2. This
suggests the class did equally well on this next quiz. And, on average, they did. But this
isn’t the whole story; compare the quiz grades in Table 5.3 to those in Table 5.2. What
differences do you notice?

TABLE 5.3 Imagined data set and central tendencies for 17 student scores on 10-point Quiz 2

Student   Score     Student   Score
#1        4.0       #10        7.5
#2        5.0       #11        8.0
#3        5.0       #12        8.0
#4        6.0       #13        8.0
#5        7.0       #14        8.5
#6        7.0       #15        8.5
#7        7.0       #16        8.5
#8        7.0       #17        9.0
#9        7.5

This second data set has the same mean, median, and mode as the quiz scores from
Table 5.2 but with much less variation in score. There is more variation in the grades on
Quiz 1 than Quiz 2. Consider that the lowest score on Quiz 1 was a 0 and the highest
grade a 10, while the lowest score on Quiz 2 was a 4 and the highest grade a 9. Visualizing
the two data sets with a histogram makes it easier to spot the difference; see Figure 5.5.
As this illustrates, measures of central tendency don’t capture all the information about
a distribution; you also need measures of variability.
There are three primary measures of variability: the range, variance, and standard
deviation of a distribution. The range is the difference between the smallest and largest
values in a data set. For Quiz 1, the range was 10, since the lowest score was 0 and
the highest score was 10; for Quiz 2, the range was 5, since the lowest score was 4 and the
highest score was 9.
Range does not take outliers into account very well, since it doesn’t specify anything
about the distribution of scores within the range. In other words, range won’t tell you
whether the distribution’s tails are skinny or thick—the ‘spread’ of the data. This can be
done with a measure of the distance of values from the mean, which is what the measures
of variance and standard deviation do. These measures of variation summarize the spread,
or how close the various values are to the mean.
Population variance (σ²) is the average of the squared differences of values from the mean, that is:

σ² = ∑(value − mean)² / n

The capital Greek letter sigma (‘∑’) indicates that you should sum over all instances, and n is the number of values in the data set. Let’s discuss this formula in more detail.

FIGURE 5.5 (a) Histogram of the Quiz 1 grade distribution in Table 5.2; (b) Histogram of the Quiz 2 grade distribution in Table 5.3

Notice that calculating variance requires knowing the mean of a data set. After cal-
culating the mean, the first step to finding the variance is to find the difference of each
value from that mean; this is the distance between the mean and each value in the data
set. Finding this difference will show whether the values tend to vary a lot or only a little
from the mean. Next, each difference is squared. (Otherwise the differences on either side
of the mean would cancel each other out, since the difference for values greater than the
mean is positive and the difference for values less than the mean is negative.) Finally, find
the average of those squared differences by adding them together (∑) and then dividing
by the number of values (n).


Let’s find the variance for the scores on Quiz 1 and Quiz 2. For each, the mean is 7.1
(rounded to one decimal point).
For Quiz 1, the population variance calculation looks like this:

[(0.0−7.1)² + (5.0−7.1)² + (5.0−7.1)² + (6.0−7.1)² + (7.0−7.1)² +
(7.0−7.1)² + (7.0−7.1)² + (7.0−7.1)² + (7.5−7.1)² + (8.0−7.1)² +
(8.0−7.1)² + (8.5−7.1)² + (8.5−7.1)² + (9.0−7.1)² + (9.0−7.1)² +
(9.0−7.1)² + (10.0−7.1)²] / 17 = 5.0

And for Quiz 2:

[(4.0−7.1)² + (5.0−7.1)² + (5.0−7.1)² + (6.0−7.1)² + (7.0−7.1)² +
(7.0−7.1)² + (7.0−7.1)² + (7.0−7.1)² + (7.5−7.1)² + (7.5−7.1)² +
(8.0−7.1)² + (8.0−7.1)² + (8.0−7.1)² + (8.5−7.1)² + (8.5−7.1)² +
(8.5−7.1)² + (9.0−7.1)²] / 17 = 1.9

Comparing the variances for the two quizzes makes it clear that the scores on Quiz 1
had more variation than those on Quiz 2: 5.0 to 1.9, respectively.
The final measure of variation we will discuss is standard deviation (σ), which is
calculated directly from the variance. The standard deviation is just the square root of
the population variance:

σ = √[∑(value − mean)² / n]

The standard deviation for Quiz 1 is 2.2 (the square root of 5.0). For Quiz 2, it is 1.4
(the square root of 1.9).
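
If you would rather not square seventeen differences by hand, the calculation is easy to script. Here is a minimal Python sketch (ours, using the standard library’s population variance and population standard deviation functions) that also computes the range:

import statistics

quiz1 = [0.0, 5.0, 5.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.5,
         8.0, 8.0, 8.5, 8.5, 9.0, 9.0, 9.0, 10.0]
quiz2 = [4.0, 5.0, 5.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.5,
         7.5, 8.0, 8.0, 8.0, 8.5, 8.5, 8.5, 9.0]

for scores in (quiz1, quiz2):
    rng = max(scores) - min(scores)      # range: largest minus smallest value
    var = statistics.pvariance(scores)   # population variance (sigma squared)
    sd = statistics.pstdev(scores)       # population standard deviation (sigma)
    print(rng, round(var, 1), round(sd, 1))

# Quiz 1 prints: 10.0 5.0 2.2
# Quiz 2 prints: 5.0 1.9 1.4
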
The standard deviation provides us with a sort of yardstick for measuring variation.
It is a number against which you can assess individual values or groups of values to see
how far they are from the mean, relative to total variation in the data set.
If the histogram describing our data set is bell-shaped (unimodal and roughly sym-
metric), then around 68% of the values fall within one standard deviation of the mean,
and around 95% of the values fall within two standard deviations of the mean—that
is, fall within the distance that’s twice as long as the standard deviation value. And
virtually all of the values lie within three standard deviations of the mean. Look at
Figure 5.6. This shows the locations of one, two, and three standard deviations for a
probability distribution with a bell-shaped histogram. The standard deviation distances
will, of course, change depending on the spread of the data. The standard deviation
value reflects this by being a relatively large number (lots of spread) or a relatively
small number (little spread).
The scores in our quiz examples are roughly normally distributed. So, using the stan-
dard deviations we have calculated, we can say that, for Quiz 1, roughly 68% of the quiz
scores are between 4.9 and 9.3 (7.1, the mean, ± 2.2, the standard deviation). Roughly
95% are between 2.7 (the mean minus two standard deviations) and 10 (the maximum
score, which is less than the mean plus two standard deviations). For Quiz 2, in contrast,

FIGURE 5.6 Standard deviation in a normal distribution; the values within one standard deviation of the mean account for 68.27% of the values in the data set, while those within two standard deviations account for 95.45%, and those within three standard deviations account for 99.73%

roughly 68% of the scores are between 5.7 and 8.5, 95% between 4.3 and 9.9. This is a
smaller range than for Quiz 1. The standard deviation nicely captures the difference in
score variation between these two quizzes.
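
The interval arithmetic here is simple: add and subtract multiples of the standard deviation from the mean. A small sketch (our own illustration; the helper function is hypothetical) of the ranges just quoted:

def sd_ranges(mean, sd):
    # Roughly 68% of a bell-shaped distribution lies within one
    # standard deviation of the mean, and roughly 95% within two.
    print("68% within", round(mean - sd, 1), "to", round(mean + sd, 1))
    print("95% within", round(mean - 2 * sd, 1), "to", round(mean + 2 * sd, 1))

sd_ranges(7.1, 2.2)  # Quiz 1: 4.9 to 9.3, and 2.7 to 11.5 (capped at 10 in practice)
sd_ranges(7.1, 1.4)  # Quiz 2: 5.7 to 8.5, and 4.3 to 9.9
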
Mean and standard deviation are the most commonly reported summary statistics for
a data set. Together, the mean and standard deviation capture the central tendency and
variability around that central tendency in a way that is informative and—as we will see
in the next chapter—central to statistical inference.

Correlation
Most research in the natural and social sciences is concerned not just with variables but
also with the relationships among them. For instance, some years ago, French researchers
studied whether people drink more alcohol when they hang out in loud bars. They found
a positive correlation between the variable decibel level in bar and the variable alcohol
consumption. If you ask whether level of marijuana consumption is different in different
states in the US, you are interested in the relationship between the variable marijuana
consumption and the variable state. Or if you wonder whether being able to read at a
younger age predicts salary level in adulthood, you are again asking about the correlation
between the values of two variables.
Recall our earlier definition of statistical independence. When two variables are sta-
tistically independent, the value of one variable does not raise or lower the probability of the other variable taking on any given value. Variables that are not statistically inde-
pendent are correlated variables: the value of one raises or lowers the probability of the
other having some value. For example, the correlation found by those French researchers
between loud bars and alcohol consumption means that a person going into a loud bar
is more likely to have, say, five alcoholic drinks than is a person going into a quiet bar
(Guéguen et al., 2008).
When greater values for one variable are related with greater values for a second
variable, these are said to be directly or positively correlated. The decibel level of a bar
and alcohol consumption were found to be positively correlated. When greater values
for one variable are related with smaller values for a second variable, these are said to
be indirectly or negatively correlated. Perhaps level of alcohol consumption on a given
evening is negatively correlated with waking up early the following morning; the more
alcohol someone drinks, the less likely that person is to wake up early.
For quantitative variables, scatterplots can provide a visual representation of whether
they are correlated and how. Scatterplots are graphs in which the values of one variable
are plotted against the values of the other variable. For example, the horizontal axis of
the plot, the X-axis, may indicate the decibel level in different bars, and the vertical axis,
the Y-axis, the average number of drinks consumed in those different bars, such as in
Figure 5.7.
A scatterplot that shows a positive correlation between variables will have dots that
tend to form an upward-sloping line from left to right. As the values of one variable
get larger, the values of the other variable also tend to get larger. However, there can be
exceptions—dots that vary from that general pattern. Some very quiet bars may serve a
lot of drinks, and some very loud bars may serve few drinks. But this doesn’t eliminate
the general correlation between decibel level and alcohol consumed.

FIGURE 5.7 An imagined scatterplot of the relationship between alcohol consumption and decibel level in bars (X-axis: loudness of a bar in decibels; Y-axis: number of drinks consumed per person)

A scatterplot that shows a negative correlation between variables will have dots
that tend to form a downward-sloping line from left to right. As the values of one
variable get larger, the values of the other variable tend to get smaller. Of course, there
can be dots that vary from this pattern as well without interfering with the negative
correlation.
What would you expect for a scatterplot of two variables that aren’t correlated? Well,
there won’t be an upward sloping line, and there won’t be a downward sloping line. What
you usually see are dots all over the place, with no pattern between the values of one
variable and the values of the other variable.

Measures of Correlation
One way to summarize the relationship between variables is called regression analysis.
The basic idea is to find the best-fitting line through the points on a scatterplot.
Modern regression analysis was invented by Sir Francis Galton (1822–1911). Galton
was the half-cousin of Charles Darwin, and he had many interests: he was a geographer,
meteorologist, tropical explorer, inventor of fingerprint identification, eugenicist, and best-
selling author. Galton was obsessed with measurement (he even tried to measure physical
beauty). In 1875, Galton began to investigate heredity: Why do successive generations
remain alike in so many features? And how do offspring vary from their parents? One of
his projects was to measure the diameter and weight of thousands of mother and daughter
sweet pea seeds (see Table 5.4).
After plotting his results, Galton hand-fitted a line to his data as best as he could. He
wanted to find the line that best fit his data. Intuitively, this is the line that runs closest
to the points scattered on a plot. Galton aimed to draw a line that minimized the sum
of the distances of the points on the plot from that line, while still maintaining a straight line. In this sense, it can be considered the best fit. Figure 5.8 shows the best-fitting straight line for Galton’s data.

TABLE 5.4 Average diameter of parent/offspring sweet pea seeds measured in 1/100ths of an inch (Adapted from Galton, 1889, p. 226)

Seed   Mother   Daughter
#1     15       15.3
#2     16       16.0
#3     17       15.6
#4     18       16.3
#5     19       16.0
#6     20       17.3
#7     21       17.5

FIGURE 5.8 A regression analysis of Galton’s data on the diameter of sweet pea seeds (X-axis: diameter of mother sweet pea seeds; Y-axis: diameter of daughter sweet pea seeds)

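Galton fitted his line by hand; the standard modern approach is the least-squares line, which minimizes the sum of the squared vertical distances to the points. Here is a minimal Python sketch (our illustration of the textbook least-squares formulas, not necessarily Galton’s exact procedure) for the data in Table 5.4:

# Galton's sweet pea data from Table 5.4 (1/100ths of an inch).
mothers = [15, 16, 17, 18, 19, 20, 21]
daughters = [15.3, 16.0, 15.6, 16.3, 16.0, 17.3, 17.5]

n = len(mothers)
mean_x = sum(mothers) / n
mean_y = sum(daughters) / n

# Least-squares slope and intercept of the line y = a + b*x.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(mothers, daughters))
     / sum((x - mean_x) ** 2 for x in mothers))
a = mean_y - b * mean_x

print(round(b, 2), round(a, 2))  # slope of about 0.34, intercept of about 10.11

That the slope comes out well below 1 reflects the ‘regression to the mean’ discussed below: daughter seeds tend to be less extreme in size than their mothers.
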
As with the dots of a scatterplot, when there is a positive correlation, the best-fitting
line will have an upward-sloping trajectory as it moves right, and when there’s a nega-
tive correlation, the line will have a downward-sloping trajectory as it moves right. The
size of parent and offspring sweet pea seeds are positively correlated: the slope of the
line goes from the bottom left to the top right of a scatterplot. In contrast, speed and
accuracy in carrying out a task are negatively correlated: as speed increases, accuracy
decreases. In this case, the slope of the line goes from upper left to lower right of the
scatterplot.
A regression analysis also gives information about the correlation strength: how
predictable the values of one variable are based on the values of the other variable.
The closer the dots are to the best-fitting line, the stronger the correlation, that is, the
more linked the values of the two variables. (Notice that the slope of the line is not
related to correlation strength; the slope only gives information about how the values
of the variables tend to relate to each other.) A maximum strength correlation, often
called a perfect correlation, will have all the dots directly on the regression analysis
line. A very weak correlation will have dots that almost look uncorrelated; they fall
all over the place, far from the line, but there’s just a hint of a relationship between
the values of the two variables. In Figure 5.9, you can see examples of very strong
and weaker correlations with the same relationship among variables and so identical
regression analysis lines.

FIGURE 5.9 Scatterplots depicting correlational strength and direction: strong positive correlation (nearly +1.0), weak positive correlation (roughly +.5), no correlation (roughly 0), weak negative correlation (roughly −.5), strong negative correlation (nearly −1.0)

From his regression analysis, Galton saw that as the size of a mother sweet pea seed
increased, so did the size of its daughter sweet pea seed. However, the daughter seeds
tended to be less extreme in size compared to their mother peas: they ‘regressed’ back
toward average pea size. Extremely large mother seeds grew into plants whose daughter
seeds tended not to be as extremely large, and extremely small mother seeds grew into
plants whose daughter seeds tended not to be as extremely small. Galton called this loss
of extremity the regression to the mean. It can be explained as just an effect of variability: if a variable has an extreme value, then most other values that variable can have are
less extreme. So, even though mother and daughter pea sizes are positively correlated,
extreme-sized peas tend to have less extreme-sized daughter peas (but even this is some-
thing that can vary). The same also holds true in reverse: extreme-sized peas usually have
less extreme-sized mother peas.
Galton also determined a correlation coefficient for mother and daughter pea size. A
correlation coefficient provides information about the direction and strength of correla-
tion. It has two parts: a positive (‘+’) or a negative (‘−’) sign to indicate positive or nega-
tive correlation respectively and a number between 0 and 1 to indicate the strength of
the correlation. This is a measure of the dispersion of the points on the scatterplot. The
stronger the relationship between the two variables, the closer the correlation coefficient is to 1, when the value of one variable is a perfect predictor of the value of the other variable. A value of 0 means that the points on the plot are randomly scattered, and the two variables are statistically independent: the value of one gives no information about the value of the other.

FIGURE 5.10 Francis Galton
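
The standard modern measure is Pearson’s correlation coefficient; the text does not say exactly which coefficient Galton computed, so take this minimal Python sketch (ours) simply as an illustration, again using the pea data from Table 5.4:

import math

mothers = [15, 16, 17, 18, 19, 20, 21]
daughters = [15.3, 16.0, 15.6, 16.3, 16.0, 17.3, 17.5]

n = len(mothers)
mean_x = sum(mothers) / n
mean_y = sum(daughters) / n

# Pearson's r: the sum of paired deviations from the two means, divided
# by the square root of the product of the summed squared deviations.
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(mothers, daughters))
den = math.sqrt(sum((x - mean_x) ** 2 for x in mothers)
                * sum((y - mean_y) ** 2 for y in daughters))

print(round(num / den, 2))  # about 0.9: a strong positive correlation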

Box 5.2 Galton’s Work on Intelligence and Bias in Science


Francis Galton was a Victorian polymath who left a legacy of intellectual work
in many scientific fields—geography, meteorology, statistics, and biology, among
others. Galton devoted much of his time to exploring variation in human popula-
tions and its implications. He focused on multiple ways in which humans vary:
intelligence, height, facial features, and fingerprint patterns. He sought to discover
the degree to which such human traits were inherited, passed on from parents to
children. For example, if human intelligence is hereditary, he reasoned, then a
notable man’s closest male relatives were the most likely to exhibit greater intel-
ligence in comparison to unrelated men in the general population. Based on his
investigations of genealogical records and biographical sources, Galton concluded
that human intelligence is hereditary. He also pioneered the modern field of
behavior genetics with his use of twin studies. Galton wanted to find out whether
twins who were similar at birth differed in their mental and physical abilities when
raised in dissimilar environments.


In its time, Galton’s work on heritability was at the cutting edge of statistical
reasoning. Still, there are many troubling features of his investigations. First, there
were problems with his methods. His findings based on genealogical records relied
on anecdotal and unsystematic evidence and ignored the possibility of environ-
mental influences on traits. His statistical work improved on this but was beset by
one glaring problem: there is clear evidence that at least some of Galton’s statistical evidence was fabricated. These and other methodological problems were related
to an even bigger issue. Galton’s findings tended to confirm the superiority of
white, wealthy Englishmen. Women were omitted from his genealogical analysis
because he maintained that notable achievement was principally a male preroga-
tive. The shortcomings of his methods allowed Galton’s science to simply confirm
his biases and expectations—that white, upper-class men are superior to the poor,
to people of color, and to women.
Worse yet, this shaky research was used to promote unethical programs. Galton
coined the term eugenics in 1883 (from the Greek for ‘well-born’), which was
initially a social philosophy aiming to improve the genetic pedigree of societies
by pairing certain individuals and not others. Galton advocated eugenic marriages
and the use of social incentives to encourage ‘able’ couples to have children. In
the 20th century, eugenics movements in the US, Britain, and other countries
adopted various policies that restricted human liberties while threatening human
dignity. These included forced birth control, marriage restrictions, racial segrega-
tion, compulsory sterilization, and even genocide (see Gillham, 2001). Distress-
ingly, this is just one example from a long string of scientific research that has
been misused to justify racism, sexism, and classism. The role of values in science
and the role of science in society are a main focus of Chapter 8.

Summary
Let’s end with a summary of the measures of central tendency and variation introduced in
this section. You calculate a population’s mean by summing all the values, xᵢ, and dividing
by the total number of outcomes, n, in your data set.

mean = (x₁ + x₂ + … + xₙ) / n

To find the mode or the median, you should begin by ordering all the values in the data
set from smallest to largest. To find the mode, count how many times each value occurs;
the value that occurs most often is the mode. There is no mode if no value appears more
often than any other, and there are two (bimodal) or more modes if two or more values
occur most often. To find the median, search for the value in the very middle of the list;
the middle value is the median. If there is an even number of outcomes, then the average
of the two values in the middle is the median.
To find the range, a measure of variation, you also need to begin by ordering the values
from smallest to largest. Then, simply subtract the smallest value from the largest value
in the data set.
You can calculate the population variance, σ², by first finding the mean. Then, for each value in the data set, subtract the mean from it and square each result. Finally, calculate the average of those squared results (that is, sum the results and divide by the number of values):

σ² = ∑(value − mean)² / n

To find a population’s standard deviation, σ, compute the square root of the population
variance:

σ = √[∑(value − mean)² / n]

For a normal or bell-shaped distribution, 68% of outcomes fall within one standard deviation of the mean and 95% of the outcomes fall within two standard deviations. Virtually all (99.7%) fall within three standard deviations.

EXERCISES

5.14 Define the concepts of central tendency and variability in your own words, and
describe the importance of each.
5.15 List three measures of central tendency and three measures of variability. For each
measure, describe its advantages and any drawbacks or limitations.
5.16 Divide the following list into qualitative variables and quantitative variables. For
each quantitative variable, say whether it is discrete or continuous.
a. The height of a mountain
b. The color of starfish
c. The breed of a dog
d. The winner of Wimbledon
e. The population of a city
f. The outcome of a throw of a die
g. The GDP (gross domestic product) of a country
h. Type of pizza
i. The number of pizzas one person eats per week
j. The amount of salt in the Atlantic Ocean
5.17 Label the type of visualization found in each of the following figures as a bar chart,
scatterplot, pie chart, or histogram. Then, for each, describe the data portrayed,
including variable(s), characteristics of the distribution, and anything notable or
surprising about the data.
5.18 Look back at Table 5.4, a data set used by Galton in his studies of heredity. Calcu-
late the mean, median, and mode of this data set. Next, calculate the range, vari-
ance, and standard deviation.

FIGURE 5.11 (a) Average expenditure per dollar of Indiana property tax, 2013 (pie chart: Schools 42%, City/Town 19%, County 17%, TIF 8%, Special District 7%, Library 4%, Township 3%); (b) Composite GRE score and academic major, based on college graduates who tested 08/01/11–04/30/14 (chart 2015 Philosophy at University of New Orleans; data 2014 ETS); (c) Iris petal length; (d) Number of digs performed and amphorae unearthed (teams A, B, and C)

5.19 Consider the following three data sets:


A = {4, 10, 11, 7, 15}
B = {10, 10, 10, 10, 10}
C = {1, 10, 19, 1, 22}
a. Decide which data set has the largest standard deviation without doing any
calculations.
b. Calculate the mean, mode, and median of each data set.
c. Calculate the range, variance, and standard deviation of each data set.
5.20 If the standard deviation of a data set is zero, what can you conclude about the data
in that set? What can you conclude about the data if the mean is zero?
5.21 Draw scatterplots with five points each showing correlations of +1.0, –1.0, and 0.
Draw a regression line on each. Finally, draw a second scatterplot showing a cor-
relation of +1.0 but with a different regression line.
5.22 Consider Table 5.5, which describes the Titanic’s passengers and their fates.
Describe any advantages, disadvantages, or problems with visualizing this data using each of a pie chart, bar chart, histogram, and scatterplot. Then, choose which
approach is best for this data and draw that type of visualization of the data.
Explain your choice of visualization type.
5.23 Answer the following questions based on the data from Table 5.5 and/or your visu-
alization of the data in exercise 5.22.
a. What is the percentage of survivors for each class, gender, and age group?
Note that you’ll need to find the mean for one grouping across the other group-
ings to calculate these percentages.
b. Which group had the highest mortality rate in this disaster? Which group was
most likely to survive?
c. Write out the different values for each of the variables: class, gender, and age
group. Order these from those that correlated most with survival (either posi-
tively or negatively) to those that correlated the least with survival. For each,
indicate whether the correlation was positive or negative.
d. Can you guess from the data anything about the code of conduct on the Titanic
for who should be saved first in a life-threatening situation?

TABLE 5.5 Data on Titanic survivors

Age/Gender  Class/Crew    Number Aboard  Number Saved  Number Lost  % Saved  % Lost
Children    First Class         6              5             1        83%      17%
Children    Second Class       24             24             0       100%       0%
Children    Third Class        79             27            52        34%      66%
Women       First Class       144            140             4        97%       3%
Women       Second Class       93             80            13        86%      14%
Women       Third Class       165             76            89        46%      54%
Women       Crew               23             20             3        87%      13%
Men         First Class       175             57           118        33%      67%
Men         Second Class      168             14           154         8%      92%
Men         Third Class       462             75           387        16%      84%
Men         Crew              885            192           693        22%      78%
Total                        2224            710          1514        32%      68%

<https://en.wikipedia.org/wiki/RMS_Titanic#Survivors_and_victims>


FURTHER READING

For more on the importance of probability and statistics in your life, see Gigerenzer, G.
(2002). Calculated risks: How to know when numbers deceive you. New York: Simon &
Schuster.
For an introduction to probability theory, see Olofsson, P. (2007). Probabilities: The little
numbers that rule our lives. Hoboken: Wiley & Sons.
For a concise discussion of graphs that badly visualize a data set with several real-life
examples, see the website URL = <www.statisticshowto.com/misleading-graphs/>
For more on misleading or ‘spurious’ correlations, see Vigen, T. (2015). Spurious cor-
relations. New York: Hachette Books. Several fun examples of spurious correla-
tions are also available at the following website URL = <www.tylervigen.com/
spurious-correlations>
For additional information about Francis Galton and the birth of eugenics, see Gillham,
N. W. (2001). Sir Francis Galton and the birth of eugenics. Annual Review of Genetics,
35, 83–101.
CHAPTER 6

Statistical Inference

6.1 GENERALIZING FROM DESCRIPTIVE STATISTICS

After reading this section, you should be able to do the following:

• Describe how probability theory extends the reach of statistics
• Distinguish frequency distribution from probability distribution
• Define normal distribution and characterize one based on mean and standard deviation
• Describe how statistical methods can be used to make estimates about a population
from a sample
• Assess samples, statistical estimation methods, and the conclusions of statistical
inference

From Description to Inference


Descriptive statistics—a primary focus of Chapter 5—enables scientists to summarize and
represent data sets in meaningful ways. We’ve seen how to do so visually, with charts, plots,
and graphs, and also with numbers—including means, standard deviations, and correlation
coefficients. Describing data sets, however, regularly falls short of what scientists—and
everyone else—are interested in.
Two weeks before the 2016 US presidential election, Democratic candidate Hillary
Clinton was ahead of Republican candidate Donald Trump in nearly all polls, ranging from
a 14% lead for Clinton to a 1% lead for Trump. Many were thus surprised when Donald
Trump secured an Electoral College victory and became the next president of the United
States of America. The tools of descriptive statistics provide many ways to represent the
data from these pre-election polls. For instance, we have represented them numerically
using percentages. They also can be represented visually—say, in a bar chart showing the
comparative proportion of votes for Clinton, Trump, and third-party candidates in a given
poll or averaged across polls.
But this wasn’t really what anyone ultimately wanted to know. No one cared much
about the voting intentions of the individuals who happened to be polled or about how
those potential voters felt two weeks before the election. What everyone was interested
in knowing was something about all US voters: how they would actually cast their bal-
lots. Everyone wanted to use the data collected in these pre-election polls not just to
describe but to make predictions about how the election itself would turn out. But we
can’t get answers like that using descriptive statistics. It requires looking beyond the data we have, using it to make inferences about a larger group or about new observations. For
these kinds of interests—predicting the future and generalizing from a sample—we need
inferential statistics.
Inferential statistics is an important form of inductive reasoning that extends the reach
of descriptive statistics with the use of probability theory. Some of the basic principles and
concepts of probability theory were discussed in Chapter 5. Recall that random variables
have values that can’t be predicted individually but that can display patterns over many
instances. Despite the name, the values that random variables take on may not truly be
random. Coin tosses, dice throws, LeBron’s free throws, voting intentions, temperatures
on the days of September, and the decibel level in a bar can all be treated as random
variables. For any of these variables, inferential statistics allows us to analyze relevant
data sets to predict yet-to-be-measured values of those variables. For example, one might
assess from a sequence of heads and tails whether the coin is fair, predict from LeBron’s
past record whether his free throw success will improve over time, infer the efficacy of
a medical drug from observed treatment effects, or predict from an opinion poll which
candidate will win an election.
In brief, statistical inference is a form of inductive inference that employs probability
to better understand the real-world phenomenon underlying a known data set. It allows
scientists to formulate expectations about what they would observe in a new data set or
in the larger population and to assess how confident they can be about those expectations.

Frequency Distributions and Probability Distributions


The starting point for using inferential statistics is a properly organized data set. Frequency
distributions offer one way to organize a given set of raw data before we can use it to
make predictions about new observations. Frequency distributions are lists that include
every possible value of a variable and the number of times each value of that variable
appears in the data set, often organized into tables—Tables 5.2 and 5.3 in the previous
chapter are examples of frequency distributions of students’ grades. Relative frequency
distributions are frequency distributions that record the proportion of occurrences of the
value of a certain variable instead of the absolute number of occurrences. By using rela-
tive frequency distributions, we record how often different values occur for the variable
under consideration, relative to the total number of values in the data set. For example,
suppose you have a bag containing 35 M&Ms of different colors {brown, red, yellow,
green, blue, orange}. Table 6.1a organizes the data set to display the distribution of the
frequencies of each color. This frequency distribution can be turned into a relative fre-
quency distribution as in Table 6.1b, which displays the proportions of the colors of the
M&M’s in your bag. These proportions can also be put as percentages. Insofar as every
M&M is one of six colors, the six proportions shown in Table 6.1b should sum to one;
that is, the percentages of the different colors add to 35⁄35, or 100%.
Relative frequency distributions can be used to estimate the probability distribution
for the variable—that is, how probable it is for different values to occur in general. For
example, the relative frequency distribution of the colors of M&Ms in your bag can be
used to estimate the probability distribution for the colors of M&Ms in any bag of M&Ms.
Based on your sample bag of M&Ms, you may estimate that if you take a different bag of
M&Ms, open it, and choose one M&M at random, the probability of getting a blue M&M is about 3%. Your estimate may or may not be very good. It may be that distributions of M&M colors are quite similar across bags, or they may deviate quite a bit.

TABLE 6.1a Frequency distribution of a bag of 35 M&Ms

Color    Frequency
Blue       1
Orange     3
Yellow     4
Red        5
Green      5
Brown     17

TABLE 6.1b Relative frequency distribution of a bag of 35 M&Ms

Color    Proportion   Percentage
Blue       1/35         2.86%
Orange     3/35         8.57%
Yellow     4/35        11.43%
Red        5/35        14.29%
Green      5/35        14.29%
Brown     17/35        48.57%

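Turning a frequency distribution into a relative frequency distribution is just a division per value. Here is a minimal Python sketch (our illustration) that reproduces Table 6.1b from the counts in Table 6.1a:

# Frequency distribution from Table 6.1a.
counts = {"Blue": 1, "Orange": 3, "Yellow": 4,
          "Red": 5, "Green": 5, "Brown": 17}

total = sum(counts.values())  # 35 M&Ms in the bag

# Each relative frequency is the count divided by the total;
# the proportions sum to 1, the percentages to 100%.
for color, count in counts.items():
    print(color, f"{count}/{total}", f"{100 * count / total:.2f}%")
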
As this simple example illustrates, probability distributions indicate the probabilities
of occurrence of different values of a random variable within a given range. Probability distributions are the engine of statistical inference. Let’s work up to this idea by first
considering the probability distribution for the simple random variable of a coin toss,
which was our main example from the discussion of probability theory in Chapter 5. In
particular, let’s consider the probability distribution of the number of times heads are
expected to come up over 100 coin tosses.
The range of possible outcomes is 0 to 100. Heads might come up as few as zero times
and up to a maximum of 100 times. In other words, these are the possible values of the
variable heads per 100 coin tosses. In theory, we could calculate the probability of each
outcome in that range using the probability theory developed in the previous chapter.
Notice that Pr(heads = 0) is equivalent to Pr(tails₁ and tails₂ and … tails₁₀₀). If every
coin toss comes up tails, then there are zero instances of heads. What is the probability
of that ever happening, assuming the coin is fair? Recall that the probability of getting
tails on a given throw is always .5, or ½, and that we multiply when calculating the prob-
ability of multiple independent events all occurring. The result would be a really, really
tiny number: ½ × ½ × … × ½ for all 100 throws, or ½¹⁰⁰. This is also the same as the
probability that heads comes up 100 times.
In between 0 and 100, the calculation for the probability of each value of the number
of times heads comes up is much more complicated. We won’t carry out those calcula-
tions, but considering how they would go gives us a sense for how the probability changes
for intermediate numbers of heads. To begin, notice that there is only one way to get
zero heads and only one way to get 100 heads; in the former case, the coin never lands
on heads, and in the latter case, the coin lands on heads 100 times in a row. In contrast,
there are 100 different ways to get heads on only one toss; it might be the first toss or
the second toss or the third toss or the 37th or any other single toss. So, Pr(heads = 1) is
equivalent to Pr[(heads₁ and tails₂ and tails₃ and … tails₁₀₀) or (tails₁ and heads₂ and tails₃
and … tails₁₀₀) or …] and so on, up until the circumstance of getting heads only on the
100th toss. Using our calculation from earlier, and because we add when calculating the
probability of one of several mutually exclusive events occurring, this is ½¹⁰⁰ + ½¹⁰⁰ + … +
½¹⁰⁰, or 100 × ½¹⁰⁰. This is still a really, really tiny number, but it’s 100 times bigger
than the probability of heads coming up no times. Notice also that the same calculation
gives us the probability that heads comes up 99 times. So we’re building our probability
distribution from both ends at the same time.
There are even more ways for heads to come up twice (or 98 times) and even more
ways than that for heads to come up three times (or 97 times). Each time we add another
outcome of heads, the calculation becomes more complicated, and the probability of get-
ting that number of heads increases. Further, the increasing probability of each of these
outcomes isn’t linear; the increase gets bigger each time.
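The counting the text is sketching is exactly what the binomial coefficient captures: the number of distinct ways to get k heads in 100 tosses. Here is a minimal Python sketch (ours) of the resulting probabilities:

import math

def pr_heads(k, n=100):
    # Probability of exactly k heads in n tosses of a fair coin:
    # (number of ways to place k heads among n tosses) * (1/2)^n.
    return math.comb(n, k) * 0.5 ** n

print(pr_heads(0))                   # (1/2)^100: only one way, vanishingly small
print(pr_heads(1))                   # 100 times bigger: 100 possible ways
print(pr_heads(1) == pr_heads(99))   # True: the distribution is symmetric
print(round(pr_heads(50), 3))        # about 0.08: the peak, at 50 heads
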
We already know the distribution is symmetric, since the calculation is the same
whether the number of heads = 0 or 100, whether the number of heads = 1 or 99, and
so forth. The middle of the distribution, the most probable outcome, is thus 50: that you
get heads on 50⁄100, or ½ of the coin tosses. Figure 6.1a shows a histogram of the whole
probability distribution. Notice that the shape of the histogram approximates a bell curve.
With even larger numbers of coin tosses, the distribution becomes closer and closer to a
bell curve. A bell-shaped curve or normal distribution, briefly introduced in Chapter 5, is
a perfectly symmetric, unimodal distribution for continuous variables, like in Figure 6.1b.
Again, this is also called a Gaussian distribution, after the German mathematician Carl
Friedrich Gauss (1777–1855).
Normal distributions—unimodal and symmetric distributions—are especially impor-
tant for statistical reasoning. Like coin tosses, the behavior of random variables over many
repeated, independent trials tends to have a probability distribution that is normal. This
result depends on what is known as the central limit theorem, a statistical theorem stating that samples of a large enough size will have a central tendency approximating that of the population (Le Cam, 1986). As a result, the probability distribution of random variables
is a normal distribution or bell curve. What varies for different random variables is the
central tendency and variability of the normal distribution, which—as we saw in Chapter 5—
can be described with mean and standard deviation. Whereas the mean value of heads

Potochnik, Angela, et al. Recipes for Science : An Introduction to Scientific Methods and Reasoning, Taylor & Francis Group, 2018. ProQuest Ebook
Central, http://ebookcentral.proquest.com/lib/purdue/detail.action?docID=5584122.
Created from purdue on 2021-08-26 19:08:03.
FIGURE 6.1 (a) Probability distribution of heads for 100 coin tosses—the mean of this distribution
is 50 and its standard deviation is 5
Copyright © 2018. Taylor & Francis Group. All rights reserved.

FIGURE 6.1 (b) Example of normal distribution for a continuous variable

Potochnik, Angela, et al. Recipes for Science : An Introduction to Scientific Methods and Reasoning, Taylor & Francis Group, 2018. ProQuest Ebook
Central, http://ebookcentral.proquest.com/lib/purdue/detail.action?docID=5584122.
Created from purdue on 2021-08-26 19:08:03.
212 Statistical Inference

on 100 coin tosses is 50, the mean value of 6s on 100 dice rolls is 16.67 (1⁄6 × 100).
The standard deviation of the proportion of heads, but not its mean, is also influenced by the number of trials. You are more likely to get none or all heads in five coin tosses than in 100 coin tosses; the standard deviation of the proportion is larger for the former.
Probability distributions for coins, dice, and roulettes (so long as they are fair) can be
calculated directly from the probabilities of the individual outcomes, as sketched earlier
for 100 coin tosses. This is not so for variables like success rate with respect to free throws.
This is why the frequency distribution observed in some data set, as characterized by its
mean and standard deviation, is important for many random variables (such as success at
free throws or number of blue M&Ms in a bag). In these and many other cases, the relative
frequency distribution can be used to estimate the probability distribution. The predicted
probability distribution can then connect the frequency distribution of an observed data
set to expectations for some new, relevantly similar data set.

Estimating from a Sample


One important use of inferential statistics is to predict or estimate the value of a feature
or parameter of interest in a population on the basis of a set of observed data concerning a
sample of the population. This is what happens with political polling. The task here is partly
a matter of describing what you already know using statistical tools and partly a matter of
reasoned conjecture to what you want to know. What do you expect of the population (for
example, how the population of US voters will apportion its votes among candidates) given
what you know about the sample (candidate preference among those polled)? Studying a
sample and generalizing from the sample to a population is a very common way to learn
about the features of a population, especially for populations that are very large.
The basic idea behind generalizing from a sample to a population is using the observed
frequency distribution for some feature of the individuals in a sample as the basis for
estimating the probability distribution for the range of values of that feature in the gen-
eral population. We will often use examples of human populations, but the populations
might instead be of different organisms or other entities—tree frogs, nations, marbles,
corporations, and so on. Anything is fair game, so long as it has a variable for which we
can create a frequency distribution.
Measures of central tendency, such as the mean and median, and measures of vari-
ability, such as the standard deviation and variance, familiar from descriptive statistics are
also important measures in statistical estimation. But in this use, those numbers represent
the probability distribution associated with the feature of interest in the population. The
sample mean is the most likely average value of the feature in the population; in other
words, the sample mean is the estimate of the population mean. It is called sample mean
because this estimate is based on the mean of the observed sample. In other less strict
contexts, someone might use the term sample mean to refer to the mean of a sample,
but again, the sample mean is really a prediction about a feature of the population. The
predicted mean might not in fact turn out be the mean value in the population, but it’s
the most likely value and thus our best guess.
Imagine scientists have a sample of 100 university students, and they want to use that
sample to estimate the range of political views among all students at that university. They
might administer a questionnaire to the individuals in the sample, with each individual’s responses scored between 1 and 10, where 1 is most politically conservative and 10 is most politically liberal. Imagine the questionnaire scores are as shown in Table 6.2. From this data, scientists can estimate the mean degree of liberalness (or, equally, conservativeness) in the full population of university students.

TABLE 6.2 Imagined questionnaire scores of 100 university students

Questionnaire Score   Number of Individuals
 1                     0
 2                     2
 3                     5
 4                     7
 5                    10
 6                    15
 7                    22
 8                    18
 9                    13
10                     8

Now, remind yourself of how to calculate the mean. You can do this by adding all
the scores (or, multiplying each score by the number of students who got that score and
adding those up) and then dividing by the total number of students. It turns out that
the mean score is 6.82—college students tend to be a rather liberal bunch, on average.
This score is the mean value of the sample. If the 100 students in the sample have been
appropriately selected—an idea we’ll unpack later—this score is also the sample mean:
it’s most likely to be the average value in the population of university students.
The sample mean plus the individual scores and the sample size n can be used to
calculate the standard deviation. In Chapter 5, we characterized the standard deviation
as a measure of the ‘spread’ of the values of some variable within a data set and defined it as the square root of the variance:

σ = √[∑(value − mean)² / n]

This would be what we would use to find the standard deviation from data about an entire
population. But, with estimation from a sample, scientists have data from a sample only
and not from the entire population. So, they estimate the population standard deviation
from the standard deviation of an observed sample. This estimate is called the sample
standard deviation (s instead of σ) and is calculated in a slightly different way:

s = √[∑(value − mean)² / (n − 1)]

The important change in this formula is n − 1 instead of n, where n is the number of data
points. This is a way to correct for systematic underestimation of the population variance (and thus of the standard deviation).
We won’t ask you to perform this calculation here, but the sample standard deviation of
scores on the questionnaire works out to be 1.98.
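As a continuation of the earlier sketch (again, purely illustrative), you can verify this by expanding the frequency table into individual scores; Python’s statistics.stdev happens to use the n − 1 formula, whereas statistics.pstdev uses plain n:

import statistics

counts = {1: 0, 2: 2, 3: 5, 4: 7, 5: 10, 6: 15, 7: 22, 8: 18, 9: 13, 10: 8}
# One list entry per student: two 2s, five 3s, seven 4s, and so on
scores = [score for score, count in counts.items() for _ in range(count)]

s = statistics.stdev(scores)    # divides by n - 1, as in the formula above
print(round(s, 2))              # 1.98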
Like sample mean, the terminology of sample standard deviation might be confusing.
In descriptive statistics, mean and standard deviation are used simply to summarize the
central tendency and variability of an actual frequency distribution. In inferential statistics,
sample mean and sample standard deviation instead provide estimates of the central ten-
dency and variability of the probability distribution for a random variable. The probability
distribution is the ‘middleman’, enabling a prediction of the characteristics of interest in
the population. So, the sample mean and sample standard deviation are not descriptive
measures of the sample but predictions about the population.
In making these predictions, a helpful rule of thumb for getting a rough probability
estimate of a characteristic of interest is called the 68–95–99.7 rule. This rule can be used
to remember the percentages of values expected to lie within a certain range around the
mean in a normal distribution. It says that about 68%, 95%, and 99.7% of the values lie,
respectively, within one, two, and three standard deviations of the mean (Pukelsheim,
1994) (the other 32%, 5%, and 0.3% being equally scattered on either side of these ranges).
Applying the rule to our example of political views of university students indicates
that any given student at the university has a 68% probability of having a score between
4.84 and 8.80 on our conservative/liberal scale. This is the mean (6.82) ± one standard
deviation, calculated by subtracting and adding the standard deviation (1.98) to the mean
(6.82). Given that 5.00 is the dividing line between liberal and conservative, a student thus has at least a 68% chance of being more liberal than conservative. Any given student
has a 95% probability of having a score of 2.86 to 10.00 (within two standard deviations of the mean). We can be much more confident that some student will fall within this range, but it is also a wider and less informative range. The only thing this tells us is that most (95%) college students are predicted to be outside the most conservative part of the scale.

FIGURE 6.2 Diagram of the 68–95–99.7 rule for standard deviations
Let’s return to the polls leading up to the 2016 US presidential election. The well-
regarded statistical blog FiveThirtyEight determined that of 22 pre-election polls, the
mean gap between the two main candidates was a 5.3% lead for Clinton, with a 3.6%
standard deviation (Enten, 2017). This means, assuming there are no underlying problems
with the polls and no voters changed their minds in the remaining two weeks before the
election, there was a 68% chance that Clinton would get between 1.7% and 8.9% more
votes than Trump. There was a 95% chance that Clinton would get somewhere between
1.9% fewer votes than Trump and 12.5% more votes. Clinton in fact got 2.1% more of
the popular vote than Trump, comfortably within the 68% interval. Nothing about the
election outcome was (statistically) surprising. This oversimplifies the situation a bit, and
there is much more to know about polling and statistical analysis of the 2016 election. But
this gives you a rough sense for how inferential statistics was employed in this context.
There is one more complication we should mention about the sample mean as a
prediction of the population: the sample mean can vary from sample to sample. In our
example of the questionnaire about political views, we sampled a specific 100 students
and found a sample mean of 6.82. If we sampled a different 100 students, we may get
a slightly different sample mean just by sheer chance. We wouldn’t be surprised to get
a sample mean of 6.56 for this new group, but we might be baffled if our new sample
mean were 4.22. Inferential statistics gives us a way to think about these possibilities
as well.
Imagine we repeatedly invite samples of 100 students to take this questionnaire about
political views, and we calculate the sample means for each. This results in the sampling
distribution of the sample mean. The standard deviation of this distribution provides an
estimate of the variation of the sample means. This estimate, the standard deviation of
the sampling distribution of the mean, is called the standard error and is calculated from
the sample standard deviation and sample size as follows:

SE = s / √(sample size)

The standard error is a measure of the precision of the sample mean, or the uncertainty about the estimate of the mean of a population. The standard error, and hence uncer-
tainty about the sample mean, decreases as the sample size increases. This is because a
large sample size helps control for chance variation in the traits of the sample. You can be
more certain about average political views among all students with a sample of 100 than
with a sample of 10.
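This relationship between sample size and standard error is easy to confirm numerically. The short sketch below (illustrative only) reuses the sample standard deviation from the questionnaire example:

import math

s = 1.98    # sample standard deviation from the questionnaire example

for n in (10, 100):
    se = s / math.sqrt(n)    # the standard error shrinks as n grows
    print(f"sample size {n}: standard error {se:.3f}")
# sample size 10: standard error 0.626
# sample size 100: standard error 0.198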

Representative Samples
There’s a lingering issue about estimation from a sample that we’ve brushed under the
rug. Recall that inferences about the sociopolitical views of a student population depended
on whether the sample of students surveyed was ‘appropriately selected’. The issue is that sound estimation via statistical inference requires that the sample be representative: the sample should accurately reflect the target features in the general population.
Samples chosen in ways that make some individuals in a population less (or more) likely
to be included than others will introduce bias in the inferences made about the popula-
tion based on the sample. This bias may lead to incorrect predictions. A poll that only
solicited the views of Republicans couldn’t accurately predict the outcome of a general
election. Similarly, if you are interested in studying the political opinions of the general
population of India, studying only the wealthiest Indian individuals may well lead you
astray, as this group may have some political views that are much less common in India’s
entire population.
Incorrect conclusions resulting from non-representative sampling are called sampling
errors. Here’s a historically significant case of a serious sampling error. In 1936, a magazine,
Literary Digest, sent out 10 million postcards asking Americans how they would vote in
that year’s presidential election. They received almost 2.3 million back, which is a very
large sample. In that sample, Alfred Landon had a decisive lead over Franklin Roosevelt:
57% to 43%. The Digest did not gather information that would allow it to judge the rep-
resentativeness of its sample. A young pollster, George Gallup, estimated from a much
smaller sample of 50,000 (which is still larger than most modern political polls). His
sample was representative, and it predicted Roosevelt winning by a landslide. That was,
of course, the eventual outcome of the election. The Literary Digest closed down soon
after, and Gallup’s name lives on in the well-known Gallup poll approach to measuring
public opinion based on surveying a sample (Squire, 1988).
The requirement of representativeness was discussed in Chapter 2 in the context of
forming experimental and control groups with similar ranges of values for any extrane-
ous (or confounding) variables. Also discussed there was how representative groups can
be achieved by random assignment to groups. The similar step for statistical inference
is called random sampling, where the individuals composing the sample are selected
randomly from the population. This protects against bias. Our discussion about estima-
tion has presumed random sampling. In Chapter 2 we also discussed the importance of
sample size for representativeness. The upshot was that larger samples can be expected to
be more representative than smaller samples, helping to control for the unwanted effects
of possible confounds. As we saw in the above discussion of standard error, the tools of
statistical inference enable us to explicitly take this into consideration.
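In practice, random sampling is usually delegated to a pseudorandom number generator. The following sketch uses Python’s random.sample to draw 100 students from a hypothetical roster of 10,000 (both the roster and its size are invented for illustration); because every individual is equally likely to be drawn, this guards against the selection biases discussed above:

import random

# Hypothetical roster of 10,000 student IDs (invented for illustration)
population = [f"student_{i}" for i in range(1, 10001)]

# Sampling without replacement: each student has the same chance of selection
sample = random.sample(population, 100)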

How Estimation Can Go Wrong


Statistical analysis presumes random sampling, but truly random sampling is difficult to
accomplish in many circumstances. For example, in a telephone poll of voter preference
prior to an election, the phone numbers dialed can be randomly selected. But who picks
up the phone, whether a person hangs up immediately or answers the questions, and even
who has a phone and who doesn’t are all non-random influences on the people sampled.
And those influences may be confounding variables, since they might correlate with voter
preferences. As we’ve seen, these are called sampling errors.
There’s evidence of sampling error in polls leading up to the 2016 US presidential
election. Here’s one example. In general, Trump did particularly well with white voters
without college degrees. The FiveThirtyEight blog showed that Trump performed better than the polls predicted in states with a greater concentration of voters in that demo-
graphic. This would seem to suggest that white voters without college degrees were under-
represented in pre-election polls. FiveThirtyEight also showed how Clinton did significantly
worse in the Midwest than polls predicted. This might in part be due to sampling error,
such as the underrepresentation of white voters without college degrees in the polls. But
there are also other factors. For one thing, election results in midwestern states are highly
correlated across those states. So, a statistical error affecting one of those states is likely
to impact the others, magnifying its effect on the election.
Using statistical inference to estimate features of a population from existing data about
a sample is a powerful extension of statistical description. We have characterized some
of the central features and uses of this form of reasoning. As with much else in science,
there are ideal methods that can be described in the abstract—in this case, involving large,
random samples—and then there are the real-world considerations that regularly lead to
deviations from these ideals. The tools of inferential statistics can also help us in assessing
the impact of such deviations.

EXERCISES

6.1 Define frequency distribution, and describe in your own words how mean and standard
deviation in descriptive statistics relate to frequency distributions. Then, define probabil-
ity distribution, and describe in your own words how inferential statistics makes use of
probability distributions, including how mean and sample standard deviation relate.
6.2 In class, or with a group of classmates, find a coin for each person and carry out
the following steps, with each person recording the answers individually. To instead
do this exercise individually, perform multiple series of four coin tosses on your own,
recording the outcomes of each series of tosses separately.
a. Determine your expectations about how frequently any given coin will land heads
up. How many heads would you expect on four coin tosses?
b. Each person should toss his or her coin four times, recording each result as either
heads or tails. Summarize your individual result as the ratio of heads to tails. This
will be either 0:4, 1:3, 2:2, 3:1, or 4:0.
c. Record how many people in your group got each of the possible ratios: 0:4, 1:3,
2:2, 3:1, and 4:0. Draw a histogram showing these results.
d. Some people didn’t get a ratio of 2:2. Why not?


e. Should we expect each person to get the same ratio of heads to tails on the next
four coin tosses as he or she did on the first four? Say why or why not, appealing
to concepts from Chapter 5 and Chapter 6 as needed.
f. Each person should toss his or her coin four more times, recording each result
and summarizing as a new ratio of heads to tails.
g. Add these ratios to the data set of the first round of coin tosses. There should now
be twice as many series of four coin tosses. Draw a new histogram describing the
complete data set.
h. Compare the histogram from (c) to the histogram from (g). What has changed?
Why?


6.3 Convert the following data about the outcomes of 11 rolls of a die to a relative fre-
quency distribution:
First roll: 4; second roll: 3; third roll: 4; fourth roll: 6; fifth roll: 2; sixth roll: 1; seventh
roll: 6; eighth roll: 4; ninth roll: 5; 10th roll: 5; 11th roll: 6
Now, draw a histogram showing this relative frequency distribution.
6.4 From what you know about probabilities for fair die rolls, construct a histogram show-
ing the probability distribution for how many times a four is rolled on five die throws.
Show the calculations you used to generate the histogram (which will involve the ad-
dition and multiplication rules of probability).
6.5 Draw a histogram depicting a normal distribution. Draw a small line on the x-axis
where the mean is. Then, draw a second histogram depicting a non-normal distribu-
tion. Describe what feature(s) of the non-normal distribution distinguish it from the
normal distribution.
6.6 Figure 6.3 shows four histograms depicting normal distributions and four pairs of
means and standard deviations. Match each histogram with the mean and standard
deviation it depicts, and briefly explain your choices.
1. mean = 4, standard deviation = 2
2. mean = 4, standard deviation = 0.5
3. mean = 23, standard deviation = 10
4. mean = 50, standard deviation = 5
6.7 Consider each of the inferences 1–4, then answer the following questions (a–d) about each.
1. Almost all Italian football players are good, so those two Italian football players
are probably good.
2. All Italian football players I’ve seen have been good, and I’ve seen at least 10.
So the next Italian football player I see will be good.
3. Approximately 12.4% of women will be diagnosed with breast cancer sometime
during their lifetimes. Of a group of 100 randomly selected women, it’s likely that
approximately 12 will develop breast cancer at some point during their lives.
4. Among your classmates, 89% have seen the most recent Tarantino movie. So
almost all people in town must have seen that movie.
a. Is the inference from sample to population, from population to sample, or
from sample to sample?
b. What is the exact conclusion of the inference?
c. Describe the sample size and representativeness of the sample.
d. Assess the quality of each inference, attending to the strength of the conclu-
sion, the sample size, and how representative the sample seems to be.
6.8 There are 3,000 people at a party. (It’s a very large party!) 100 are interviewed at
random, and it is discovered that 80 are philosophers, 10 are geologists, and 10 are
artists. The sample standard deviation is ±12%.
a. What’s the percentage of philosophers in this sample of 100 party guests?
b. What’s the probability that the percentage of philosophers at the party is in the
range of 68–92%? (Hint: consult the discussion about the 68–95–99.7 rule.)
c. Within what range does the percentage of philosophers lie with 95% probabil-
ity? How about with 99.7% probability?

FIGURE 6.3 (a, b, c, d) Four histograms of roughly normal distributions

6.9 Consider the following statements in light of the data provided in Exercise 6.8. For
each, say whether the data support the conclusion. Describe your reasoning for
each answer with reference to the information provided and what you know about
statistical estimation.
a. It’s highly probable that the majority of party guests are philosophers.
b. Eighty percent of the people at the party are philosophers.
c. It’s more likely than not that at least 8% of the guests are non-philosophers.
d. It’s highly likely that the geologists are outnumbered at this party.
e. It’s highly probable that most people in the world are philosophers.
6.10 Describe in your own words how statistical methods are used to make estimates
about a population from sample data and some ways in which this can go wrong.
Come up with a simple example to illustrate these ideas.
6.11 Find an article in a newspaper, magazine, or reputable online source that draws
conclusions from a poll. Alternatively, your instructor may provide one article for the
whole class to use for this exercise. Answer the following questions; if you can’t find
the answer, say so, and provide your best guess if possible. If you selected your own
article, please submit a copy or printout of it with your responses.
a. What variable was under investigation? What were the researchers interested
to know?
b. What was the sample size? How was the sample selected?
c. Is the sample likely to be representative? Why or why not?
d. What data did the researchers collect about the sample?
e. What conclusions about the population did the researchers draw from the sample?
f. Assess the poll, the results, and the researchers’ conclusions. Are there any
problems with any of these? How could the poll or the conclusions be improved?

6.2 USING STATISTICS TO TEST HYPOTHESES

After reading this section, you should be able to do the following:

• Describe the steps of statistical hypothesis-testing


• Determine levels of significance and compare those with data to decide whether to
reject the null hypothesis, supporting your reasoning with statistical considerations
• Define statistical significance, p-value, type I and type II errors, and effect size, and analyze an example with regard to these features

Statistical Methods and Scientific Discovery


In 2012, scientists at CERN (the European Organization for Nuclear Research) in Geneva,
Switzerland, proclaimed the discovery of a new boson, a very tiny particle whose existence
is crucial to our understanding of the fundamental structure of matter. Using CERN’s
Large Hadron Collider (LHC), a 27-kilometer (17-mile) tunnel straddling the Swiss-
French border, scientists had been repeatedly observing the outcomes of collisions among protons (one kind of particle in the nucleus of atoms). These collisions produce a shower
of new particles, most of which are unstable and decay into other particles in a tiny fraction
of a second. Decay products can give insight into the particles that had been created,
as each particle type has its own signature decay products. In particular, the trajectory,
energy, and momentum of these new particles can be detected in this way. Detecting
their mass is particularly relevant to distinguishing between different types of particles.
However, because subatomic particles are so small, it’s a challenge to distinguish the
signature properties of new particles from background events.
In the summer of 2012, scientists recorded a ‘bump’ in their data corresponding to
a particle with a mass between 125 and 127 GeV/c² (one GeV/c² is 1.783 × 10⁻²⁷ kg). This is about 133 times heavier than protons. It was thought that this recorded
‘bump’ could provide evidence of a new particle—perhaps of the long-sought Higgs
boson. The data were consistent with hypothesized properties of a boson, but consistency
alone is not strong enough justification. Using statistical reasoning, the scientists calcu-
lated that this bump would occur by chance, emerging from only background events in
the collider without the presence of a boson, only once in three million trials. So, the
scientists rejected the idea that the bump occurred by chance. Instead, they concluded
that the data indicated the discovery of the Higgs boson.
Scientists all over the world were thrilled with this news. Its discovery could lend
additional support to the ‘Standard Model’ of particle physics. The discovery of a Higgs boson indicates the existence of the Higgs field, an invisible energy field postulated by the Standard Model as present throughout the universe and the source of other particles’ mass. The hypothesized Higgs boson was supposed to be like the glue of the universe—what joined everything together and gave it mass. And now, it seems, this hypothesis was tested and confirmed (Chatrchyan et al., 2012).

FIGURE 6.4 Fabiola Gianotti, project leader and spokesperson for the ATLAS experiment at CERN involved in the discovery of the Higgs boson in July 2012
CERN Creative Commons http://cds.cern.ch/record/1326962
The groundbreaking discovery of the Higgs boson is just one example illustrating
how fundamental statistical reasoning is to scientific inquiry. Inferential statistics lever-
ages probability theory to enable sophisticated forms of inductive inference. In the last
section, we discussed its use in estimation. Another primary use is in hypothesis-testing,
that is, in deciding whether the available evidence confirms or disconfirms a hypothesis.
This was the form of statistical inference used when scientists rejected the possibility that
the bump in data was due simply to chance and instead posited the presence of a Higgs
boson as the reason those data were observed.

The Steps of Statistical Hypothesis-Testing


Good statistical reasoning requires informed judgment about which hypotheses to test,
which statistical methods to use, and what conclusions to draw from the findings.
The first step is to formulate two competing statistical hypotheses (or statistical models).
One is called the null hypothesis, which is a kind of default assumption—a reasonable expec-
tation about how the world is. For the scientists at CERN, this was the idea that a Higgs
boson was not responsible for the bump in their data or—put another way—that the bump
in data was generated by background chance encounters among known subatomic particles.
The other hypothesis is called the alternative hypothesis, since it is posited as an alternative
to the default assumption and is the speculative hypothesis of focal interest. This will be
a bold and risky conjecture, the value of which we have periodically emphasized. For the
scientists at CERN, this alternative was the hypothesis that a Higgs boson, a particle never
previously detected, was responsible for the ‘bump’ in their data. The null hypothesis is usu-
ally just a negation of the alternative hypothesis—the idea that the bold conjecture isn’t true.
The null hypothesis and alternative hypothesis are related in this way in the hopes of
performing the following kind of inference. The null hypothesis—the default expectation—
leads one to expect a certain range of possible outcomes. When the actual data collected (by
experiment or observational study) are pretty far outside that range, scientists can reason
that such data would be overwhelmingly unlikely if the null hypothesis were true. And so,
the data provide grounds for rejecting the null hypothesis. The data instead support the alternative hypothesis—the bold conjecture.


This is basically a statistical version of the H-D method encountered in Chapter 4:
expectations are derived from a hypothesis, observations are made, and if the observations
don’t match the expectations, the hypothesis is disconfirmed. In statistical hypothesis-
testing, expectations regard the likely values of a random variable given that the null
hypothesis is true. If a value far enough outside the expected range is observed, then the
null hypothesis is refuted, and, in its place, the alternative hypothesis can be tentatively
accepted. When scientists fail to reject the null hypothesis, their test results are deemed
inconclusive; this doesn’t show that the alternative hypothesis is wrong.
Let’s look more closely at this reasoning process. The second step, after formulating the
alternative hypothesis and the null hypothesis, is to develop the expectations one should have if the null hypothesis were true. This is where statistics comes in. Inferential
statistics can be used to generate a probability distribution for possible outcomes on
the basis of the null hypothesis. The scientists at CERN set out a protocol for sta-
tistical analysis before gathering any data that specified, for any bump in data they
might observe, how probable it would be given the null hypothesis that no boson
was responsible.
The third step is to gather data about the outcome of the random variable in question
using experiment or observation. These data can be used to evaluate whether and to what
degree the observed data violate the probabilistic expectations based on the null hypoth-
esis. The CERN scientists observed an unexpected bump in data, and they used their
statistical analysis protocol to determine that the probability of this observation would
be extremely low (about one in three million) if the null hypothesis were true—that is,
if no boson were responsible.
So, probabilistic expectations are developed from the null hypothesis with the use of
inferential statistics, and actual data are compared with those expectations. The fourth
and final step of statistical hypothesis-testing is to draw a conclusion from that com-
parison. This final step is always a judgment call. Scientists have to decide how unlikely
the data should be, given the null hypothesis, before they have sufficient grounds to
reject the null hypothesis. If the data are not too far from what is expected given the
null hypothesis, then scientists have no reason to reject the null hypothesis, the default
expectation, in favor of the alternative hypothesis. If, however, the observations do violate
expectations, then this provides a reason to reject the null hypothesis in favor of the
alternative hypothesis. This was the exciting scenario encountered at CERN: faced with
a bump in data that would have been exceedingly unlikely without a boson responsible
for it, the scientists rejected the null hypothesis and declared that they had evidence
for the alternative hypothesis. That is, they declared that they had discovered the long-
sought Higgs boson.
These steps of statistical hypothesis-testing are summarized in Table 6.3. There, we have emphasized how these steps conform to the basic recipe of developing a hypothesis, generating expectations, comparing those to observations, and drawing a conclusion outlined in Chapter 1. As we have discussed, there are also parallels between statistical hypothesis-testing and the H-D method outlined in Chapter 4.

TABLE 6.3 Summary of statistical hypothesis-testing and its relationship to general hypothesis-testing

Step           Procedure
Hypothesis     Formulate alternative hypothesis (the bold conjecture) and corresponding null hypothesis (the default expectation)
Expectations   Determine probability distribution for the range of outcomes expected if the null hypothesis is true
Observations   By experiment or observational study, determine one or more actual outcomes
Conclusion     Evaluate whether the actual outcome is unlikely enough given expectations from the null hypothesis to provide grounds for rejecting the null hypothesis

More on Step 2: Developing a Probability Distribution


But let’s back up. How did CERN scientists determine that the probability of the data
they collected was so minuscule if the null hypothesis were true? That question is at the
heart of statistical hypothesis-testing.
To answer this question, let’s turn our attention to a classic experiment on the
tasting of tea, as witnessed by Ronald Fisher (1890–1962), a British geneticist and
one of the designers of classical statistics (see Fisher, 1956). Imagine a friend of
yours asks you, when preparing tea for her, to add milk to the cup first and only
then add the tea. She claims she can discriminate by taste the order in which the
milk and the tea were poured into the cup—she thinks the tea tastes better when
the milk is added first.
Intrigued, you decide to test this claim. According to the steps outlined in the
previous section, summarized in Table 6.3, you should start by formulating the null
and alternative hypotheses. In this circumstance, what would you choose for each of
these? The bold and speculative conjecture is that your friend really can discern by
taste the order in which ingredients were poured into a cup. This would be surprising!
The null hypothesis is simply that your friend cannot do this. This is, probably, your
default assumption.
You set up to run an experiment on your friend. You prepare a cup of tea out of
view, tossing a fair coin to determine whether you pour tea or milk first. (This random-
izes the order, which is a way to control for extraneous variables; recall the discussion
of randomization in Chapter 2.) Then, your friend has a sip. Immediately, she declares
that you added the tea first. She’s right, but this isn’t terribly impressive. After all,
there was a .5 probability of her guessing correctly by accident, since either the tea was
poured first in the cup or the milk was, and because of the coin toss, you were equally likely to do each. So, on to the second cup of tea. Your friend guesses correctly again, and this is only slightly more impressive. The probability of guessing correctly purely by chance for both cups is the probability of a correct guess for the first and a correct guess for the second, so .5 × .5, or .25 (by the multiplication rule for probabilities introduced in Chapter 5).

FIGURE 6.5 R. A. Fisher
Notice that we have been describing steps 2 and 3 from Table 6.3 in tandem: first, we’ve
described an observation (your friend’s guess), and then we’ve determined the expecta-
tion given the null hypothesis by calculating the probability of obtaining the observation
if your friend is merely guessing at random. So far, the observations conform perfectly
to the alternative hypothesis that your friend really can tell the order the tea and milk
were added—she has guessed right twice in a row. But this isn’t sufficiently unlikely to
rule out the null hypothesis that she’s merely guessing at random.
At this point, you decide to make the test more rigorous by finding a way to make it
very unlikely for your friend to guess correctly by chance. You prepare eight new cups
of tea at once, tossing a coin to determine milk-first or tea-first for each. You put the
cups of tea in front of your friend and ask her to say of each whether the milk or the
tea was added first.
What does the null hypothesis lead us to expect? If your friend is merely guessing
at random, she is most likely to be right about four of the eight cups. This is the mean
expected outcome. The way to think about the calculation of this is the number of trials
multiplied by the probability of success on each trial:

mean = # trials × Pr(O=success) = 8 × .5 = 4

In this context, the mean is the most likely outcome. If your friend were to make repeated
guesses about the eight cups, and the null hypothesis were true, the most common out-
come would be for her to be right about four cups and wrong about the other four. Since
your friend is only making guesses about one series of eight cups, the mean indicates the
most likely outcome. But this outcome is not assured, even if the null hypothesis were
true. By sheer luck, she might still guess correctly more or less often, just as you might
happen to get more or fewer heads over a series of coin tosses.
Just as we can determine the probability distribution for different numbers of heads
over 100 coin tosses, we can also determine the probability distribution for different
numbers of correct guesses out of eight trials on the null hypothesis that your friend has
no exceptional tea-tasting discriminatory ability. The probability of each number of cor-
Copyright © 2018. Taylor & Francis Group. All rights reserved.

rect guesses can be calculated based on the .5 probability of success on each guess and
the number of different ways to get each number of correct guesses. (There’s only one
way to guess none or all eight correctly, but eight different ways to guess one or all but
one correctly, 28 ways to guess two or all but two correctly, and so on.) This probability
distribution is shown in Figure 6.6. It indicates what data we should expect if your friend
is guessing randomly.

FIGURE 6.6 Probability distribution of the number of guesses your friend will get correct if she is randomly guessing
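The counts of ways just described are binomial coefficients, so the whole distribution in Figure 6.6 can be generated in a few lines. Here is a minimal Python sketch; math.comb counts the ways of choosing k cups from eight:

from math import comb

n, p = 8, 0.5    # eight cups; .5 chance of a correct guess under the null hypothesis

for k in range(n + 1):
    # Binomial probability of exactly k correct guesses out of n
    prob = comb(n, k) * p**k * (1 - p)**(n - k)
    print(f"{k} correct guesses: {prob:.4f}")
# 0 or 8 correct each have probability .0039; 4 correct is most likely, at .2734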
But instead of calculating the probability distribution in this way, we could instead
simply find the mean outcome and standard deviation of the probability distribution. In
many instances of statistical hypothesis-testing, this will be the only feasible approach to
developing the probability distribution. We already know the mean outcome. Your friend
is most likely to have four correct guesses out of eight if she is guessing randomly. The standard deviation for the probability distribution can be calculated using the following
formula:

σ = √[mean × (1 − Pr(O=success))]

For the tea-tasting experiment, the standard deviation is √[4 × (1 − .5)] = √2 = 1.414.
Notice that this standard deviation formula is very different from the other standard
deviation formulas we have encountered in this chapter and Chapter 5. In this case, the
task is to calculate the probability distribution of getting the outcome or set of outcomes
you’re interested in, ranging from that happening in no trials to it happening in every trial,
assuming the null hypothesis of chance outcomes (true if your friend is merely guessing).
In the formula, 1 is the total probability, and Pr(O = success) is the probability of success
(here, guessing correctly) in a single trial. Multiplying that by the mean number of suc-
cesses yields the variance; the square root of that number is, then, the standard deviation.
Now, armed with the probability distribution or simply its mean and standard deviation,
we can establish the significance level we’ll require as the line for rejecting the null hypothesis.
This is a decision about how improbable, given the null hypothesis, an experimental result
must be to warrant rejecting the null hypothesis. Different levels of significance can be used
in hypothesis-testing, but .05 is the most common choice. This is a convenient choice because
of the 68–95–99.7 rule: there’s a .95 probability of your friend guessing correctly within two
standard deviations of the mean if she’s guessing randomly. So, there’s only a .05 chance of
Copyright © 2018. Taylor & Francis Group. All rights reserved.

her guessing outside that range if she’s guessing randomly. Another convenient choice would
be to set .003 as the significance level, which corresponds to getting a result outside three
standard deviations of the mean value. We’ll say more about the decision of what significance
level to require later. For now, let’s stick with the customary threshold of .05.
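For the tea-tasting experiment, the mean, the standard deviation, and the rejection region implied by the .05 significance level can be computed directly; this illustrative Python sketch just restates the formulas above:

import math

n_trials, p = 8, 0.5
mean = n_trials * p                # 4 correct guesses expected under the null
sd = math.sqrt(mean * (1 - p))     # 1.414

# At the .05 level, reject outcomes more than two standard deviations from the mean
low, high = mean - 2 * sd, mean + 2 * sd
print(f"reject the null hypothesis outside {low:.3f} to {high:.3f}")
# reject the null hypothesis outside 1.172 to 6.828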

More on Steps 3–4: Assessing the Statistical Significance of Observations
With all of this preparatory statistical work completed, you are now ready to test your
friend’s tea-tasting skills. This is the third step of statistical hypothesis-testing: actually

Potochnik, Angela, et al. Recipes for Science : An Introduction to Scientific Methods and Reasoning, Taylor & Francis Group, 2018. ProQuest Ebook
Central, http://ebookcentral.proquest.com/lib/purdue/detail.action?docID=5584122.
Created from purdue on 2021-08-26 19:08:03.
228 Statistical Inference

collecting the data. You ask your friend to judge ‘tea-first’ or ‘milk-first’ for all eight cups
of tea. She correctly judges all eight cups! Given this data, would you be tempted to
conclude that you were wrong and your friend was right, that maybe she can perceptually
discern something about tea and milk order? Deciding this is the fourth step of statistical
hypothesis-testing.
There are two possibilities: (1) the null hypothesis is true, which means that your friend
has no exceptional ability, or (2) the null hypothesis is false, and the alternative hypoth-
esis that your friend can discriminate between milk-first and tea-first cups of tea is true.
The goal of statistical hypothesis-testing is to do our best to decide whether (1) should
be rejected in favor of (2). This is the question of whether you should believe that your
friend can actually distinguish the order tea and milk were added to a cup.
From the probability distribution developed in step 2 and the outcome of your experi-
ment, when your friend correctly judged all eight cups of tea as tea-first or milk-first, we
can calculate a p-value. This is the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true. The smaller the p-value, the more unlikely the data if the null hypoth-
esis is true. Whether one should reject the null hypothesis is determined by comparing the
p-value and the significance level selected in step two. If the p-value is less than or equal
to the significance level, we can reject the null hypothesis with reasonable confidence. If
the p-value is greater than the significance level, we can’t rule out the null hypothesis.
In the tea-tasting experiment, there are actually two different ways of establishing how
the p-value relates to the significance level. One way is to consult the probability distribu-
tion shown in Figure 6.6. We can see from that figure that there is only a very small chance
of guessing all eight cups correctly via random guesswork, so the p-value of our data is very
low. The precise p-value can be calculated from the probability of guessing all eight cups
correctly. This is easy to find with the multiplication rule; it’s just the probability of the
first guess being correct and the probability of the second guess being correct, and so on,
so it’s 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5 × 0.5, or 0.5⁸, which is .0039. The p-value,
the probability of your friend guessing correctly on all eight cups by random guesswork,
is only .0039. This means there’s only a 0.39% chance of this happening.
Put another way, if your friend tasted many series of eight cups of tea, she could get this outcome by guessing randomly only about once in every 256 series. If she guessed all eight correctly by sheer luck, on her first try, she’s really lucky!
And, indeed, the p-value of .0039 is lower than our chosen significance level of .05. The
outcome of this experiment is thus statistically significant: unlikely enough if the null
hypothesis were true that it provides grounds for rejecting the null hypothesis.
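The whole comparison reduces to a couple of lines of arithmetic, as this illustrative sketch shows:

p_value = 0.5 ** 8    # probability of eight correct guesses by chance: .0039
alpha = 0.05          # the significance level chosen in advance

if p_value <= alpha:
    print("statistically significant: reject the null hypothesis")
else:
    print("not significant: the test is inconclusive")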
The second way to establish how the p-value relates to the significance level for the
tea-tasting experiment is by using the mean and standard deviation rather than the prob-
ability distribution. By the 68–95–99.7 rule, outcomes that are two standard deviations
away from the mean are the threshold for statistical significance at the significance level
of .05. So, we can check to see whether the observed value is inside or outside that line.
Two standard deviations in this case is 2.828 (or 1.414 × 2), so outcomes outside the
range of 4 (the mean) ± 2.828 are statistically significant. That range is 1.172 to 6.828.
If your friend guessed zero or one cups correctly or seven or eight cups correctly, this
would be grounds for rejecting the null hypothesis. Since she in fact guessed all eight
cups correctly, you should reject the null hypothesis. It looks like you need to believe in
your friend’s tea-tasting superpower!


We’ve compared the p-value of the experimental outcome to the significance level
using both the probability distribution and the mean and standard deviation, but either
one of these is enough on its own. This comparison provides a simple way to decide
whether to reject the null hypothesis on the basis of the data. Nonetheless, as we remarked
earlier, there is a role for choice in exactly what level of statistical significance to require.
One can always ask whether this outcome is unlikely enough to reject the null hypothesis.
This is a version of the more general decision we’ve seen elsewhere in this book regarding
when there is sufficient evidence to accept some hypothesis.
As we have said, it’s common to draw the line at a significance level of .05. Observed
results with a probability of less than .05 given the null hypothesis are said to be sta-
tistically significant at the .05 level. One can abbreviate this: p < .05. This is, of course,
true of the outcome of our tea-tasting experiment, which is why we rejected the null
hypothesis. Notice that we could still be wrong; it’s always possible that our friend really
was just extraordinarily lucky. But if we instead decided to play it safe and not reject
the null hypothesis, we could be wrong about that as well. We might have then failed to
detect our friend’s tea-tasting superpower. By its very nature, statistical hypothesis-testing
gives no guarantees.
The risk of erroneously rejecting the null hypothesis when it is true is called a type I
error. The risk of erroneously failing to reject the null hypothesis when it is false is called
a type II error. These are the two different ways you could be wrong, and one or the
other is always a risk. The choice of significance level indicates the degree to which you’re
willing to accept the risk of a type I error versus a type II error. Requiring a more stringent significance level—that is, requiring a lower probability for statistical significance—reduces the chance of a type I error, but it simultaneously increases the chance of a type II error. Relaxing the significance level to a less extreme value reduces the chance of a type II error, but it increases the chance of a type I error. So, whatever significance level you
settle upon, you might draw the wrong conclusion from your data (cf., Benjamin et al.
2018; Lakens et al. 2018).
Scientists sometimes adjust the conventional .05 line for statistical significance in light
of whether type I or type II errors are riskier. Imagine a new drug is being tested. If the
drug is for a life-threatening illness with no treatment options otherwise—say, pancreatic
cancer or Ebola—and experiments regarding the efficacy of the drug find it works better
than a placebo with a p-value of .055, just missing the line for statistical significance,
researchers may still be inclined to bring the drug to market or at least continue testing.
In contrast, if scientists are thinking about announcing a new particle, and they know their colleagues will scrutinize their findings, they may require much greater statistical
significance. Recall that the Higgs boson discovery was announced after the finding that
the probability of the data observed was only one in three million without a boson pres-
ent. This significance level is so close to zero it’s difficult to even display numerically; it
just rounds down to zero.

Considerations in Designing Statistical Tests


We have described a simple form of statistical hypothesis-testing here, but this discussion
has just scratched the surface. There’s significant variety and complication to inferential
statistics that is beyond the scope of this book. For example, there are variations on this approach to statistical hypothesis-testing and different forms of statistical tests entirely. There are also different ways of approaching the decision of whether to reject the null
hypothesis. There are also bigger issues lurking below the surface; we’ll discuss the major
contender against this approach to statistical reasoning in the next section. But for now,
we’ll finish this discussion of statistical hypothesis-testing with a few important consid-
erations about designing statistical tests.
To begin, notice that statistically significant results aren’t necessarily important or
substantive results. Statistical significance just means statistically unlikely enough to
reject the null hypothesis. Statistically significant results might nonetheless be theo-
retically uninteresting and practically irrelevant. You could test whether human sub-
jects can reliably distinguish humans from dogs when passing them on the street. If
your observations are statistically significant, you should reject the null hypothesis that
humans cannot distinguish between members of these two species. But you wouldn’t
be surprised. Scientists and members of the public alike should be on the lookout for
statistically significant findings that are nonetheless not worth our attention (see also
Gelman & Stern, 2006).
Another feature of data that is wholly distinct from statistical significance is effect size,
the magnitude of difference some variable makes. Effect size can be reflected in different
measurements, such as the mean difference in some feature between two groups or the
correlation between two variables. The important point here is that a statistically signifi-
cant finding may nonetheless identify a factor with only a small effect size. Sometimes
scientists represent the effect size along with the statistical significance of their finding,
and this additional information can help put the finding into perspective.
Statistical tests vary in their power, that is, in the probability that they will enable the rejection of a false null hypothesis. More powerful tests increase the chance of rejecting
false null hypotheses and thus decrease the chance of type II errors. Power increases with
sample size. In the tea-tasting experiment, we weren’t able to reject the null hypothesis
after two cups were guessed correctly, but we were able to after eight cups were guessed
correctly. Increasing sample size to increase the power of a statistical test can be a good
thing, as in the tea-tasting experiment. But it also has a downside: this increases the chance
of making a type I error. Studying a very large sample of individuals makes it relatively
easy to uncover statistically significant findings, but it also makes it relatively easy to
erroneously reject the null hypothesis—that is, to declare incorrect findings.
Statistical tests that increase power by using very large samples also enable a very small
effect size to qualify as statistically significant. For example, a certain gene has been linked with the chance of someone smoking cigarettes (Thorgeirsson et al., 2010). If you have
this gene, you are more likely to smoke cigarettes. Can researchers tell from your genes
whether you have smoked, or will smoke? As the researchers acknowledged, absolutely
not: there was only a very weak relationship. This, too, can be an advantage but also a
drawback. It can be useful to detect subtle statistical relationships, but weak statistical
relationships are often uninteresting or unimportant. It’s also possible to take too seriously
a statistical relationship that actually has a very small effect size.
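To see numerically how a very large sample can make a tiny effect statistically significant, consider a hedged sketch with invented numbers: suppose some behavior occurs in 52% of trials when the null hypothesis predicts 50%, a small effect size. The z-score, the number of standard deviations separating the observed count from the expected count, grows with the square root of the sample size:

import math

# Invented numbers: a true rate of 52% against a null hypothesis of 50%
for n in (100, 10000):
    observed = 0.52 * n
    expected = 0.50 * n
    sd = math.sqrt(n * 0.5 * 0.5)    # standard deviation under the null
    z = (observed - expected) / sd
    verdict = "significant" if abs(z) > 2 else "not significant"
    print(f"n = {n}: z = {z:.1f} ({verdict} at the .05 level)")
# n = 100: z = 0.4 (not significant at the .05 level)
# n = 10000: z = 4.0 (significant at the .05 level)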
Armed with this account of statistical hypothesis-testing and taking into consideration
features like effect size and sample size, you will be better able to critically assess sci-
entific findings based on statistical reasoning, and to distinguish the truly surprising and
important discoveries from the confused and inconsequential results.


EXERCISES

6.12 In your own words, define the terms statistical significance, p-value, type I error, type
II error, and effect size.
6.13 (a) Give an example of a finding that is or would be statistically significant but is
not important. (b) Give an example of a factor that you expect would have a small
effect size.
6.14 Consider a new version of the tea-tasting experiment. Suppose that your friend
samples 10 cups of tea, among which five had the tea poured first and five had the
milk poured first.
a. Calculate the mean outcome and the standard deviation, showing your work.
b. Find the range that defines the .05 significance level.
c. Imagine your friend correctly identified exactly three of the cups. Is the p-value of this outcome smaller or larger than the .05 significance level?
d. Should you reject the null hypothesis? Why or why not?
e. Do you risk making a type I error or a type II error? Why?
6.15 It’s estimated that 10% of the general population is left-handed. Imagine testing whether
your group of friends contains an unusually large number of left-handed people. Let’s
say that, in this age of social media, you have 75 friends, 14 of whom are lefties.
a. Write out your null hypothesis and alternative hypothesis, making clear which
is which.
b. Calculate the mean and standard deviation for how many of your group of
friends would be expected to be lefties if the null hypothesis were true.
c. From the information you calculated in (b), set an appropriate significance level
for your test and say why you chose that level.
d. Based on the information you calculated in (b) and the significance level you set
in (c), decide whether to reject the null hypothesis, and justify your decision with
statistical considerations, including citing p-value and/or statistical significance.
6.16 You find yourself wondering whether left-handed people have an unusually high
chance of accidental injury. Recall from Exercise 6.15 that 14 of your 75 friends are
left-handed. Of those 14 left-handed friends, seven have been injured in accidents
of one kind or another in the past two years.
a. Write out your null hypothesis and alternative hypothesis, making clear which
is which.
b. What is your sample size for testing this hypothesis?


c. Let’s say that 36% of the general population suffers accidental injury in any
given two-year period. What is the expected mean and standard deviation for
your sample if the null hypothesis is true?
d. Set an appropriate significance level for your test and say why you chose that
level.
e. Decide whether to reject the null hypothesis, and justify your decision with sta-
tistical considerations, including citing p-value and/or statistical significance.
f. Consider the possibility that your test led to a type I or type II error. Which error
are you concerned about? What feature(s) of the test might increase the chance
of this error?


6.17 Each of the following is a bold conjecture that can serve as an alternative hypoth-
esis. For each, (a) formulate the null hypothesis, (b) describe what a type I error
would be and what a type II error would be, and (c) say which kind of error would
be more serious and why.
a. Adding water to toothpaste helps protect against cavities.
b. This man is guilty of murder.
c. The use of social media makes users depressed.
d. The new drug is more effective than the old drug.
e. The new drug is more dangerous than the old drug.
f. Reading books promotes happiness.
6.18 Scientific journals tend to publish statistically significant results much more often than
they publish findings of statistical insignificance. Why do you think this might be?
Considering the earlier discussion about power, type I and II errors, and effect size,
can you think of any potential problems with this practice?
6.19 Classify each of the following statistical techniques as belonging to descriptive sta-
tistics, statistical estimation, or statistical hypothesis-testing. Give your rationale for
each answer.
a. Displaying a data set in a chart
b. Surveying a group about their pizza preferences to decide if they have an
unusual preference for anchovies
c. Surveying a group about their pizza preferences in order to place an order
d. Calculating the sample mean and sample standard deviation
e. Surveying a group about their pizza preferences in order to guess what all
Canadians’ pizza preferences are
f. Finding the mean level of preference for anchovies on pizza among a group
and the standard deviation in that level of preference
g. Rejecting a null hypothesis on the basis of data
h. Finding the correlation coefficient of a data set
6.20 In Chapters 5 and 6, we have seen different formulas for mean and standard devia-
tion and different uses for these as well. Write out the proper mean and standard
deviation formulas for each of the following: (a) representing the frequency distribu-
tion of a data set, (b) estimating mean and standard deviation in a population from
a sample, and (c) establishing the probability distribution for outcomes given the
null hypothesis. Describe how each relates to observed frequency distributions and
probability distributions. Finally, describe how the differences among the formulas
relate to their different uses.

6.3 A DIFFERENT APPROACH TO STATISTICAL INFERENCE

After reading this section, you should be able to do the following:

• Describe three problems with classical statistics


• Define Bayes’s theorem, prior probability, posterior probability, and Bayesian condition-
alization and describe how these relate to one another

Potochnik, Angela, et al. Recipes for Science : An Introduction to Scientific Methods and Reasoning, Taylor & Francis Group, 2018. ProQuest Ebook
Central, http://ebookcentral.proquest.com/lib/purdue/detail.action?docID=5584122.
Created from purdue on 2021-08-26 19:08:03.
Statistical Inference 233

• Calculate a Bayes factor to compare how much an observation favors different hypotheses
• Calculate a posterior probability using provided statistical information and Bayes’s
theorem
• Describe some advantages and disadvantages of Bayesian inference

Classical Statistics and Its Problems


The statistical hypothesis-testing described so far is part of classical statistics, also called
the frequentist interpretation of statistics. Classical statistics is so called because it is more
or less standard—at least presently. But it’s not the only game in town, and someday, it may
no longer be the standard way of doing statistics. We describe a different approach, called
Bayesian statistics, in this section. The point is to show that there is more than one
approach to statistics and to discuss some of the pros and cons of each.
Classical statistics has several problems; we’ll discuss three. To start, think back to
the last step of the procedure of statistical hypothesis-testing. What results from this
application of inferential statistics is a p-value—that is, the probability of the observation
occurring given that the null hypothesis is true—which we compare with a pre-established
significance level. There are three oddities about this, all of which relate to problems with
classical statistics.
Here’s the first problem with p-values. Scientists’ primary interest in statistical hypothesis-
testing is to figure out which hypotheses are true. But a p-value doesn’t indicate how probable
the hypothesis itself is, that is, how likely the hypothesis is to be true. It only indicates how
probable the observation is (which we already know occurred) given the null hypothesis. If
the p-value is small enough, we can decide that the observation we made is so unlikely given
the truth of the null hypothesis that we should reject the null hypothesis. But we still don’t
know anything about the chance the null hypothesis is true (Cohen, 1994).
In some circumstances, this can lead to a type I or type II error—erroneously reject-
ing or not rejecting the null hypothesis. This is a second problem with classical statistics:
because this procedure doesn’t find the likelihood of the null hypothesis, it doesn’t take
into account any information we might have in favor of—or against—the truth of the
null hypothesis. For example, in our tea-tasting example, it may be the right decision not
to reject the null hypothesis even though your friend guesses correctly so often. What
we know about how tasting works and about what properties a cup of tea can and can’t
have suggest this tasting feat should be nearly impossible. Maybe your friend was just
extraordinarily lucky, or maybe she had a way of cheating. Regardless, perceptually dis-
cerning tea-first or milk-first by taste seems nearly impossible. In contrast, a friend who
could discern two different types of wine in a blind taste test wouldn’t really be that
surprising. The same success rate may thus lead us to want to reject the null hypothesis
of random guessing for wine-tasting but not reject the null hypothesis of random guessing
for tea-tasting. That is, we may have different expectations—prior to any experimenta-
tion—regarding different hypotheses. The trouble is that statistical hypothesis-testing as
we have described it doesn't have any way to take these different expectations into account
(Lindley, 1993).
Finally, here’s a third problem. The probability of the observation given the null
hypothesis, the p-value, doesn’t directly relate to the alternative hypothesis at all. This only


tells you something about the relationship between the null hypothesis and the observed
data. And yet, the alternative hypothesis, the bold and speculative conjecture, is what
scientists are truly interested in knowing about. How likely is the alternative hypothesis
to be true? This is the million-dollar question in hypothesis-testing. But classical statistics
gives us no way to answer that question.

The Bayesian Approach


The Bayesian approach to statistics solves all three of these problems with classical statis-
tics, as well as some other problems not discussed here. Bayesian statistics aims to deter-
mine when an observation counts as evidence for one hypothesis and against a competing
hypothesis and how that observation should change our degree of belief that each of these
competing hypotheses is true. Bayesian statisticians do not use p-values and significance
testing, and they do not attempt to reject null hypotheses. For Bayesians, an observation
counts as evidence for a hypothesis when it raises the probability of the hypothesis. This
should, in turn, influence our degree of belief in the hypothesis.
Here’s a more exact statement of the relationship between the probability of a hypoth-
esis and the degree of belief in the hypothesis. In Bayesian statistics, the probability of
a hypothesis is understood as a rational degree of belief that the hypothesis is true. The
terms belief and rational here are technical terms, and so do not have their ordinary mean-
ings. Beliefs are rational when they follow the rules of probability theory. The rational
degree of belief in a hypothesis H after observation O is just the conditional probability
of H given O. Remember that this, Pr(H|O), is what we criticized classical statistics for
not being able to provide.
This conditional probability can be calculated using Bayes’s theorem. Thomas Bayes,
an English statistician, philosopher, and Presbyterian minister, proved this theorem in the
18th century. It is a mathematical formula that can be derived from the definition of a
conditional probability. Here is Bayes’s theorem in its simplest formulation:

Pr(H|O) = Pr(O|H)Pr(H) / Pr(O)

This formula states that the probability of a hypothesis given an observation—which is


what we want to find out—is the probability of the observation given the hypothesis
multiplied by the probability of the hypothesis and then divided by the probability of
the observation.
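
To make the arithmetic concrete, here is a minimal sketch of Bayes's theorem in Python. The function name and the numbers are invented purely for illustration; they come from no particular study.

def bayes_posterior(prior_h, pr_o_given_h, pr_o):
    # Bayes's theorem: Pr(H|O) = Pr(O|H) * Pr(H) / Pr(O)
    return pr_o_given_h * prior_h / pr_o

# Illustrative values only: Pr(H) = .3, Pr(O|H) = .8, Pr(O) = .5
print(bayes_posterior(0.3, 0.8, 0.5))  # 0.48: observing O raises Pr(H) from .3 to .48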
Pr(H) is called the prior probability of the hypothesis; Pr(H|O) is called the posterior
probability of the hypothesis. This is because Pr(H) is our rational degree of belief before
making the observation, that is, prior to the observation, while Pr(H|O) is our rational
degree of belief after (posterior to) making the observation. Taking into account the
prior probability of a hypothesis enables us to hold implausible hypotheses to a higher
standard of evidence than plausible hypotheses are held to. We’d look for more support
before agreeing that our friend can tell the order in which milk and tea were added to
her cup than we would before agreeing that our friend can tell the difference between
the tastes of two different kinds of wine. Our prior probability for the former is lower
than it is for the latter.


FIGURE 6.7 Thomas Bayes

Bayes’s theorem takes three things as input: the prior probability of the hypothesis
under investigation, Pr(H); the probability of the observation given the hypothesis (in
other words, if the hypothesis is true), Pr(O|H); and the probability of the observation
under all possible hypotheses, Pr(O). When these numerical values are available—a
major source of controversy for Bayesian statistics—we can use them to calculate the
probability of the hypothesis given the observation that has been made (or the data
gathered). And this, again, is the main thing scientists want to discover from statistical
hypothesis-testing.
If Pr(H|O) > Pr(H), then we say that the observation O confirms hypothesis H.
That is, an observation confirms a hypothesis if the probability of the hypothesis,
a rational degree of belief that the hypothesis in question is true, goes up once
the observation has been made. So, comparing the prior and posterior probabilities
shows us whether an observation confirms or disconfirms a hypothesis and by how
much. A big increase in probability implies a large degree of confirmation, and a
small increase implies a small degree of confirmation; a big decrease in probability
implies a large degree of disconfirmation, and a small decrease implies a small degree
of disconfirmation.
Comparing Support for Different Hypotheses


The core of Bayesian statistics we just described can be used in numerous ways in statisti-
cal hypothesis-testing. Let’s now turn to two main ones.
First, Bayesian statistics can be used to calculate the degree to which some observation
or data set favors one hypothesis over another. In theory, posterior probabilities can be
calculated for any number of hypotheses from the same observation, taking into account
the prior probabilities of the various hypotheses and the probability of the observation
given each of the various hypotheses. These posterior probabilities can then be com-
pared with one another to find which hypothesis is more likely, taking into account


the observation that’s been made. Unlike classical statistics, this provides a comparative
approach to hypothesis-testing.
A second use is a kind of shortcut: instead of comparing the posterior probabilities of
different hypotheses, we compare the probability of the observation given each hypoth-
esis, or Pr(O|H1) versus Pr(O|H2). These probabilities are usually easier to find
than posterior probabilities.
Consider this example. Imagine that Lasha and Janine are interested in public opinion about
the theory of evolution. Based on their separate research, Lasha believes that 70% of the public is
convinced by the theory of evolution, while Janine believes that 60% of the public is convinced.
They decide to query 100 randomly selected people about their opinions. Based on their differ-
ent hypotheses, Lasha and Janine can predict what they will observe: Lasha predicts that about
70 / 100 will say they believe the theory of evolution; Janine’s prediction puts that number
at about 60 / 100. In fact, using tools of inferential statistics described earlier in this chapter,
we can find the probability distribution each predicts for this random sample of 100 people.
As it turns out, of the 100 people in the sample, 62 said they are convinced by the
theory. According to the probability distribution based on Lasha’s hypothesis of 70% belief
in evolution, this observation has a probability of .02, that is, Pr(O|H1) = .02. According
to the probability distribution based on Janine’s hypothesis of 60% belief in evolution,
this observation has a probability of .08, that is, Pr(O|H2) = .08.
An observation favors one hypothesis over a second hypothesis to the degree that the
first hypothesis predicts the observation better than the other hypothesis. Given Janine’s
hypothesis, the observed result is much more likely than it is given Lasha’s hypothesis.
This can be expressed numerically with the Bayes factor, the ratio of the probability of
the observation given one hypothesis to the probability of the observation given the
other. The Bayes factor expresses the discriminatory power of the evidence with respect
to the two hypotheses. In this case, putting Janine's hypothesis in the numerator, the
Bayes factor is Pr(O|H2) / Pr(O|H1) = .08 / .02, or 4. This means that the result of the
survey favors Janine's hypothesis over Lasha's by a factor of four.
Here’s a shorthand method for calculating the Bayes factor in circumstances like this
(random sampling, independent outcomes, and different hypotheses about the distribution
of the values of a binomial random variable). If Lasha’s hypothesis is right, each individual
has a 0.7 probability of saying he or she believes the theory of evolution; if Janine’s hypoth-
esis is right, each individual has a 0.6 probability of saying he or she believes the theory of
evolution. The Bayes factor can be found by comparing these probabilities. In particular:

[Pr(yes|H2)^(# of yes answers) × Pr(no|H2)^(# of no answers)] /
[Pr(yes|H1)^(# of yes answers) × Pr(no|H1)^(# of no answers)]

In this case, this is (.6^62 × .4^38) / (.7^62 × .3^38), which is approximately 4.
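
To check this shorthand yourself, here is a small Python sketch (our own, for illustration) that computes the two binomial likelihoods and takes their ratio. The binomial coefficient C(100, 62) appears in both likelihoods and cancels in the ratio, which is why the shorthand can leave it out.

from math import comb

def binom_pmf(k, n, p):
    # probability of exactly k 'yes' answers among n independent respondents
    return comb(n, k) * p**k * (1 - p)**(n - k)

pr_O_given_H1 = binom_pmf(62, 100, 0.7)  # Lasha's hypothesis: about .02
pr_O_given_H2 = binom_pmf(62, 100, 0.6)  # Janine's hypothesis: about .08
print(pr_O_given_H2 / pr_O_given_H1)     # about 4: the data favor Janine's hypothesis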

Bayesian Belief Updating


A second use of Bayesian statistics is to guide how our beliefs should be updated in light of
new evidence. This is simple, assuming we can calculate posterior probabilities. The rule is
called Bayesian conditionalization, and it states that the new degree of belief in a hypoth-
esis H ought to be equal to the posterior probability of H: Pr_new(H) = Pr(H|O). Every time
we make observations, we learn something new, so we should update our beliefs accordingly.


Bayesian conditionalization shows us how. These updated beliefs are our new prior prob-
abilities for hypotheses, which are then the basis for assessing how to respond to the next
observation. There’s even a slogan for this: ‘Today’s posteriors are tomorrow’s priors’.
Here’s an example. Around age 40, most women undergo routine mammography
screening. Mammograms are x-ray photographs of the breast tissue, which can be used
to screen for breast cancer in women who otherwise have no signs or symptoms of the
disease. Suppose you are a doctor and that one of your patients is a 50-year-old woman
with no symptoms who is participating in routine mammography screening. She tests
positive. She is alarmed for obvious reasons and immediately wants to know from the
doctor—you—whether she has breast cancer. You can’t tell her that without more test-
ing, but you can tell your patient the probability that she has breast cancer given the
positive test result and the probability that the result was a false positive. That is, you
can calculate Pr(H1|O) and Pr(H2|O), where the first hypothesis is that she has breast
cancer and the second hypothesis is that it was a false positive. You need three pieces of
information for the calculation:

1. The probability that a 50-year-old woman has breast cancer is around 1%.
2. If a woman has breast cancer, the probability that she tests positive is around 90%.
3. If a woman does not have breast cancer, the probability that she tests positive anyway
is around 9%.

Given this data set (which is always being updated; visit <www.cancer.gov/types/breast>
for current statistics), how should you answer the patient’s questions in light of the
screening results?
Here, again, is Bayes’s theorem: Pr(H|O) = Pr(O|H)Pr(H) / Pr(O). This theorem can
be rewritten in a form that’s easier for the task at hand:

Pr(H|O) = Pr(O|H)Pr(H) / [Pr(O|H)Pr(H) + Pr(O|not-H)Pr(not-H)]

Pr(O|H) and Pr(O|not-H) are the probabilities of the observation given a specific hypothesis
and given the negation of that hypothesis, much like the alternative and null hypotheses.
That is the situation in this example. The two hypotheses under consideration are that your patient has
breast cancer (H1) and that the test was a false positive (H2), which is another way of saying
that your patient doesn’t have breast cancer. This version of Bayes’s theorem simplifies the
calculation by eliminating the need to find Pr(O), the overall probability of the observation.
Your patient is looking for Pr(H1|O) and Pr(H2|O), so we’ll need to use Bayes’s theo-
rem on each hypothesis. To start, for each, we need to find the prior probability of the
hypothesis in question and the probability of the observation given the hypothesis. From
these numbers, we can calculate the posterior probability of each hypothesis, given the
observation of the positive test result.
For the first hypothesis, that your patient has breast cancer, the prior probability, Pr(H1),
is the rate of breast cancer in the general population (#1 above). Before the exam, the
rational degree of belief in the hypothesis that your patient has breast cancer is just the
disease’s incidence in the population, so Pr(H1) = .01. The likelihood of the positive test
result given the first hypothesis (that is, if it’s true that your patient has cancer) is 90%
(#2 above). So, Pr(O|H1) = .90.


For the second hypothesis that the test was a false positive, the prior probability,
Pr(H2), is the rate in the general population of not having breast cancer, which is 100%
of the population minus the 1% that does have breast cancer, or 99%. So, Pr(H2) = .99.
The likelihood of the positive test result given the hypothesis of a false positive is 9%
(#3 above). So, Pr(O|H2) = .09.
Now we can calculate both Pr(H1|O) and Pr(H2|O):

Pr(H1|O) = Pr(O|H1)Pr(H1) / [Pr(O|H1)Pr(H1) + Pr(O|H2)Pr(H2)]
         = (.90 × .01) / [(.90 × .01) + (.09 × .99)] = .009 / (.009 + .0891)
         = .009 / .0981 = .0917

Pr(H2|O) = Pr(O|H2)Pr(H2) / [Pr(O|H2)Pr(H2) + Pr(O|H1)Pr(H1)]
         = (.09 × .99) / [(.09 × .99) + (.90 × .01)] = .0891 / (.0891 + .009)
         = .0891 / .0981 = .908

We are imagining that your patient has just tested positive for breast cancer. We have
found that, given this positive result, she has a .0917, or 9.17%, chance of having breast
cancer and a .908, or 90.8%, chance of getting a false positive on the test. It’s true your
patient should be concerned; her chance of breast cancer just increased from 1% to more
than 9%. But she shouldn’t be as concerned as she no doubt is: there’s no guarantee she
has breast cancer, and in fact, there’s over a 90% chance that she does not.
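
The same calculation can be written as a short Python sketch using only the three numbered figures above; the variable names are ours:

prior_cancer = 0.01          # 1: base rate of breast cancer at age 50
pr_pos_given_cancer = 0.90   # 2: chance of a positive test if cancer is present
pr_pos_given_healthy = 0.09  # 3: chance of a false positive

# denominator of Bayes's theorem: total probability of a positive test
pr_pos = (pr_pos_given_cancer * prior_cancer
          + pr_pos_given_healthy * (1 - prior_cancer))

print(pr_pos_given_cancer * prior_cancer / pr_pos)         # about .0917, Pr(H1|O)
print(pr_pos_given_healthy * (1 - prior_cancer) / pr_pos)  # about .908, Pr(H2|O)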

Problems with Bayesian Inference


In describing Bayesian statistics, we’ve also gestured at a few of its benefits over classical
statistics. But Bayesianism faces its own problems. For one thing, it is often criticized for
its apparent lack of objectivity. There’s often not enough information to have objective
grounds for the prior probability—the probability of the hypothesis before data are gath-
ered. In the mammogram/breast cancer example, we could calculate prior probabilities
from well-established facts about the incidence of breast cancer in the general population
and the commonness of false positives for mammograms. But that data is unavailable for
many hypotheses. Recalling the earlier example, where do Lasha and Janine get their
different ideas about the percentage of people convinced by evolutionary theory?
Without clear, objective information to guide the selection of prior probabilities, indi-
vidual biases and subjective values can find their way in. This is a problem because prior
probabilities influence posterior probabilities, and so subjective starting points can find
their way even into conclusions based on data. This possibility seems to undermine the
objectivity of Bayesian reasoning. This is perhaps the main challenge facing Bayesian
statistics, and it’s received a lot of attention.
Some responses to this challenge about subjectivity have been to develop rules for
how prior probabilities should be established. A different kind of response is to argue that
the variability in prior probabilities is a good thing. Different people often have different
background beliefs, and one might think these different background beliefs should be
taken into account. Different choices of priors make it transparent how two scientists’
judgments differ. So instead of lurking in the background, with unclear influence on sci-
ence, different background beliefs and how they influence scientific judgment are brought


into explicit consideration by Bayesian statistics. What’s more, this transparency in prior
beliefs enables rational disagreement. Scientists should be able to provide justification for
particular choices of prior probabilities, articulating what sorts of theoretical or empirical
considerations informed their choice. In this respect, Bayesian and classical statistics are in
similar situations. When testing hypotheses or making general inferences, scientists using
classical statistics must decide on sample size, which kind of statistical test to employ, and
so forth. These decisions are also open to criticism, and scientists making these decisions
should be able to justify them.
However, some remain unconvinced by this argument for subjectivity. The choice
of prior probabilities in Bayesian statistics is a kind of direct influence of background
beliefs on scientists’ beliefs about hypotheses under investigation, which many scien-
tists are uncomfortable with (Gelman & Hennig, 2017). And so far, no rule for how
prior probabilities should be established is both broadly applicable and enjoys broad
support.
A second problem for Bayesian statistics is that it’s not obvious that Bayesian con-
ditionalization, in which one updates one’s belief in accordance with posterior prob-
abilities, is always the right thing to do. Some have suggested that abductive reasoning,
or inference to the best explanation, is a better alternative. Recall from Chapter 4 that
when people engage in abductive reasoning, they use explanatory considerations as
evidence to support one hypothesis over others. You see cheese crumbs, small drop-
pings, and some chewed-up paper, and so you might reason that a mouse resides in
your kitchen. But does that inference follow Bayesian conditionalization? It’s not clear
it does. The kind of work and reasoning performed by some scientists, such as paleon-
tologists, is akin to CSI-style forensic work. They gather different pieces of evidence
from several fields, and on the basis of that evidence and explanatory considerations,
they weed out implausible hypotheses and develop the most plausible hypothesis
about the distant past of life on Earth. Bayesian conditionalization may not capture
this explanatory leap.
There is no universal method for statistical inference. There are different approaches
to classical statistics, an alternative framework of Bayesian statistics, and even differ-
ent approaches to Bayesian statistics. All of these offer tools that scientists can use
in hypothesis-testing, depending on the type of hypothesis to be tested, the type of
experiment or observational study that will be run, and the nature of the relevant back-
ground knowledge. Bayesian statistics is perhaps a better guide to belief than classical
statistics when prior probabilities can be reliably estimated, as with medical diagnoses
based on epidemiological studies. The classical statistics method of hypothesis-testing
we described earlier in the chapter is perhaps better when there is little background
knowledge to draw upon or when scientists are unable to specify multiple alternative
hypotheses.
The statistical toolbox is large, with many different tools. It contains many forms of
inferential statistics, which we’ve barely scratched the surface of in this chapter. It also
contains descriptive statistics, as introduced in Chapter 5, as well as other tools we haven’t
discussed. Statistics is not the mindless application of mathematical formulas but careful
scientific work influenced by scientists’ aims and concerns, just as are other forms of sci-
entific reasoning. And, as we emphasized at the outset of Chapter 5, statistical reasoning
is an important form of literacy in today’s world.


EXERCISES

6.21 In your own words, describe three problems for the classical statistics approach to
hypothesis-testing.
6.22 Write out the mathematical formula for Bayes’s theorem, then state what it means in
your own words. Write out the definition of conditional probability from Chapter 5.
Bayes’s theorem can be derived from this definition; describe anything you notice
about how the two relate.
6.23 Describe two ways in which Bayes’s theorem can be used in inferential statistics,
including what you can accomplish with each. Illustrate each of these two main uses
of Bayes with a simple example (imaginary or real).
6.24 Suppose that you are being screened for a disease that affects about one person in
1,000. You have no symptoms, and the test is accurate 90% of the time. That is, if you
actually have the disease, then the test result is positive with 90% probability, and if
you do not actually have it, the test result is negative with 90% probability. After several
anxious minutes, the test results come back: positive! How worried should you be?
a. Find the prior probability of the hypothesis that you have the disease, the prior
probability of the hypothesis that you don’t have the disease, the probability of
the test result given the hypothesis that you have the disease, and the probability
of the test result given the hypothesis that you don’t have the disease.
b. Use Bayes’s theorem with these probabilities to calculate your chance of having
the disease given your positive test result. Describe how concerned you think
you should be in light of your positive test result.
c. Consider that, out of 1,000 people, 100 will test positive. About how many of
those people will actually have the disease? Does this consideration change
your reasoning in (b)?
6.25 A small company has bought three software packages to solve an accounting prob-
lem. These packages are called Fog, Golem, and Pear. On first trials, Fog crashes
10% of the time, Golem 20% of the time, and Pear 30% of the time. Of 10 employ-
ees, six are assigned to Fog, three are assigned to Golem, and one is assigned to
Pear. Jan was assigned a program at random. It crashed on the first trial.
What is the probability that Jan was assigned Pear? You can answer this question
by finding the posterior probability of Jan being assigned to Pear given that the
program crashed from the prior probability of Jan being assigned to Pear and the
overall chance of one of the three programs crashing.


6.26 Seamus and Amanda have different opinions regarding public support for a smok-
ing ban in restaurants and pubs. Seamus believes that 75% of the people in town
support the ban; Amanda thinks that only 50% support the ban. They decide to ask
100 randomly selected people; 65 are in support of the ban, 35 against it.
a. Calculate the Bayes factor. Is Seamus’s hypothesis or Amanda’s hypothesis
more favored by the data?
b. How would the Bayes factor change if a single survey participant changed his
or her opinion from a ‘yes’ to a ‘no,’ resulting in 64 in support of the ban and
36 opposed? Calculate Bayes factors for this alternative outcome.


c. Do you find the change in Bayes factor in this alternative scenario surprising?
Why or why not?
6.27 Imagine that you are a lawyer with a client who has been accused of commit-
ting a heinous crime. Your client’s DNA matches some of the traces found on
the victim. This is the only piece of evidence against her, but it is a serious one.
The court is told that the probability that this match occurred by chance is one in
100,000 (or 0.001%). Do you believe this proves your client is guilty? Why, or why
not? (Hint: consider what the numbers mean in terms of frequencies. Out of every
100,000 people, one will show a match. If you live in a city with two million people,
for example, how many will have DNA matching the trace on the victim?)
6.28 In your own words, describe (a) three different types of problems for the Bayesian
approach to statistics and (b) three different advantages that the Bayesian has over
the classical approach to statistical testing.

FURTHER READING

For a historically informed treatment of different approaches to statistical inference and


the relationships among them, see Gigerenzer, G. (1993). The superego, the ego, and
the id in statistical reasoning. In G. Keren & C. Lewis (eds.), A handbook for data
analysis in the behavioral sciences: Methodological issues (pp. 313–339). Hillsdale:
Erlbaum.
For an example from economics of the difference between statistical significance and
scientific significance, see McCloskey, D., & Ziliak, S. (1996). The standard error of
regressions. Journal of Economic Literature, 34, 97–114.
For more on Bayesianism, see Howson, C., & Urbach, P. (2006). Scientific reasoning: The
Bayesian approach (3rd ed.). La Salle: Open Court. For a more concise overview,
see Hartmann, S., & Sprenger, J. (2010). Bayesian epistemology. In S. Bernecker and
D. Pritchard (eds.), Routledge companion to epistemology (pp. 609–620). London:
Routledge.
For more on the classical approach to statistical inference and a vigorous critique of the
subjective Bayesian approach, see Mayo, D. G. (1996). Error and the growth of experi-
mental knowledge. Chicago: University of Chicago Press.
For an accessible treatment of an approach to statistics focused on the notions of effect
sizes, confidence intervals, and meta-analysis, see Cumming, G. (2013). Understand-
ing the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York:
Routledge.

CHAPTER 7

Causal Reasoning

7.1 WHAT IS CAUSATION?

After reading this section, you should be able to do the following:

• Describe the difficulty about causal claims that worried David Hume
• Give three reasons why correlation and probabilistic dependence don’t guarantee
causation
• Describe the physical process and difference-making accounts of causation
• Indicate how each of the following informs the investigation of causal relationships:
spatiotemporal contiguity, correlation, probabilistic dependence, causal background
• Analyze whether a cause is necessary, sufficient, or probabilistically related to an
effect, and gauge the strength of a probabilistic causal relationship

Does Fracking Cause Earthquakes?


Hydraulic fracturing, or ‘fracking’, is used across North America and in several other
places in the world to unearth natural energy resources. Fracking involves drilling into the
earth, and then using high-pressure injections of sand and water treated with hydrochloric
acid and other chemicals to break up shale formations. This produces small underground
explosions, which result in the release of oil or gas for corporate capture.
Over the last decade, concerns about fracking have increased. In part, this is because
fracking is correlated with increased seismic activity, as well as environmental con-
tamination, habitat loss, and damage to surrounding surface structures. In the US, for
instance, Oklahoma has become infamous for a sudden increase in earthquakes thought
to be linked to fracking. From 1978 to 1999, Oklahoma averaged approximately one
earthquake of at least 3.0 magnitude per year. By 2009, that average had surged to
over 20 such earthquakes per year—20 times the past rate. Since 2009, Oklahoma has
experienced approximately 2,300 earthquakes of 3.0 or greater magnitude. It now has
earthquakes more frequently than Pacific Ring of Fire states like California, historically
known for their earthquakes.
These dramatic numbers, shown in Figure 7.1, provide sufficient reason to rule out the
possibility that this is merely an unlucky coincidence. Statistical hypothesis-testing would
undoubtedly show the increase in seismic activity in Oklahoma to be due to something
other than chance. But is the increase because of fracking or something else? Does frack-
ing cause earthquakes?

FIGURE 7.1 Annual seismic activity in Oklahoma 1978–2017
Sources: USGS-NEIC ComCat & Oklahoma Geological Survey; Preliminary as of July 4, 2017

FIGURE 7.2 USGS map showing locations of wells related to seismic activity 2014–2015


The answer may seem obvious. How could there be such a dramatic rise in earthquakes
in Oklahoma if fracking weren’t the cause of them? Lobbyists and other advocates for the
US oil and gas industry are quick to remind everyone that correlation does not guarantee
causation. Just because fracking has increased and seismic activity has increased—that is,
just because these are correlated—doesn’t mean that one caused the other. Some other
unknown cause might be responsible for the earthquakes. However, while not all cor-
related types of events are causally related, correlation does raise the question and can
even be the proverbial ‘smoking gun’ for causation. We need to look more closely to
know whether fracking causes earthquakes or if there is an alternative explanation for
the correlation. And, in fact, the answer is a bit subtle.
There is evidence of some type of relationship between fracking in particular and the
uptick in seismic activity in Oklahoma. But how are they related? Since 2009, most of
the Oklahoma earthquakes have been located very close to fracking wells, which pump
massive volumes of liquid up to the surface. The spatial correlation of wells and earth-
quakes adds some support for the idea that fracking is, in some sense, involved in the
rising numbers of earthquakes.
However, geologists, hydrologists, and the other scientists involved in the US Geological
Survey—a federal agency devoted to the scientific study of the American landscape and
the natural hazards that can threaten it—have concluded that earthquakes resulting
directly from fracking tend to be relatively minor in Oklahoma. So fracking operations
are highly correlated with the dramatic increase in seismic activity in Oklahoma, but they
are not directly responsible for causing many of those earthquakes. Instead, wastewater
injection from both fracking and non-fracking wells appears to be more directly respon-
sible for the increase in earthquakes in Oklahoma.
During hydraulic fracturing, some of what’s pumped up is oil, while some is the by-
product of fracking: salty, sandy, chemically treated wastewater. After capturing oil and
gas, corporations inject large volumes of this wastewater back into disposal wells. Doing
this raises the pressure within the pores of a hydrocarbon reservoir over large areas, which
tends to shift subterranean stress. And shifting stress tends to destabilize preexisting faults.
This subterranean stress from wastewater injections back into the Earth’s sedimentary
formations has been implicated as one cause of seismic activity. The US Geological Survey
results identifying wastewater injection as the primary cause of the significant uptick in
Oklahoma earthquakes are fairly conclusive.
Of course, energy industry operations involved in fracking are still the culprit. Even if
fracking does not directly cause earthquakes, it is not causally unrelated either. Cease all
fracking activity, and the volume of wastewater injected back into the earth will signifi-
cantly diminish; so too will the risk of earthquakes. So while not solely to blame, fracking
is one of several oil and gas operations that are together causing increased seismic activity.
Over the last decade, Canada has seen a similar increase in seismic activity, which has
also been tightly correlated in time and space with fracking. Researchers documented
more than 900 seismic events near shale drilling sites in northwest Alberta and observed
a pattern between the timing of fracking operations and the timing of earthquakes (Bao
& Eaton, 2016). In this case, scientists found that both the increase in pressure during
fracking operations and the increase in pressure from wastewater injection induced seis-
mic activity. The triggers for induced seismicity in Alberta may be different from those
in Oklahoma.


Thus, existing evidence indicates that fracking does play some causal role in producing
seismic activity but that the role it plays may not be simple. Fracking's causal role may
be modulated by other factors, such as the local geology of Alberta versus Oklahoma,
and the extent to which it plays that causal role may be changed by other causes of
earthquakes, like wastewater injections back into the earth. It is an important task for
seismologists, geologists, and hydrologists to clarify the complex web of causal relation-
ships that lead to earthquakes. More generally, unravelling the causes underlying complex
phenomena like polio or cholera epidemics, global warming, or economic crises is usually
a tricky process; scientific investigation is our best hope for doing so.

Scientific Reasoning about Causes


Scientific investigation of the causal consequences of fracking illustrates three general
characteristics of causal reasoning in science and in everyday situations.
First, causal relationships are learned on the basis of information about the timing, loca-
tion, and frequency of events. We described how scientists considered the timing, location,
and frequency of seismic activity to help discern fracking’s role. Correlation in time or in
location can be suggestive of a causal relationship. If you get sleepy after lunch, perhaps
eating lunch is the cause of your drowsiness. If you get a rash where your arm brushed
against an unknown plant in your backyard, perhaps contact with that plant caused the
rash. Similarly, an increase or decrease in the frequency of some type of event—like
earthquakes—draws our attention to what else changed during that time. This is also a
form of correlation—correlation in the frequency of two outcomes. (Recall our discussion
of statistical correlation in Chapter 5; two variables or events are correlated when higher
values of one are related to either higher or lower values of the other.)
Second, testing causal hypotheses often involves doing something in the world, such
as performing an intervention. Varying a suspected causal factor while leaving other fac-
tors unchanged can provide more insight into causal relationships than just observing
a correlation in the frequency of two outcomes. For instance, scientists could test the
hypothesis that wastewater injection is the main trigger of earthquakes by performing
fracking operations but not injecting the wastewater into disposal wells. If seismic activ-
ity is significantly reduced, this suggests wastewater injection is largely to blame for the
earthquakes in that region. If not, other possible causes, including fracking itself, should
be considered.
Third, causal reasoning has great practical significance. Knowing about causes is how
we can make things happen—and prevent things from happening—in the world. Besides
the effects of fracking on seismic activity and other features of our environment, causal
reasoning is also crucial for inferring the effects of economic policies like tax rates, for
inferring medical conditions from symptoms, and for establishing legal responsibility or
liability, among many other things. Good causal reasoning thus can be an urgent matter
of scientific and practical importance.
It’s no wonder, then, that causal reasoning is a central feature of science. This chapter
explores how the scientific tools encountered thus far in this book—especially experi-
mentation and observational studies, modeling, and inference using logic, probability, and
statistics—are used to identify causal relationships. This chapter thus refers to several ideas
from earlier chapters that are helpful to clarify what’s involved in good causal reasoning.


Skepticism about Causation


Imagine the following scenario. You are playing a game of billiards at your local pool
hall. You hit the cue ball, which then rolls across the felt and strikes the 8-ball, which is
itself then set in motion. Did the cue ball cause the 8-ball to move? The answer seems
obvious. What else could have possibly made the 8-ball move?
There’s a worrisome issue here though—one that has some significance for science.
The Scottish philosopher David Hume (1711–1776) challenged us to examine what our
experience allows us to know about the nature of causal relationships. Hume argued
that experience doesn’t tell us much. When your cue ball hit the 8-ball, you only saw a
series of events, one after another. You saw the cue ball moving toward the 8-ball, the
cue ball touching the 8-ball, and then the 8-ball itself moving. Where was the causation
in all of that? What makes you think there is any ingredient above and beyond just a
series of events?
Hume agreed that we regularly experience a constant association between two events
and their correlation in space and time. Cue balls hit 8-balls, earthquakes follow frack-
ing operations, you experience drowsiness after lunch, and so on. Hume would not have
denied that there is a regular association between such events. What he doubted was that
there is anything more beyond just those events occurring together in a certain order; all
you really discern when you perceive causation is constant association. Hume doubted
that there was anything special to call causation (Hume, 1738).
We’re not going to address Hume’s concern head-on. Instead, we are going to use this
as a challenge for us to get very clear about the specifics of causal reasoning. Causal claims
are important to science, and they have real practical significance as well. But it’s tricky
to say what, exactly, causation amounts to. What is it, exactly, that you would look for to
decide whether smoking causes cancer? There are smokers who are in excellent health,
and there are people with cancer who have never smoked. And even when smokers
develop cancer, how are we to say smoking was responsible for their cancer?

Spatiotemporal Contiguity as a Guide to Causation


The perception of causal relationships is a robust, automatic, and often reliable process. It
was systematically investigated by the Belgian psychologist Albert Michotte (1881–1965)
in the 1940s. Michotte’s experiments showed that it’s very hard not to perceive certain
sequences of events as involving causation (Michotte, 1962). These experiments also show
that causal perception depends on spatial and temporal information. If two events—for
example, pressing a piano key and hearing B sharp—are spatially and temporally contigu-
ous, that is, if they happen at the same time and place, then we perceive them as causally
related without requiring repeated exposure to those events. When there is a spatial or
a temporal gap between two events, we are much less likely to perceive the one event
as causing the other.
Although spatiotemporal cues can be an important element of the perception of causal
relations, they are not always a reliable guide. Sometimes they mislead us. It can be
mistaken to conclude that one event causes another simply because the events occur in
succession close to each other. A child in Oklahoma might stamp her foot right before
an earthquake, but we know the stamp couldn’t have caused the quake. The mistake of


reasoning from spatiotemporal succession to causation is named, from Latin, the post hoc,
ergo propter hoc fallacy (‘after this, therefore because of this’).
So, spatiotemporal contiguity doesn’t guarantee causation. It’s not necessary for cau-
sation either. Many causes are separated from their effects in time and even space. For
instance, when you hang out with a friend who has the flu, you may begin to feel ill a
few days later. In this case, your friend’s flu caused your own illness, despite an interven-
ing delay. And when you play a video game, pressing buttons on a remote control causes
changes in the game, even though the two events happen in different places.
Indeed, many of the cause-effect relationships investigated in science, and important
for everyday life, are spatiotemporally separate to some degree. Sometimes the degree
of temporal separation is used to distinguish among the causes of an event. Proximate
causes are those that occurred more closely in time and place to the event that was
caused, while distal causes occurred further back in time or place from their effects. For
example, when asked about the cause of your illness, you may cite your friend’s recent
case of the flu. Or you might instead reply that we’re in the midst of flu season, and this
year’s seasonal flu has spread extensively. The former cause is proximate, the latter distal.
As the fracking example illustrates, identifying a cause of some event doesn’t imply
that you have identified the cause or have ruled out other causes. The distinction between
proximate and distal causes shows that any event may have multiple causes. One way to
think about this is in terms of ‘chains’ of causation—like neural firings that contracted the
muscles, that moved the hand, that pushed the cue stick, that hit the cue ball, that hit
the 8-ball into the corner pocket. Such causal chains go back and back and back. Another
way to think about the multiple causes of some event is in terms of complex webs and
networks: all the different factors that contributed to bringing about some outcome. The
8-ball’s moving was caused not only by my cue ball hitting it, but also by my choosing
to go to the pool hall and picking up the cue stick, the 8-ball resting where it in fact
was, the cue stick being chalked by the previous player, the billiard cloth having a certain
smoothness, and so on. Whether you think of causal relationships in terms of chains or
in terms of contributing factors, it seems clear that causal relationships are everywhere.

Correlation as a Guide to Causation


Besides spatiotemporal cues, we also tend to use information about correlation between
events to discern causal relations. Correlation is a measure of the association between
two variables. If two variables are correlated, then they are not statistically independent;
the variation in their values shows some trend. If the values of two variables are cor-
related, then we may wonder if one causes the other. For example, imagine you have
always observed that whenever the price of a beer at your local pub is $5, there are fewer
customers than when the price is $3. This is a correlation. You may wonder, based on
this, whether the increased price of the beer decreases demand for beer. This is a causal
claim. You may think this is so even if the timing doesn’t match; maybe customers only
start to trail off a while after the price of beer has gone up.
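
To connect this example with the correlation coefficient from Chapter 5, here is a small Python sketch that computes the Pearson correlation for some invented nightly records of beer price and customer count; the data are made up solely to illustrate a negative correlation.

from statistics import mean

def pearson_r(xs, ys):
    # sample Pearson correlation coefficient of two equal-length lists
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

prices = [3, 3, 4, 4, 5, 5]           # beer price in dollars, night by night
customers = [42, 38, 35, 30, 24, 20]  # customers counted on those nights
print(round(pearson_r(prices, customers), 2))  # -0.95: strong negative correlation

A coefficient near -1 says only that the two variables move in opposite directions; as the rest of this section explains, it cannot by itself tell us whether the price change causes the drop in customers.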
While correlation is a guide to causation, it’s an imperfect guide. Correlation can
exist when causation does not. For one thing, correlation is symmetric: if an event A
correlates with another event B, then B correlates with A as well. But causation isn’t
symmetric. Having cancer correlates with death, and death correlates with having cancer,


but cancer causes death and not the other way around. In other cases, neither correlat-
ing event causes the other, but they share a common cause—a third event that causes
both. Ice-cream consumption and homicide rates are famously correlated, but eating
ice cream is not a cause of murder, nor does committing murder cause ice-cream eat-
ing. Instead, there is some evidence that hot days increase both ice-cream consumption
and homicide rates.
There are also spurious correlations, where two types of events happen to be corre-
lated but are not related in any interesting way, causally or otherwise. For example, from
2000 to 2009, data from the US Dairy Association regarding per capita cheese consump-
tion and data from the Centers for Disease Control regarding the numbers of people who
died by becoming tangled in their bedsheets were highly correlated (see Figure 7.3), but
obviously there’s no causal relationship connecting these variables.
Causal relations between events can also exist even when they don’t seem to be
correlated. Philosopher Nancy Cartwright has suggested the following example in
which counterbalanced causal relationships cancel each other out. Smoking cigarettes
is well established as a cause of heart disease. It’s also the case that adequate exercise
prevents heart disease. If, for whatever reason, smoking is strongly correlated with
exercise, then a well-established cause of heart disease will also be strongly correlated
with its prevention, and smoking and heart disease will not generally correlate. But,
smoking would remain a cause of heart disease (Cartwright, 1989). Here’s another
example. Pregnancy is a cause of thrombosis, which involves blood clots forming inside
blood vessels. Since taking contraceptive pills reduces the chance of pregnancy, one
might hope that taking contraceptive pills indirectly prevents thrombosis. However,
taking contraceptive pills is also a cause of thrombosis. So contraceptive pills prevent
thrombosis by reducing the chance of pregnancy, while also causing thrombosis. If
these opposed influences exactly cancelled each other out, then thrombosis and taking
contraceptive pills would not exhibit a statistical correlation even though the two events
are related causally (Hesslow, 1976).

FIGURE 7.3 Visualization of the correlation between per capita consumption of cheese and
number of people who died from getting tangled in their bedsheets
Reproduced under Creative Commons, <http://tylervigen.com/>.

So, while correlation is a guide to causation, causation doesn’t just boil down to corre-
lation. There must be something more to causation, Hume’s skepticism notwithstanding.

The Nature of Causation: Difference-Making and Physical Processes
Here are two ideas about what causal relationships are, beyond the mere correlation of
types of events. One idea is that causal relationships are, at root, relationships of difference-
making. Put simply, if the occurrence of one event makes a difference to the occurrence
of a second event, then the first event is a cause of the second event. If the billiard ball
had not struck the 8-ball, then the 8-ball wouldn’t have moved. If the billiard ball had
struck the 8-ball in a different place or at a different speed, then the 8-ball would have
moved in a different direction or at a different speed. The billiard ball’s motion made a
difference to the 8-ball’s motion. Thus, the billiard ball’s motion caused the 8-ball to move.
This difference-making relationship is something beyond the mere correlation of events.
A second idea about what causal relationships are, beyond the mere correlation of
events, is that of a physical process. On this view, causation occurs when there is a con-
tinuous physical process connecting a cause to its effect, such as the transfer of energy.
When the billiard ball knocked into the 8-ball, some of its kinetic energy transferred
to the 8-ball, which is why the 8-ball started moving. This is a physical process con-
necting the billiard ball’s motion to the 8-ball’s motion. Thus, the billiard ball’s motion
is a cause of the 8-ball’s motion. This physical process is something beyond the mere
correlation of events.

Box 7.1 Counterfactual Statements and Difference-Making


According to the difference-making account of causation, causes are those factors
that make a difference to whether an effect happens or not. The idea of difference-
making can be made more precise with the help of counterfactual conditionals.
Recall that a conditional is any ‘if/then’ statement. Counterfactual conditionals,
statements like ‘if you had scored three goals, your team would have won the
game’ have the form:
If it were the case that C, then it would be the case that E.

These are called counterfactuals because the antecedent of the conditional is
contrary, or counter, to fact. For material conditionals, truth or falsity is simply
determined by the truth or falsity of the antecedent and consequent. This isn’t so
for counterfactual conditionals. You didn’t score three goals, but that doesn’t
necessarily make it true that you would have won if you had.
On a counterfactual approach to difference-making, to identify causes, you
should check that the following two counterfactual conditionals are true:

(I) If C had occurred, then E would have (probably) occurred.
(II) If C had not occurred, then E would not have (probably) occurred.
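
As a minimal illustration of this two-condition test, here is a sketch in Python (our illustration; the toy model of the billiard example is invented):

def eight_ball_moves(cue_ball_strikes):
    # Toy model: the 8-ball moves exactly when the cue ball strikes it.
    return cue_ball_strikes

def is_difference_maker(effect_model):
    # Check (I): E occurs if C occurs, and (II): E does not occur if C does not.
    return effect_model(True) and not effect_model(False)

print(is_difference_maker(eight_ball_moves))  # True: the strike makes a difference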

Physical process and difference-making accounts of causation may be compatible.
Perhaps both physical processes and the ability to make a difference distinguish causal
relationships from mere correlations. Perhaps, in the billiard ball case, the billiard ball’s
motion makes a difference to the 8-ball’s motion precisely because of the transfer of
kinetic energy from one to the other. However, some philosophers think that one of these,
and not the other, is the right account of causation. Others think that causation might
include numerous different kinds of relationships, including both of these and perhaps
others as well.
Each of these accounts is more useful for thinking about causal reasoning in different
circumstances. For some causal claims, physical processes are difficult to track. Imagine
you want to investigate the causal influence of the average values of homes on the stock
market. How would you even start thinking about energy transfer or other physical
processes based on the average value of homes? In contrast, it’s clear how to think about
changes in average home value and whether those changes do or don’t make a difference
to the stock market. For other causal claims, the idea of difference-making doesn’t apply
very well. The moon orbits around the Earth because of the curvature of space-time. (If you
want, you can just think about this in terms of gravity.) How would you start thinking
about space-time having a different curvature? (Or gravity not existing?) It’s a bit confus-
ing; this seems like an ingredient of reality that can’t be changed. And without that, you
can’t very well assess whether such a change would be a difference-maker.
To sum up, difference-making and physical processes offer two ways to think about cau-
sation that go beyond mere correlation. These might be compatible accounts of causation,
or one or the other might be better, or they each might be right in some circumstances.
Regardless, we think these are both helpful ways to think about the nature of causation.

Necessary and Sufficient Causes


Sometimes a cause is, by itself, enough to bring about an effect. To say ‘electrocution
causes death’ is to say that electrocution is enough to cause death, although there are
other ways of dying. Likewise, your reasoning about the increased price of beer driv-
ing down demand for beer (if accurate) cites one cause that suffices to bring about the
effect. Increased price is one way to ensure fewer customers, but other things—like a
snowstorm or holiday closure—might also lead to fewer customers. These are sufficient
causes: the causal condition is enough to bring about the presumed effect, but that effect
might sometimes occur because of some other cause. If the occurrence of a cause doesn’t
guarantee the occurrence of the effect, then the cause is not a sufficient cause.
Some causes are needed for an effect to occur but may not by themselves guarantee the
effect. To say ‘oxygen causes combustion’ is to say that combustion never occurs without
oxygen present, although oxygen is often present in the air without causing fires. This is
a necessary cause: the causal condition must be present for the effect to occur, but the
cause might sometimes occur without bringing about the effect. If the occurrence of a
cause isn’t required for the occurrence of the effect, then the cause is not a necessary cause.
So, sufficient causes guarantee their effects, while necessary causes are required for their
effects. This should bring to mind the discussion of necessary and sufficient conditions
from Chapter 4. It can be useful to keep in mind the difference between necessary and suf-
ficient causes. Knowledge of sufficient causes empowers us to bring about desired effects.
If we introduce the causes that are sufficient to bring about an effect, we’re guaranteed
that effect will occur. To have healthy teeth, for example, it’s ordinarily sufficient to brush,
floss, and visit the dentist regularly. Knowledge of necessary causes enables us to prevent
some effects from happening. If we remove just one necessary cause, this will eliminate
the effect. For example, spaying or neutering one’s pets prevents unwanted kittens or
puppies, regardless of what other conditions occur. This is because intact reproductive
systems are necessary for reproduction. And abstaining from excessive drinking prevents
hangovers and liver cirrhosis, because significant alcohol consumption is necessary for
both of these health conditions.
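
These definitions can be checked mechanically against records of cases. In the following sketch (our illustration, with invented cases), sufficiency fails if the cause ever occurs without the effect, and necessity fails if the effect ever occurs without the cause:

# Each pair records whether a candidate cause and the effect occurred.
cases = [(True, True), (False, False), (False, True), (True, True)]

def is_sufficient(cases):
    # Sufficient cause: whenever the cause occurs, so does the effect.
    return all(effect for cause, effect in cases if cause)

def is_necessary(cases):
    # Necessary cause: whenever the effect occurs, the cause occurred too.
    return all(cause for cause, effect in cases if effect)

print(is_sufficient(cases))  # True: the cause never occurred without the effect
print(is_necessary(cases))   # False: one effect occurred without the cause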
Although there’s a useful distinction between necessary and sufficient causes, matters
are often not so simple. For many putative necessary conditions, alternative causes can be
found. For example, having sex is usually necessary for sexual reproduction, but it isn’t
always; in vitro fertilization is an alternative. Likewise, for many putative sufficient causes,
exceptions can be found too, when the cause doesn’t bring about the effect as expected.
Raising the price of goods, like beer, does not always decrease demand. Sometimes instead,
demand is sustained by institutions that regulate the market.
The exceptions to sufficient and necessary causal relationships hint at the importance
of background conditions for causal relationships, what we might call the causal back-
ground. The causal background of two events comprises all the other factors that actually
do, or in principle might, causally influence these two events, thereby also potentially
affecting the causal relationship between the two events. Oftentimes causal background
is ignored when causal claims are made, but it’s actually crucial for causal relationships
to occur as expected.
Revisiting a couple of our previous examples shows that causes only count as suffi-
cient or necessary assuming a given causal background. Brushing, flossing, and visiting the
dentist regularly is sufficient to ensure healthy teeth if your dentist is qualified and (say)
you haven’t already had all your teeth removed. And spaying and neutering one’s pets
works to prevent unwanted kittens and puppies because pets having intact reproductive
systems is necessary for new kittens or puppies if in vitro fertilization isn’t employed and
no stray kittens and puppies show up at your house.

Causation and Probability


This discussion of causal background suggests that causal relationships are seldom straight-
forward guarantees. They depend on the causal background and often in subtle ways.
Some causal relationships may even have exceptions within a given causal background.
Consider again the example of fracking causing earthquakes. If this is true in Alberta but
not in Oklahoma, this may well be because of different causal backgrounds in those two
locations, perhaps having to do with geological features. If fracking causes an earthquake
at one site in Alberta but not at another, is this also due to different causal backgrounds
in those two locations, or is it pure chance? Occurrences of a cause do not always lead to
occurrences of its effect, either because causation itself is probabilistic or because causal
backgrounds vary. Here’s another example. There are people who smoked two packs of
cigarettes a day without ever getting cancer, even though smoking does cause cancer.
Is this because smoking causes cancer probabilistically or because some feature of the
causal background prevents some people from getting cancer? This is a matter of debate.

A factor that increases the likelihood of an event occurring despite being neither neces-
sary nor sufficient for the effect is called a contributing cause or partial cause. Contributing
causes are much more common than truly necessary or sufficient causes. For this reason,
it is useful to think about causation probabilistically. Usually, a cause raises the probability
of its effect. This idea can be formalized in terms of conditional probabilities, which we
discussed in Chapter 5. For a cause C and an effect E,

Pr(E|C) > Pr(E|not-C)

The effect is (usually) more likely to occur if the cause occurs than if the cause doesn’t
occur. This idea is deeply related to correlation as a guide to causation.
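
Here is a short sketch (our illustration, with made-up observations) of how Pr(E|C) and Pr(E|not-C) can be estimated from paired records of a cause and an effect:

# Each pair records whether C occurred and whether E occurred.
observations = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]

def pr_effect_given(obs, cause_value):
    effects = [effect for cause, effect in obs if cause == cause_value]
    return sum(effects) / len(effects)

print(pr_effect_given(observations, True))   # Pr(E|C) = 0.75
print(pr_effect_given(observations, False))  # Pr(E|not-C) = 0.25
# 0.75 > 0.25: in these invented data, C raises the probability of E.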
The probabilistic relationship that generally holds between causes and their effects
can also be exploited beyond the observation of correlations. Recall the difference-
making account of causation. Well, if researchers bring about some event and observe
a resulting increase in the frequency of a different event, this is some evidence that
the first causes the second. Even better if this intervention is carried out when extrane-
ous variables are controlled directly or indirectly. This enables the causal background
to be held fixed or to vary randomly, leaving the intervention on the suspected cause
as the only difference between the circumstances in which the suspected effect does
and doesn’t occur. This relates deeply to our discussion of experimental design in
Chapter 2. If you suspect that playing video games causes violent behavior, you might
ask one group of people to play several hours of video games and another group of
people to do something else like read books, and then query them about their moods
and dispositions afterward. If more video game players are agitated or aggressive or
disposed to act violently, this may point at the video games as the culprit—the cause
of violent behavior.
Thinking about causation in terms of conditional probabilities also provides a way to
define the strength of a causal relationship. If Pr(E|C) = 1 and Pr(E|not-C) = 0, then the
cause is both necessary and sufficient for the effect, in any causal background(s) where this
is true. When the cause occurs, so does the effect; when the cause is absent, so is the effect.
For probabilistic causal relationships, the stronger they are, the closer they will be to this
ideal. You can judge the strength of a causal relationship with the following calculation:

Strength = Pr(E|C) − Pr(E|not-C)


Notice that a necessary and sufficient cause will result in the maximum strength value of 1.
If, at the other extreme, the probability of E is the same whether or not C occurs
(holding fixed the causal background), then the occurrence of C is causally irrelevant to
the occurrence of E. For the video gaming and violence example, this would correspond
to the finding that the experimental and control groups do not differ in their levels of
violent behavior. The strength of most causal relationships is somewhere in between the
two extremes of perfect guarantee and irrelevance.
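
The strength calculation itself is simple to carry out; here is a sketch (our illustration, with invented probabilities) covering the two extremes and a middling case:

def causal_strength(pr_e_given_c, pr_e_given_not_c):
    return pr_e_given_c - pr_e_given_not_c

print(causal_strength(1.0, 0.0))    # 1.0: a necessary and sufficient cause
print(causal_strength(0.3, 0.3))    # 0.0: C is causally irrelevant to E
print(causal_strength(0.75, 0.25))  # 0.5: a middling probabilistic cause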
We have already discussed that causation doesn’t just boil down to correlation. The
same goes for probabilistic dependence. Changes in causal backgrounds can interfere with
probabilistic dependence. Beyond this, probabilistic dependence may change in different
causal backgrounds and only hold in some causal backgrounds. Smoking may not raise
the probability of someone getting heart disease if the person also starts a serious exercise
regime at the same time. And smoking does not raise the probability of someone getting
cancer if the person already has cancer. Also, probabilistic dependence, like correlation,
doesn’t distinguish among causes, effects, and events that are correlated but not causally
related.
All of these are reasons why causation isn’t just probabilistic dependence. These are
also reasons for ensuring good experimental design when looking for probabilistic depen-
dence. Intervention is a way of isolating the expected cause, which avoids mistaking an
effect for a cause or events that share a common cause with cause and effect. And hav-
ing a control group is a way of controlling for the influence of the causal background.
These steps enable researchers to determine which events truly make a difference to the
occurrence of other events.

Box 7.2 Simpson’s Paradox


This paradox has nothing to do with Homer Simpson’s toast, ‘To alcohol! The
cause of, and solution to, all of life’s problems’. Rather, it concerns how an
aggregate statistical trend can differ from the individual trends that comprise it.
In the 1970s, the University of California Berkeley was one of the first universities
to be sued for sexual discrimination against women who had applied for admis-
sion to graduate school. In the fall of 1973, 12,763 people applied for admission;
44% of the men were admitted but only 35% of the women. There was a positive
correlation between being a woman and being rejected from Berkeley’s graduate
school in 1973. That is, Pr(rejected|woman) > Pr(rejected|man). For this reason,
suspicions were raised of sexual discrimination. But when admission rates to
individual programs were examined, the correlation between being a woman and
being rejected vanished. It wasn’t the case, for any given program, that
Pr(rejected|woman) > Pr(rejected|man). So women were overall less likely to be
admitted to graduate school at Berkeley, but, for each individual program, women were
not less likely to be admitted. How could that be?
It turns out that during that year, more women applied to competitive programs
with low admission rates, whereas more men applied to less competitive programs
with higher admission rates. The positive correlation between rejection and being
a woman was thus due not to gender itself but to a correlation between gender
and the competitiveness of the program applied to. This is an instance of Simpson’s
paradox, described in 1951 by the British statistician Edward Simpson. Simpson’s
paradox demonstrates the importance of considering the causal background. A
correlation between two types of events can disappear, or be reversed, when
data are grouped in a different way, because different groupings take into account
different factors in the causal background (here: the competitiveness of different
graduate programs).
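
Here is a sketch of the arithmetic behind the paradox (our illustration; the numbers are invented rather than Berkeley's actual figures): each program admits men and women at the same rate, yet the aggregate rates differ because women disproportionately applied to the competitive program.

# (applicants, admitted) for two hypothetical programs
programs = {
    "competitive":   {"men": (100, 20),  "women": (400, 80)},   # both 20%
    "uncompetitive": {"men": (400, 300), "women": (100, 75)},   # both 75%
}

def rate(pairs):
    applied = sum(a for a, _ in pairs)
    admitted = sum(x for _, x in pairs)
    return admitted / applied

for name, prog in programs.items():
    print(name, rate([prog["men"]]), rate([prog["women"]]))  # equal within programs

print(rate([p["men"] for p in programs.values()]))    # 0.64 admitted overall for men
print(rate([p["women"] for p in programs.values()]))  # 0.31 admitted overall for women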

EXERCISES

7.1 Describe Hume’s worry about causal reasoning in your own words. Evaluate the merits
of his concern, taking into account the main points of discussion throughout this section.
7.2 Define correlation, and give three examples of events that you believe are correlated.
7.3 Describe how correlation and probabilistic dependence relate to causation. Give
an example of a causal relationship that results in straightforward correlation, an
example of a causal relationship that in some contexts does not seem to result in cor-
relation, and an example of a correlation that is not due to a causal relationship.
7.4 Describe each of the following scenarios as a causal claim put in terms of difference-
making, and then as a causal claim put in terms of chains of physical processes. You
might need to invent some details about these causal relationships to give a thorough
answer—feel free to get creative.
a. The high tide washing ocean debris up to a certain point on the beach
b. Your pickup basketball team winning its game yesterday
c. Smoking causing lung cancer
7.5 What do you think the advantages are to thinking about causation in terms of differ-
ence-making? How about the disadvantages? What are the advantages and disad-
vantages of thinking about causation in terms of physical processes?
7.6 Describe what each of the following is and how each informs, or is taken into ac-
count, in the investigation of causal relationships: spatiotemporal contiguity, correla-
tion, probabilistic dependence, and causal background.
7.7 Give a novel example of each of the following:
a. A causal relationship that violates spatial contiguity
b. Events at the same place that are not causally related
c. A causal relationship that violates temporal contiguity
d. Events at the same time that are not causally related
e. A causal relationship and causal background in which the cause is not correlated
with the effect
f. Two correlated events that are not cause and effect
7.8 Define proximate causes and distal causes. Then, for each of the following events,
describe a more proximate cause and a more distal cause. You might need to invent some
details about these causal relationships to answer this question; feel free to be creative.
a. The Titanic sinking
b. Ruth leaving a tip after her meal at the restaurant
c. A hurricane occurring
7.9 For each of the following pairs of events, say which is the cause and which is the effect.
Then decide whether the cause is necessary or sufficient to bring about the effect. You
can assume that the causal background is normal, that is, what things are usually like.
a. Buying a lottery ticket, winning the lottery
b. Attending a concert, buying a concert ticket
c. Attending class, getting an A in the course
d. Becoming an attorney, being accepted into law school
e. Passing the bar exam, becoming an attorney

7.10 Write down the formula regarding conditional probabilities that gives the strength
of causal relationships. Then, considering that formula, order the following causal
relationships from strongest to weakest:
a. Brushing your teeth, flossing, and visiting the dentist prevents cavities.
b. Frequent smiling increases well-being.
c. Eating pizza prevents getting the flu.
d. Consuming anabolic steroids improves physical strength.
e. An increase in the minimal wage produces higher attendance at football games.
f. Warmer summers lead to longer periods of drought.
7.11 For each of the causal relationships in 7.10, name one feature of the causal back-
ground that would make the causal relationship stronger and one feature of the
causal background that would make the causal relationship weaker. It might help
to consider the conditional probability relationship that gives the strength of causal
relationships.

7.2 TESTING CAUSAL HYPOTHESES

After reading this section, you should be able to do the following:

• Characterize the relationship between intervention and difference-making
• Identify and describe Mill's five methods
• Discuss the significance of having a control group and random assignment to groups
for causal hypothesis-testing
• Articulate how statistical hypothesis-testing helps to test causal hypotheses

Intervention and Difference-Making


Understanding the causal structure of the world requires more than just sitting back and
seeing what happens. In the first section of this chapter, we discussed how causation is
more than just correlation or probabilistic dependence. Despite Hume’s skepticism, it sure
seems like there must be something else to causation. We have also touched on two can-
didates for what that something else might be: difference-making and physical processes
like energy transfer. But there is a lingering problem related to Hume’s skeptical worry.
It’s relatively simple for scientists to discover correlations. Going beyond correlation to
discover causal relationships is much trickier.


As we discussed in the previous section, for most scientific investigations, looking
directly for physical processes like energy transfer from cause to effect won’t work, even
if this turns out to be involved in all causal relationships. This is because the events
under investigation are usually only distantly related, and scientists often have a lim-
ited understanding of intervening processes. For example, the question of how smoking
causes emphysema is much more difficult to answer than establishing that smoking causes
emphysema. That smoking causes emphysema is a causal relationship inferred from a lot
of scientific evidence about relatively clear biophysical processes. In contrast, the ques-
tion of how this happens requires knowledge of some processes that are more difficult
to understand or not yet known.

The idea of difference-making is much more useful for causal analysis in many fields
of science. Scientists have at least two methods to go beyond statistical information about
correlation to uncover difference-making relationships. One method is to run an experi-
ment—ideally, a perfectly controlled double-blind experiment, as detailed in Chapter 2.
Another method, when experimentation isn’t feasible, is to construct a causal model
and rely on statistical information about variables of interest to make causal inferences.
This section discusses how experiments can be used to uncover causal relationships;
causal modeling will be addressed in the next section. We have covered topics related
to testing causal hypotheses and causal modeling earlier in the book, including Chapter
2’s discussion of experimentation, Chapter 3’s discussion of modeling, and Chapter 6’s
discussion of statistical hypothesis-testing. But let us reconsider these topics now with an
eye to how they relate to causation in particular.
Let’s suppose that you are a farmer and you are interested in finding out whether
using a new fertilizer will increase your crop yield. This involves a causal hypothesis.
How would you test it?
One way would be to try out the fertilizer on your crops this year and see what kind
of a yield you get. But the causal background might vary from last year to this year in
a way that affects crop yield. You wouldn’t be able to distinguish that influence from
the specific effect of the fertilizer on the yield. What you want to know is whether the
fertilizer makes a difference to crop yield.
A better approach would be to divide your field into different plots of equal size. You
can then use the new fertilizer on some of the plots but not on the others. After some
time, go to your field and compare the crop yield from the fertilizer plots to the crop yield
from the other plots. If the plots treated with the new fertilizer produced, on average,
a larger crop yield than the other plots, then the fertilizer made a difference. If the two
groups of plots yielded about the same amount of crop, then the new fertilizer is probably
useless (or no better than your old fertilizer, if that’s the comparison you were studying).
If the fertilizer plots do worse, the fertilizer makes a difference—but the wrong kind!
Let’s redescribe this scenario using concepts from Chapter 2. The farmer has created an
experimental group of plots (to which the new fertilizer is applied) and a control group
of plots (which is handled according to the farmer’s past practices). The application of
fertilizer to plots in the experimental group is an intervention (or treatment). In causal
terms, the farmer is intervening on a suspected cause in order to see whether this makes
a difference to the suspected effect. The suspected cause is the independent variable, and
the suspected effect is the dependent variable.

In testing causal hypotheses like this, sometimes the aim is to establish whether there
is a causal relationship. Other times, the aim is to clarify the nature and strength of a
causal relationship. For example, some drug trials simply seek to establish safety—that a
drug won’t have negative effects. Others seek to establish efficacy—that a drug will have
the expected positive effect. And still others aim to determine whether some drug is
more effective than another drug already on the market, that is, to establish the relative
strength of a causal relationship already identified.
By introducing an external influence on a system, interventions disrupt ordinary func-
tioning in a way that can help to disentangle causal relations. That’s in part why the
suspected cause is called an independent variable—the intervention independently deter-
mines its value, which eliminates the possibility that the suspected cause is affected by
the causal background. Other features of experimental design, such as having a control
group, are used to minimize the chance that changes to the suspected effect are due to
the causal background instead of the intervention. Altogether, these features help scientists
test causal hypotheses, identifying which particular factor is a genuine difference-maker.

Mill’s Methods
The English philosopher and social scientist John Stuart Mill (1806–1873) emphasized
the role of both observation and experimentation in discerning causal relationships (Mill,
1893). Mill identified five methods (see Table 7.1) used in the science of his day—and
before the development of statistics—to evaluate hypotheses concerning cause and effect.
(Scholars have suggested that some of these methods were discussed by scientists and
philosophers well before Mill—for instance by the Persian polymath Avicenna, whom you
encountered in Chapter 1.) Mill’s methods have proven to be a helpful way to think about
how observation and experiments, even nowadays, are used to identify causal relationships.
Let’s start with what Mill called the method of concomitant variations. This method
begins with the observation of correlation: that the values of two variables change in the
same circumstances. Mill noted that when one variable varies together with another, we
may infer a causal connection of some kind between them, although we won’t yet know
just how they are causally related. More specifically, we won’t yet know whether the two
variables are cause and effect, or share a common cause, or are related in some other way.
So, for example, we might see that people who play more video games than average are
also more violent than average. But while these attributes may be causally related, we
cannot tell just from their concomitant variation whether propensity to violence causes
an interest in video games, whether people become more violent by virtue of video game
exposure, or whether there is some indirect relationship between them, like a love of
excitement causing both a propensity to violence and an interest in video games.
The other methods Mill identified help get to the bottom of that question, and they
do so in ways that suggest the importance of intervention and randomization or other
forms of variable control. According to the method of agreement, one begins with cases
that agree in effect, and then scrutinizes them to learn what possible cause they have in
common—some way in which they agree. If in all instances when an effect occurs there
is one prior event or condition common to all of those cases, then one may infer that the
event is the cause of the effect. To use this method, one might let the causal background
vary while keeping the suspected cause the same. If the suspected effect still occurs in
those different instances, this is evidence that the suspected cause is indeed responsible
for the effect. If the causal background is varied sufficiently, this rules out a common
cause or other circuitous causal relationship.
The opposite approach is the method of difference. It begins with cases that differ in
effect, and then scrutinizes them to learn whether there’s some other respect in which
they differ. If in one case an effect is observed and in another case that effect is not
observed, and the only difference is the presence of a single event or condition in the first
case that is absent in the second case, then one may infer that this event is the cause of
the effect. An instance in which the suspected effect occurs is compared to an instance in
which the suspected effect does not occur. If the suspected cause is the only factor present
in the former but not the latter, this suggests the suspected causal relationship obtains.

The method of difference can also be employed when agreement has been discovered;
this is called the joint method of agreement and difference. We can consider cases where
the suspected effect occurs and see what they have in common and consider also cases
where the suspected effect does not occur and see what those have in common. If the
suspected cause is the only difference between the two sets of cases, then this affirms
a causal relationship between the suspected cause and the suspected effect. Imagine
interviewing people with a record of violence and people without such a record. If the
only distinguishing feature we find is that those in the former group play a lot of video
games and those in the latter group do not, this result would indicate a causal connec-
tion between video games and violence. This joint method of agreement and difference
provides more evidence of the causal relationship than either the method of agreement
or the method of difference by itself.
None of these methods—the method of agreement, the method of difference, or the
joint method of agreement and difference—eliminates the possibility that the suspected
effect is instead the cause. From the investigation described above, we have established a
causal relationship between video games and violence, but we can’t know whether video
games cause violence or the other way around. To resolve this, we can perform an interven-
tion on an experimental group with the joint method, with the added element of exter-
nal influence on the independent variable. If we randomly choose groups of participants
(thereby eliminating any pre-existing differences between people in the groups) and ask
one group to play a lot of video games, then we’ve eliminated the possibility of violent
tendencies causing video-game-playing. In observational studies (see Chapter 2), the joint
method of agreement and difference can be supplemented not with an intervention but by
using other forms of causal analysis. So, for example, we might ask our interview subjects
not just how much gaming they do but also for how many years they’ve played video
games. For each person, we can compare that with when his or her violent behavior began.
Finally, the method of residues is a way to apportion causal responsibility. With this
method, one traces all other effects to their causes and looks for the causal variable that
remains. If scientists have learned that some causal factors bring about certain effects, and
some of those causes present by themselves bring about some but not all of the effects,
then the missing cause(s) should be taken to be responsible for the absent effect(s).
This is a way of taking into account the causal background in order to focus on some
specific cause and determine the difference it makes. Imagine we’ve learned that obesity
and smoking cause diabetes, heart disease, and lung cancer. From our knowledge that
obesity causes diabetes and heart disease but not lung cancer, we can infer that smoking
causes lung cancer. A limitation of this form of causal reasoning is that it assumes causal
relationships are simpler than they often are. What if, for example, the combination of
obesity and smoking together causes lung cancer, but neither does by itself? The method
of residues can’t evaluate this possibility.
Consideration of Mill’s methods is in part of interest because causal hypothesis-testing
in today’s science inherits some of the features of these methods. These include a focus on
similarities among like situations, differences among unlike situations, and causal appor-
tioning. Mill’s methods also illustrate the difficulty of establishing the direction of causa-
tion, the importance of intervention, and the limitations of apportioning causal influence.
With Mill’s methods in the background, let’s now move on to these and other topics
regarding causal hypothesis-testing.

TABLE 7.1 Mill’s methods

Method Procedure

1. Method of agreement Start with cases that agree in the effect, and find a possible
cause they have in common

2. Method of difference Start with cases that differ in the effect, and find a possible
cause on which they differ

3. Joint method of agreement Compare cases that agree in the effect to cases that agree
and difference in not having the effect, and find if there is one possible
cause that cases in the former group have in common but
cases in the latter group do not

4. Method of residues Trace all known causes to their effects, and find a possible
cause and possible effect that are left over

5. Method of concomitant Find a possible cause that varies (directly or inversely) with
variations the effect
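
As a rough computational gloss on the joint method of agreement and difference (our sketch; the factors and cases are invented), one can search for a factor that is present in every case where the effect occurs and absent from every case where it does not:

# Each case lists the factors present and whether the effect occurred.
cases = [
    ({"video_games", "coffee"}, True),
    ({"video_games", "books"}, True),
    ({"books", "coffee"}, False),
    ({"books"}, False),
]

all_factors = set().union(*(factors for factors, _ in cases))

candidates = [
    f for f in all_factors
    if all(f in factors for factors, effect in cases if effect)           # agreement
    and all(f not in factors for factors, effect in cases if not effect)  # difference
]
print(candidates)  # ['video_games']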

Testing Causal Hypotheses


In this book, we’ve repeatedly returned to the simple picture of formulating a hypoth-
esis, generating expectations, comparing those expectations to observations, and reason-
ing from the results of that comparison. Let’s consider a few distinctive features of this
process for causal hypothesis-testing.
Causal hypotheses can posit the existence of a causal relationship, the direction of a causal
relationship, or even the strength of a causal relationship. Two of the founders of microbiol-
ogy, Louis Pasteur (1822–1895) and Robert Koch (1843–1910), wanted to better understand
the causes of diseases like tuberculosis and cholera. They advanced a causal hypothesis: that
some diseases are caused by microorganisms like bacteria. This is called the germ theory of
disease. It posits the existence of a causal relationship between diseases and microorgan-
isms and the direction of the relationship: exposure to certain microorganisms leads to (or
increases the chances of) developing certain diseases. Recall John Snow’s observational study
of cholera in Chapter 2 and, from Chapter 4, Ignaz Semmelweis’s discovery that patients were
being infected by ‘cadaverous particles’ from doctors who didn’t wash their hands thoroughly
enough after performing autopsies. Both of these were steps toward the germ theory of disease.
Compare Pasteur’s and Koch’s causal hypothesis about some microorganisms causing
disease to the example of a farmer testing how a new fertilizer influences crop yield. In
the latter case, the farmer’s hypothesis not only posits a causal relationship and its direc-
tion but also something about the strength of the relationship. In particular, the farmer
is interested to know whether the fertilizer increases crop yield by at least enough to
justify the additional cost of purchasing and applying the fertilizer.
Hypotheses about causal relationships, their direction, and their strength are used to
develop specific expectations regarding how dependent variables will change in response
to changes to independent variables. Based on the germ theory of disease, Koch expected
that healthy mice infected with the proper bacteria would develop anthrax, an infec-
tious disease. The farmer was evaluating the expectation that the fertilizer plots of land
produced enough additional crop yield to offset the increased costs of supplies or labor.
But expectations based on causal hypotheses inherit all the complications of causal
reasoning in general. Should Koch expect all treated mice to develop anthrax? We have
seen that many causes aren’t sufficient by themselves but only increase the probability of
their effects. So, if not every mouse, how many should Koch expect to develop anthrax?
And, in what conditions should we expect this to happen? Other features of the causal
background might interfere with this causal relationship, even if the germ theory of
disease is true. We don’t expect the application of fertilizer to increase crop yield if the
crops aren’t watered, after all. These are a few of the complications in determining what
expectations we should generate from a causal hypothesis.
These complications with causal reasoning make some features of experiments and
observational studies particularly significant. To start, we have seen that control groups
provide a way to eliminate differences in the causal background, keeping them from
becoming confounding variables. In Koch’s experiments, he inoculated some mice with
blood taken from the spleens of farm animals that had died of the anthrax disease. He
inoculated other mice with blood from the spleens of healthy animals. The only (rel-
evant) difference between these groups of mice was thus their exposure to blood from
an animal that died from anthrax (Ullman, 2007). Random assignment to groups is also
important to control variation in the causal background. Our farmer’s investigation of the
new fertilizer won’t be very illuminating if all the fertilizer plots are in an arid, low-yield
part of the farm and the control plots aren’t.
Statistical hypothesis-testing is also crucial for testing causal hypotheses. As we saw in
Chapter 6, statistical hypothesis-testing involves the development of specific expectations
regarding the probability distribution of a random variable's values on the assumption the
null hypothesis is true. This is important for hypotheses that predict probabilistic causal
influence. Causal hypotheses play the part of the alternative hypothesis in statistical
hypothesis-testing. The null hypothesis is, usually, simply that the posited cause does not
actually influence the phenomenon of interest. So, for our farmer, the null hypothesis
is that the fertilizer is causally inefficacious: the range of crop yield from the fertilized
plots of land will only differ from the range of crop yield from the other plots by chance
variation. Taking into account the number of plots of land and the average crop yield for
the plots in the control group, the farmer can determine how high a crop yield from the
fertilized plots would be sufficiently unlikely, given the null hypothesis, to warrant
rejecting the null hypothesis. And, in turn, to warrant buying this new fertilizer.
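
In practice, this comparison is often made with a standard significance test. Here is a sketch (our illustration; the yields are invented, and the two-sample t-test is a generic choice rather than anything the example itself prescribes):

# Comparing fertilized and control plot yields with a two-sample t-test.
from scipy import stats

fertilized = [52, 55, 49, 58, 56, 54, 57, 53]  # invented yields per plot
control = [47, 50, 45, 49, 48, 46, 51, 44]

t_stat, p_value = stats.ttest_ind(fertilized, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A p-value below the chosen significance level (say, 0.05) warrants
# rejecting the null hypothesis that the fertilizer makes no difference.
if p_value < 0.05:
    print("Reject the null hypothesis.")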


Many scientific hypotheses are concerned with causal relationships. Knowledge about
causes and effects is, as we have seen, key to bringing about desirable outcomes and
preventing undesirable outcomes. Productive farming practices are worth adopting, and
wasting the farm’s money is worth avoiding. Given the range of negative health effects from
smoking, perhaps that’s an activity we should each work to avoid or minimize. Coming to
grips with the effects of fracking is crucial to deciding whether to use this form of energy
capture and, if so, how the industry should be regulated. The techniques of experimental
and control groups, randomized assignment to groups, and intervention, as well as statistical
hypothesis-testing, are all motivated in large part by their ability to discern causal relation-
ships from mere correlation, cause from effect, and causal influence from chance variation.

EXERCISES

7.12 Describe how you might apply each of Mill’s methods to test the causal hypothesis
that not getting enough sleep makes you (you in particular) hungrier the next day.
7.13 Describe the ideal experiment, looking back to Chapter 2 if helpful. You should
reference experimental and control groups, random assignment, independent and
dependent variables, extraneous variables, and intervention. Then, articulate the
significance of each of the features of the ideal experiment for testing causal hypoth-
eses in particular. Your response should discuss causal background, distinguishing
causes and effects, common causes, and spurious correlation.
7.14
a. Describe how statistical hypothesis-testing can be used to investigate a causal
hypothesis—say, that the death penalty prevents crime. (Look back to Chapter
6 if this is helpful). Make sure you specify the null hypothesis and describe, in
general, what is needed in order to reject it.
b. Write out the formula for determining the strength of a probabilistic causal
relationship (from 7.1). What is the relationship between the two sides of this
equation if C does not influence E, that is, if C is not a cause of E?
c. Considering your answers to (a) and (b), answer the following questions. What
would the process of statistical hypothesis-testing show if C is not a cause of E
(and there is no type I or type II error)? If one causal relationship (CR1) is proba-
bilistically stronger than another causal relationship (CR2), is there a greater
chance of a type I error with CR1 or with CR2? How about a type II error?
7.15 Headlines in popular media often misrepresent the scientific studies they discuss.
One way this happens is that many headlines suggest a causal relationship where
the evidence provided by the scientific study only supports a correlation. Consider
the following headlines. For each, (a) identify whether it makes either a causal or a
correlational claim; (b) rewrite any headline using causal language so that it reads
as a correlational study; and (c) suggest a possible explanation for each correlation
that is not the posited or suspected causal relationship.
1. ‘Lack of Sleep May Shrink Your Brain’, CNN, September 2014
2. ‘To Spoon or Not to Spoon? After-Sex Affection Boosts Sexual and Relationship
Satisfaction’, Science of Relationships, May 2014
3. ‘Daytime TV (Soap Operas) Tied to Poorer Mental Scores in Elderly’, Reuters,
March 2006
4. ‘Study Suggests Attending Religious Services Sharply Cuts Risk of Death’, Medi-
cal Xpress, November 2008
5. ‘Facebook Users Get Worse Grades in College’, Live Science, April 2009
6. ‘Texting Improves Language Skill’, BBC, February 2009
7. ‘Study Suggests Southern Slavery Turns White People into Republicans 150
Years Later’, Think Progress, September 2013
8. ‘Dogs Walked by Men Are More Aggressive’, NBC News, November 2011
9. ‘Want a Higher GPA? Go to a Private College’, New York Times, April 2010
10. ‘Sexism Pays: Men Who Hold Traditional Views of Women Earn More Than
Men Who Don’t, Study Shows’, Science Daily, September 2008

7.16 Choose three of the headlines listed in Exercise 7.15, and then, for each, look up
the text of the popular media report. Write a paragraph evaluating the strength of
the evidence cited in the media report supporting the claim (causal or correlational)
in the headline. Try to note both positive features and negative features.
7.17 For each of the following claims, identify three possible confounding variables in
the causal background that may impact the relationship. Say whether each possible
confounding variable would be an alternative cause, contributing cause, common
cause of both stated cause and stated effect, or something else.
a. Watching pornography leads to committing sex crimes
b. Eating pizza promotes immunity to flu
c. Ice-cream consumption raises the probability of drowning deaths
d. Being an American scientist raises the chance of having a scientific paper
published
e. Volcanic eruptions cause tsunamis
7.18 Describe an experiment you could use to determine whether smoking marijuana is
a cause of schizophrenia. Address how extraneous variables are to be controlled.
Finally, identify the expectations given the hypothesis, that is, what finding would
enable you to conclude that smoking marijuana is a cause of schizophrenia.
7.19 Psychologists have long studied the causes of altruistic behavior. In a classic psycho-
logical study by Darley and Batson (1973), participants walked down an alley on
their way to another experiment. Some were told they were late for the experiment,
others were told they were on time. Each passed by a confederate slumped in a
corner. Darley and Batson found that time pressure decreased helping behavior.
Describe the specific causal hypothesis and the features the experimental design
must have had to adequately test this hypothesis.
7.20 Economists have taken a different approach to studying altruistic behavior. They
have investigated it using experimental paradigms, such as the ultimatum game
encountered in Chapter 2—a task in which one player is given a real sum of money
and decides how to split that money with a partner, then the partner can decide
only whether to accept or reject the offer. The finding was that people offered fairer
divisions than self-interest predicts, and they rejected divisions deemed unfair even
though this results in no money won. The researchers concluded that people sacrifice
some self-interest to promote fairness. What are some important differences and
similarities between this approach and the experiment described in 7.19? Evaluate
each approach for how well it can investigate the causes of altruism.

7.3 CAUSAL MODELING

After reading this section, you should be able to do the following:

• Describe the advantages of causal modeling and when this approach is called for
• Define causal Bayes nets and say what they are good for
• Specify the kinds of assumptions embedded in causal Bayes nets and discuss their
significance and limitations

Modeling to Search for Causal Relationships


Scientists have developed a variety of sophisticated modeling approaches based on math-
ematical results and statistical methods that can help investigate causal relationships. Most
causal modeling approaches are closely associated with the difference-making account
of causation, which emphasizes the importance of interventions to find out about causal
relationships. Recall that, according to the difference-making account, if an event C causes
another event E, then intervening on C will change the value of E. In contrast, if C and
E are merely correlated, then intervening on C won’t change the value of E.
Causal modeling approaches can be used when experimentation or observation studies
are not possible, or in combination with them. With causal modeling, scientists can learn
about causation from data and can derive precise expectations about the outcomes of
experiments from causal hypotheses. The basic idea is to use probabilities to infer causal
relationships. Causal modelers use patterns of probabilistic conditional independence
within a set of variables to draw inferences about causal relationships among those vari-
ables. Causal modeling requires some specialized assumptions in order to relate probabi-
listic dependencies to causal relationships. When these assumptions hold for a data set,
one can reliably learn about causal relationships from the causal model.
There are several different approaches to causal modeling. Galton’s method of regres-
sion analysis, introduced in Chapter 5, is one of the oldest causal modeling procedures.
The basic idea of regression analysis is to estimate the correlation of two variables con-
ditional on all other measured variables. You can think of this as drawing a best-fitting
line for the relationship in the values of two variables based on data on a scatterplot.
When a causal relationship between the two variables is suspected, this can be used to
estimate how the causal variable affects the other variable. This cannot tell us whether
each variable is the cause or the effect or if they share a common cause. But when there
is independent reason to suspect one of these causal relationships, regression analysis can
be used to estimate the nature of that relationship.
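
Here is a minimal regression sketch (our illustration, with invented data), fitting a best-fitting line that estimates how the suspected effect changes with the suspected cause:

import numpy as np

cause = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # values of the suspected cause
effect = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # values of the suspected effect

slope, intercept = np.polyfit(cause, effect, deg=1)
print(f"effect is approximately {slope:.2f} * cause + {intercept:.2f}")
# The slope estimates the change in the effect per unit change in the cause,
# if the suspected causal direction is right; the fit alone cannot distinguish
# cause from effect or rule out a common cause.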
The building blocks of other causal modeling approaches are also, like in Galton’s
regression analysis, statistical correlations between variables. In graphical representations
of causal models, nodes in the graph stand for variables of interest, and arrows connecting
different nodes stand for direct causal relationships between variables. See Figure 7.4 for a
generic example. A causal graph of a system enables scientists to make reliable predictions
about how the value of a variable would change, should the value of another variable
change. In other words, scientists can use causal graphs to figure out what difference an
event would make to the occurrence of another event.

Suppose you are interested in the relationship between three variables: vaccination
(V), immunity (I), and autism (A). All three variables have two possible values: true and
false, or yes and no. You already know that vaccination causes immunity. But—worried
about what you’ve heard about potential side effects of vaccination—you make three
hypotheses about the dependency between autism and vaccination.
The first hypothesis is that vaccination causes immunity, which in turn causes autism.
This structure can be graphically represented as a straightforward chain: V → I → A. The
second hypothesis is that vaccination is a common cause of immunity and autism. Using
arrows pointing from a cause to its effect, you can graphically represent this structure
as: I ← V → A. If this is right, then vaccination is a way to become immune to various

Potochnik, Angela, et al. Recipes for Science : An Introduction to Scientific Methods and Reasoning, Taylor & Francis Group, 2018. ProQuest Ebook
Central, http://ebookcentral.proquest.com/lib/purdue/detail.action?docID=5584122.
Created from purdue on 2021-08-26 19:08:17.
264 Causal Reasoning

FIGURE 7.4 Generic causal graph with nodes representing variables of interest and arrows
representing direct causal relationships

diseases, but it also has some chance of inducing autism. Your third hypothesis is that
autism isn’t causally related to either vaccination or immunity. To evaluate these hypoth-
eses, scientists would collect data about the values of the three variables of interest in
different patients. What should you expect to find if each of these hypotheses were true?
Consider the first hypothesized causal structure: V → I → A. This hypothesis states
that immunity causally depends on vaccination and that autism depends on immunity.
If this is right, then an intervention on immunity (say, due to decreases in the levels of
antibody that protect from acquiring a disease) will decrease the chance of autism but will
not affect whether one was vaccinated. This intervention would set the variable immunity
to the value false and disrupt causal links from vaccination to immunity. And, if this first
hypothesis were true, then intervening on immunity in this way would interfere with
any correlation between vaccination and autism, making the variables vaccination and
autism statistically independent, or uncorrelated. Put another way, on this hypothesis, if
you consider everyone, vaccination would be correlated with autism, but if you consider
only patients who are immune to a disease, then having autism would be uncorrelated
with having been vaccinated. Vaccination would have no effect on autism
beyond its influence on immunity.
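This expectation can be checked by simulation. The sketch below generates data from the
chain V → I → A using invented probabilities (the hypothesis itself, recall, turns out to be
false; the numbers are purely illustrative) and confirms the predicted pattern: correlation
overall, no correlation among the immune.

    # Simulate the chain V -> I -> A and check the predicted (in)dependencies.
    # All probabilities here are invented for illustration only.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    v = rng.random(n) < 0.5                      # vaccinated?
    i = rng.random(n) < np.where(v, 0.9, 0.1)    # immunity depends on V
    a = rng.random(n) < np.where(i, 0.05, 0.01)  # autism variable depends only on I

    print(np.corrcoef(v, a)[0, 1])        # clearly nonzero: V and A correlate overall
    print(np.corrcoef(v[i], a[i])[0, 1])  # roughly 0: given immunity, V and A are independent

The same style of simulation, with the arrows rearranged, would let you check the
expectations for the common cause hypothesis as well.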
Consider the second hypothesis, that vaccination is a common cause of both immunity
and autism: I ← V → A. What should the data look like to support the conjecture? You
should find two correlations, one between the variables vaccination and immunity and
another between vaccination and autism. Generally, if you find a correlation between
two variables, then this dependence may result from one variable causing the other, but
it is also possible that there is some third variable, a common cause, that causes the val-
ues of both variables and explains their correlation. Given the common cause structure
associated with our hypothesis, you should also find that altering the value of the vari-
able autism will not affect the value of immunity and that altering the value of immunity
will not affect the value of autism. Holding fixed the value of vaccination makes autism
probabilistically independent from, or uncorrelated with, immunity. So, if this hypothesis
is true, then examining only people who are vaccinated (or aren’t vaccinated) would result
in no correlation between immunity and autism.
In actuality, there is no evidence that vaccination of any kind causes autism. The first
two hypotheses are false. Before delving into that, let's first consider where the practice of vac-
cination came from. Vaccination has been practiced for three centuries. In the 1700s, there
was some recognition that survivors of certain infectious diseases would become immune
to future exposure, and researchers began a primitive form of inoculation by infecting
themselves with a disease to gain immunity. The risk of sickness and death with these
primitive forms of inoculation was high. Then, the English physician and scientist Edward
Jenner (1749–1823) discovered that if he infected people with the cowpox virus, related
to smallpox but less dangerous, they had far lower mortality rates from smallpox. Vaccine
research advanced significantly again almost a century later, when Louis Pasteur identified
bacteria as a major cause behind several diseases; this knowledge led to the germ theory of
disease we discussed earlier in this chapter, and to the first synthetically made vaccines.
The 1900s saw the introduction of several successful vaccines, including those against
diphtheria, measles, mumps, and rubella. As vaccines became more common, their causal
mechanism became well understood. Basically, vaccines train the immune system to iden-
tify and combat pathogens, either viruses or bacteria. Certain molecules from the pathogen,
but not necessarily the whole pathogen, must be introduced into the body to trigger an
immune response. So, many modern vaccines have no chance of making you sick from
the pathogen, since they don’t even contain the full viruses or bacteria.
But despite increased understanding of how vaccines work and drastically increased
vaccine safety, misconceptions remain. The myth that vaccines cause autism originated
with a study published in a prestigious medical journal in 1998. The study linked the
measles, mumps, and rubella (MMR) vaccine to increasing rates of autism in British children. This
was a correlation. Several other studies were independently conducted to test whether
this correlation was due to a causal relationship; none found a causal relationship between
vaccination and autism. In fact, several studies couldn’t even replicate the correlation
between vaccination and autism. In the meantime, several other researchers pointed out
that there were several methodological errors in the original study, that the authors had
financial conflicts of interest, and that the study was ethically problematic. The article
was eventually retracted from the journal. While the causes of autism are unclear, it has
been definitively shown that vaccination is not among them.
From this, we can conclude that any data you gathered would not confirm either your
first or second hypothesis about a causal pathway from vaccination to autism. Vaccination
and immunity are strongly correlated with each other; the reason why is that vaccina-
tion is one of the major causes of immunity. But vaccinations have undergone extremely
extensive safety testing with huge groups of test subjects, and none has shown a cor-
relation with autism. And scientists now believe there are physiological signs of autism
even in utero, well before exposure to vaccination. Neither vaccination nor immunity is
correlated with autism; nor does vaccination cause autism.


Despite the scientific knowledge already achieved on this issue, the belief in a con-
nection between vaccination and autism persists. This stems in part from the fact that the
initial symptoms of autism can occur in early childhood, around the same time that many
vaccinations customarily occur. So there’s a temporal connection between vaccination and
autism diagnosis, and we have emphasized that spatiotemporal connection is a guide to
causation. But in this case, the temporal connection is simply because autism diagnosis
and vaccination share a common cause: being a young child.
Now, back to causal models. The main strengths of causal modeling are transparency
and flexibility. In constructing causal models, scientists are forced to be explicit about their
assumptions regarding the causal relationships in question. An example is the different sets
of expectations stemming from the three different hypotheses for how vaccination may
causally relate to autism. Once the assumptions about causal relationships are explicit, a
causal model can simply represent dependencies between different variables in the model,
and precise expectations can be formed about what would happen if you changed the
value of a variable in the model. Patterns of statistical information can be used to test
these expectations and, thus, the causal hypotheses behind them. Using causal models,
scientists can make a fine-grained evaluation of whether correlational evidence supports
a causal hypothesis; they can identify what manipulations to perform when conducting
an experiment to assess a causal connection; and they can better recognize what factors
in the causal background must be controlled.
Causal models are used across many different fields of sciences, from epidemiology to
economics. While there are several different approaches to causal modeling, the leading
approach to causal learning and reasoning is the causal Bayes nets approach. The rest of
the chapter will survey this approach.

Causal Bayes Nets


Causal Bayes networks, or ‘nets’, are a kind of graphical causal model. They are made
up of two components: a graph representing the variables in the system of interest with
directed links representing causal relationships, as in Figure 7.4, together with a set of
conditional probabilities specifying the strength of each causal relationship.
The purpose of causal Bayes nets is to provide a compact visual representation of a
system’s causal relationships and their strengths. This purpose is accomplished using joint
probability distributions, that is, probability distributions over all the combinations of
values that the variables can jointly take. These graphical models are called Bayes
nets because they use the rule of Bayesian conditioning to compute posterior distribu-
tions, updating the probabilities in the network whenever new information is acquired.
(See Chapter 6 for a discussion of Bayesian statistics.)
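In general, a causal Bayes net encodes the joint distribution compactly: each variable
contributes one factor, namely its probability conditional on its direct causes (its
'parents' in the graph). For variables X1, …, Xn:

Pr(X1, …, Xn) = Pr(X1 | parents of X1) × … × Pr(Xn | parents of Xn)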
Suppose there are only two events that could cause your Facebook page to be shut
down: either you post material that infringes copyright or a friend reports you. Also sup-
pose that posting material that infringes copyright affects the chance that your friend
reports you: when you post a new song on your page, your friend will enjoy it and thus
won’t report you. This situation can be modeled with a Bayes net like the one seen in
Figure 7.5. This doesn't yet show the conditional probabilities that specify the strength
of each causal relationship.

FIGURE 7.5 Causal graph of the relationships between posting copyrighted material on your
Facebook page, a friend reporting you, and your Facebook page being shut down. Copyright
infringement negatively influences ('–') being reported, while infringement and being
reported each positively influence ('+') the page being shut down.

The strength of those relationships matters; this decides
whether on balance posting copyrighted material increases or decreases the chance of
your  Facebook page being shut down. Suppose that all three variables have two pos-
sible values, true and false, and that their conditional probability relationships are given
in Table 7.2.
Causal Bayes nets like this one can be used to make probabilistic and causal inferences
and to learn about causal relationships. Because they are complete models for specified
variables and their relationships, they can be used to answer questions about the prob-
ability that a certain variable takes on a specific value. For example, the causal Bayes net
model outlined in Figure 7.5 and Table 7.2 can be used to determine the probability
that you’ve been reported, given that your Facebook page has been shut down, but you
posted no copyrighted material. Another use is that, when the values of certain variables
are observed, the network can infer the values of other variables by computing their
posterior probabilities using Bayesian conditioning.
Bayes nets can also be used to estimate causal relationships that are related to statisti-
cal features of our observations—for example, the negative correlation between copy-
right infringement and being reported by a friend. And they can be used to predict
the effects that potential interventions on some variables would have on the values of
other variables—for example, to predict what would happen if you posted copyrighted
material on your page.

TABLE 7.2 Conditional probabilities for the causal graph in Figure 7.5

Pr(Copyright infringement)

Copyright infringement = T    Copyright infringement = F
0.20                          0.80

Pr(Reported | Copyright infringement)

Copyright infringement    Reported = T    Reported = F
T                         0.01            0.99
F                         0.40            0.60

Pr(Page shut down | Copyright infringement, Reported)

Copyright infringement    Reported    Page shut down = T
T                         T           0.99
T                         F           0.80
F                         T           0.90
F                         F           0.00
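To see how such a computation goes, here is a minimal Python sketch that enumerates the
joint distribution defined by Figure 7.5 and Table 7.2 and answers the query mentioned
above, Pr(Reported = T | Page shut down = T, Copyright infringement = F):

    # Inference by enumeration in the Figure 7.5 net, using Table 7.2:
    # Pr(C, R, S) = Pr(C) * Pr(R | C) * Pr(S | C, R).
    p_c = {True: 0.20, False: 0.80}                       # Pr(Copyright infringement)
    p_r_true = {True: 0.01, False: 0.40}                  # Pr(Reported = T | C)
    p_s_true = {(True, True): 0.99, (True, False): 0.80,  # Pr(Shut down = T | C, R)
                (False, True): 0.90, (False, False): 0.00}

    def joint(c, r, s):
        pr = p_r_true[c] if r else 1 - p_r_true[c]
        ps = p_s_true[(c, r)] if s else 1 - p_s_true[(c, r)]
        return p_c[c] * pr * ps

    # Pr(R = T | S = T, C = F): normalize over the consistent joint entries.
    numerator = joint(False, True, True)
    denominator = numerator + joint(False, False, True)
    print(numerator / denominator)  # 1.0: with no infringement, only a report shuts the page down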

To better understand how scientists use Bayes nets to learn about causal relationships,
consider this scenario:

Suppose that a patient has been suffering from shortness of breath (called dyspnoea)
and visits the doctor, worried that he has lung cancer. The doctor knows that other
diseases, such as tuberculosis and bronchitis, are possible causes of this symptom, as
well as lung cancer. She also knows that other relevant information includes whether
or not the patient is a smoker (increasing the chances of cancer and bronchitis) and
what sort of air pollution he has been exposed to. A positive x-ray would indicate
either TB or lung cancer. (Korb & Nicholson 2010, p. 30 ff.)

There’s plenty of causal information here, but how that information relates to the case
at hand is tricky to figure out. Constructing and using a causal Bayes net is one effective
way to assist the doctor in making a medical diagnosis. To construct such a model, the
first thing to do is to identify the relevant variables. As in the previous example, each
variable will be represented with a node. There's no uniquely right way of setting up the
causal Bayes net, but it helps to make choices about what nodes to include that enable us
to represent the relevant, known aspects of the situation with enough detail to perform
the desired reasoning. One possible modeling choice is shown in Table 7.3. In this case,
the variables include dyspnoea, smoker, pollution exposure, x-ray result, and lung cancer.
The second step of constructing a causal Bayes net is to specify the causal structure of
the system by drawing arrows between the nodes. Smoking and living in a polluted area
are two factors affecting the patient’s chance of having lung cancer. In turn, having lung
cancer is a factor affecting the result of an x-ray, and the patient’s difficulty in breathing,
that is, the patient’s suffering from dyspnoea. If this is the structure of the situation, then
we may draw the graph pictured in Figure 7.6.
Several forms of causal relationships can be represented in a causal Bayes net. A cause
can increase or decrease the probability of some variable taking on a given value, causes
can influence themselves, or there can be a feedback loop where two or more variables
influence one another in a cyclical way. Most of the time, however, Bayes nets are assumed
to be directed acyclic graphs (sometimes abbreviated DAG), which means that all the
causal relationships are taken to go in one direction without feedback loops. This means
TABLE 7.3 Possible values for variables in dyspnoea case

Variable Values

Dyspnoea {T, F}
Smoker {T, F}
Pollution {low, high}
X-ray {positive, negative}
Lung cancer {T, F}


FIGURE 7.6 Causal graph for the dyspnoea case: arrows run from Smoker and Pollution to
Lung cancer, and from Lung cancer to X-ray and Dyspnoea

TABLE 7.4 Conditional probabilities of developing lung cancer given level of pollution
exposure and whether or not a person smokes

Pollution Smoker Pr(Lung cancer=T|Pollution, Smoker)

High T 0.050
High F 0.020
Low T 0.030
Low F 0.001

that earlier causes are assumed not to also be later effects. You can see from Figure 7.6
that our graph satisfies this assumption; no arrows form circles like X → Y → Z → X,
and no arrow is bidirectional like X ↔ Y.
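Whether a proposed graph really satisfies the assumption can be checked mechanically.
Here is a small sketch using Kahn's algorithm: a directed graph is acyclic exactly when
every node can be removed by repeatedly deleting nodes with no remaining incoming arrows.

    # Acyclicity check via Kahn's algorithm, applied to the Figure 7.6 graph.
    def is_acyclic(nodes, edges):
        indegree = {n: 0 for n in nodes}
        for _, child in edges:
            indegree[child] += 1
        frontier = [n for n in nodes if indegree[n] == 0]
        removed = 0
        while frontier:
            node = frontier.pop()
            removed += 1
            for parent, child in edges:
                if parent == node:
                    indegree[child] -= 1
                    if indegree[child] == 0:
                        frontier.append(child)
        return removed == len(nodes)  # any leftover nodes would all lie on a cycle

    nodes = ["Pollution", "Smoker", "Lung cancer", "X-ray", "Dyspnoea"]
    edges = [("Pollution", "Lung cancer"), ("Smoker", "Lung cancer"),
             ("Lung cancer", "X-ray"), ("Lung cancer", "Dyspnoea")]
    print(is_acyclic(nodes, edges))  # True: the dyspnoea graph is a DAG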
Having specified the nodes and their structure, the strength of the relationships
between connected nodes must now be specified. To do so, one needs to define a prob-
ability distribution for each node, conditional on any node(s) that causally influence it.
In the dyspnoea case, statistical information from medical studies or observed frequen-
cies can be used to specify these probability distributions. For variables for which no
such information is available, initial probabilities can be based on an intuition, guess, or
estimation. These are exactly like the prior probabilities from the discussion of Bayesian
statistics in Chapter 6. It turns out that Bayes nets can be accurate in the long run even
if they start off with imprecise or inaccurate initial probabilities.
Let’s take a look at the variable lung cancer in Figure 7.6. The variables that causally
influence it are pollution and smoker, each of which can take two possible values for a
total of four combinations of values: {<high, T>; <high, F>; <low, T>; <low, F>}. We can
specify the conditional probability of having cancer in each of these four cases. One way
to represent these conditional probabilities is in a table, as in Table 7.4.
Once all the conditional probability distributions are determined, our causal Bayes
net captures all of the relevant knowledge available. Now we can start to reason with
it. Reasoning with a Bayes net amounts to the task of computing a posterior probability
distribution for one or more variables of interest given the values of variables that you have
information about. These computations are governed by Bayesian conditioning. Think
of this as updating your beliefs about a variable based on changes to your beliefs about
other variables. The arrows connecting nodes in the causal Bayes net show the paths that
probability distribution changes follow.
Belief updating can happen either from cause to effect, based on information about the
value of a cause variable, or from effect to cause, based on information about the value
of an effect variable. For example, if we’re certain that the patient has dyspnoea, and her
x-ray results are negative, then we can update our diagnosis about whether the patient
has cancer, a causal influence on both dyspnoea and x-ray results. In turn, updating our
diagnosis of cancer will affect our beliefs about whether the patient is a smoker and lives
in an area with high levels of pollution, proceeding up the chain of causal influence. Or
if we are certain that the patient is a smoker, we can update our beliefs about her chance
of having lung cancer accordingly, which is causally influenced by smoking status. This
also influences our expectations of the x-ray result.
A different type of reasoning with causal Bayes nets regards the relationship between
two causes that compete to explain an observed effect. In our case, smoker and pollution
are two such causes. They compete to explain the value of the variable lung cancer, which
they both influence. Suppose we learn that the patient has cancer. This new piece of infor-
mation raises the probability of both possible causes. Suppose that we learn further that
the patient lives in a badly polluted city. Something interesting would now happen in our
causal Bayes net. This new piece of information explains the patient's having cancer,
and it also lowers the probability that the patient is a smoker. Although the variables
smoker and pollution are initially probabilistically independent, given that we know that
the patient has cancer and lives in a highly polluted area, the probability that the patient
is a smoker goes down. Now that we know the patient has been exposed to significant
pollution, this information accounts for the lung cancer and disrupts the attribution of
a probabilistic association between lung cancer and smoking. Put another way, we don’t
need to speculate that the patient was a smoker in order to explain the lung cancer.
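The effect can be computed directly. The sketch below combines the conditional
probabilities of Table 7.4 with priors for pollution and smoking that the chapter doesn't
specify (0.3 each, assumed here purely for illustration):

    # 'Explaining away' with the Table 7.4 probabilities. The priors for
    # Pollution = high and Smoker = T (0.3 each) are assumed for illustration.
    p_pollution_high = 0.3
    p_smoker = 0.3
    p_cancer = {("high", True): 0.050, ("high", False): 0.020,
                ("low", True): 0.030, ("low", False): 0.001}

    def joint(pollution, smoker):
        """Pr(Pollution, Smoker, Cancer = T), with Pollution and Smoker independent."""
        pp = p_pollution_high if pollution == "high" else 1 - p_pollution_high
        ps = p_smoker if smoker else 1 - p_smoker
        return pp * ps * p_cancer[(pollution, smoker)]

    # Pr(Smoker = T | Cancer = T): sum over both pollution levels.
    num = joint("high", True) + joint("low", True)
    den = num + joint("high", False) + joint("low", False)
    print(f"Pr(smoker | cancer) = {num / den:.2f}")  # about 0.70

    # Pr(Smoker = T | Cancer = T, Pollution = high): pollution is now known.
    num2 = joint("high", True)
    den2 = num2 + joint("high", False)
    print(f"Pr(smoker | cancer, high pollution) = {num2 / den2:.2f}")  # about 0.52

Learning about the pollution lowers the probability of smoking from about 0.70 to about
0.52: the known cause partially explains away the other.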
In the simple cases we’ve considered, a Bayes net is fully specified, and then used to
make causal inferences and predictions. In some scientific applications, in contrast, causal
Bayes nets are incomplete in two respects. First, there are many other variables that could be
added to the model; variables that precede, mediate, or follow the variables that are explicitly
represented. Second, information might be lacking about the causal relationships between
variables represented in the model. In this case, the structure of the network and the relevant
probabilistic dependencies must be learned from data as the model is developed.
Cognitive neuroscientists, for example, are interested in the causal relationships between
brain areas that support the same cognitive capacity. To find out about these causal rela-
tionships, they often rely on brain imaging data, where subjects perform tasks that tap the
cognitive capacity of interest while having their brain activity recorded. Neuroscientists
already have some background knowledge about which brain regions might be involved
in a task, so they often focus their attention on recorded activity from only a few regions
of interest, each one of which can be treated as a variable and represented as a node in
a causal Bayes net. The challenge is then to discover the causal structure of these regions
of interest—to determine the nature of the arrows.
Machine learning algorithms help neuroscientists to tackle this challenge. One of these
algorithms searches the brain imaging data set to find the causal structure that best
helps scientists explain observed statistical dependencies between the variables of interest.
Roughly, this search procedure begins with a graph with no arrows. Arrows are added
sequentially, based on how well they would help account for observed correlations. When
no further addition of arrows can improve the account of observed correlations, the pro-
cedure moves to eliminating arrows until the account is as simple as it can be while still
matching the observed correlations. The resulting causal structure is invoked as the best
explanation of the observed data (Glymour, 2007).
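For the curious, here is a toy version of such a score-based search. It is far simpler than
the algorithms actually used in this literature, and it assumes continuous, linear-Gaussian
data scored with BIC; both are assumptions of this sketch, not claims about the cited work.

    # A toy greedy structure search: add the single best arrow while it
    # improves a BIC score, then prune arrows that no longer pay their way.
    import numpy as np

    def creates_cycle(edges, a, b):
        """Would adding the arrow a -> b create a directed cycle?"""
        stack, seen = [b], set()
        while stack:
            node = stack.pop()
            if node == a:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(child for parent, child in edges if parent == node)
        return False

    def bic(data, edges):
        """Regress each variable on its parents; sum penalized log-likelihoods."""
        n, d = data.shape
        score = 0.0
        for child in range(d):
            parents = [p for p, c in edges if c == child]
            X = np.column_stack([data[:, parents], np.ones(n)]) if parents else np.ones((n, 1))
            beta, *_ = np.linalg.lstsq(X, data[:, child], rcond=None)
            resid = data[:, child] - X @ beta
            sigma2 = max(resid @ resid / n, 1e-12)
            score += -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian log-likelihood
            score -= 0.5 * X.shape[1] * np.log(n)                 # complexity penalty
        return score

    def greedy_search(data):
        d = data.shape[1]
        edges = set()
        while True:  # forward phase: add the best single arrow, if any helps
            base = bic(data, edges)
            candidates = [(a, b) for a in range(d) for b in range(d)
                          if a != b and (a, b) not in edges and not creates_cycle(edges, a, b)]
            scored = [(bic(data, edges | {e}) - base, e) for e in candidates]
            gain, best = max(scored, default=(0.0, None))
            if gain <= 0:
                break
            edges.add(best)
        for e in sorted(edges):  # backward phase: drop arrows that no longer help
            if bic(data, edges - {e}) >= bic(data, edges):
                edges.discard(e)
        return edges

Applied to an array with one column per region of interest, greedy_search returns a set of
arrows (i, j), read as 'variable i directly influences variable j', as the search's best
account of the observed correlations.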

Assumptions of Causal Modeling


Reasoning with causal Bayes nets and other forms of causal modeling requires a num-
ber of assumptions (see Eberhardt, 2009). In closing this chapter, we’ll discuss three
such assumptions: modularity, the causal Markov condition, and faithfulness. When these
assumptions are satisfied, causal Bayes nets are promising for learning reliably about
causal relationships between variables from their observed statistical features. The failure
of these assumptions can, in some cases, undermine the usefulness of causal Bayes nets.
Modularity is the assumption that interventions on some causal relationship will not
change other causal relationships in the system. If a system is modular and there is a cor-
rect causal Bayes net of that system, then dependencies between variables in the model
that are not directly manipulated should not change. Thus, if modularity holds, it should
be possible to change the value of a variable X in the Bayes net without making arrows
into variables that depend on X appear or disappear. If variable X is not a cause of Y, then
the probability distribution of Y should remain unchanged when there’s an intervention
on X. In contrast, in systems that are not modular, an intervention on one variable may
change other causal relationships in the system.
The assumption of modularity allows one to make precise predictions about the effects
of intervening on a particular variable. When the modularity assumption is not satisfied,
Bayes nets may not provide correct answers to questions about the effects of an inter-
vention. Different systems are modular to varying degrees. Systems that are not modular
can sometimes be rearranged so that their causal relationships can be correctly represented
by a Bayes net.
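One payoff of modularity is that the effect of an intervention can be computed by 'graph
surgery': cut the arrows into the manipulated variable and leave every other conditional
probability untouched. Using the numbers of Table 7.2, the sketch below contrasts merely
observing that a friend reported you with intervening to force a report:

    # Observing vs. intervening on Reported in the Figure 7.5 net. Intervening
    # cuts the arrow C -> R (modularity: nothing else changes), so C keeps its
    # prior; merely observing R = T instead updates our beliefs about C.
    p_c = 0.20                                         # Pr(C = T)
    p_r = {True: 0.01, False: 0.40}                    # Pr(R = T | C)
    p_s = {(True, True): 0.99, (True, False): 0.80,
           (False, True): 0.90, (False, False): 0.00}  # Pr(S = T | C, R)

    # Pr(S = T | do(R = T)): average over the prior on C.
    p_do = p_c * p_s[(True, True)] + (1 - p_c) * p_s[(False, True)]

    # Pr(S = T | R = T): average over the posterior on C given the observation.
    p_r_marg = p_c * p_r[True] + (1 - p_c) * p_r[False]
    p_c_given_r = p_c * p_r[True] / p_r_marg
    p_obs = p_c_given_r * p_s[(True, True)] + (1 - p_c_given_r) * p_s[(False, True)]

    print(round(p_do, 3), round(p_obs, 3))  # 0.918 vs. 0.901: not the same

The two probabilities differ because an observed report is evidence that no copyrighted
material was posted, whereas a forced report carries no such information.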
Closely associated with modularity is the causal Markov condition. This is one of the
most important assumptions of causal Bayes nets. The causal Markov condition specifies
that each variable in a Bayes net, conditional on its direct causes, is independent of all
other variables other than its direct and indirect effects. The basic idea is that remote
causes do not matter to conditional probabilities, and thus to causal inference, so long
as we know the immediate causes of an event. In the dyspnoea case, for example, the
causal Markov condition assumes that whether the patient has a positive x-ray is influ-
enced by whether he has cancer, but taking into account whether he has cancer, it is not
influenced by whether he is a smoker, or by whether he lives in a high pollution area.
The idea is that cancer causes a positive x-ray result, whether the cancer was caused by
smoking or by pollution.
The causal Markov condition indicates which variables will be probabilistically inde-
pendent conditional on other variables. This enables scientists to reason from probabilistic
information to causal relationships. If the causal Markov condition holds, then a Bayes net
can correctly represent the absence of a direct causal relationship with the conditional
independence of two variables. Our earlier reasoning about vaccination, immunity, and
autism relied on exactly this. The causal Markov condition might fail if the set of variables
included in a Bayes net is incomplete in certain ways. But, here too, there are sophisticated
machine learning techniques for causal discovery that work reliably.
The third assumption of causal Bayes nets we’ll discuss is faithfulness. While the Markov
condition indicates which variables in a Bayes net will be probabilistically independent,
faithfulness specifies which variables will be probabilistically dependent conditional on
other variables. In the dyspnoea example, if having cancer is causally related to tuber-
culosis, then TB and cancer in our Bayes net should be probabilistically dependent. The
basic motivation for the faithfulness condition is that a causal relationship between two
variables entails, almost always, a probabilistic dependence between those variables. This
implies that the probabilistic influence of different causal pathways from one cause to
an effect will not exactly cancel out each other’s influence.
However, faithfulness doesn’t always hold. In Section 7.1, we discussed two examples
of events that are causally related but uncorrelated. If smoking causes heart disease, but
also causes exercise, and exercise prevents heart disease, then the causal influence may
exactly cancel out. Here the faithfulness assumption fails. Failures of faithfulness don’t
compromise causal inference as seriously as failures of the causal Markov condition.
Conditions where faithfulness fails are much better understood than conditions where
the causal Markov condition fails, and the number of techniques for causal discovery that
don’t rely on faithfulness is larger.
There are many more assumptions underlying reasoning with causal Bayes nets, beyond
modularity, the causal Markov condition, and the faithfulness condition. As we have said
of causal modeling in general, specifying these assumptions, and seeing where they fail
to hold, is an important step toward making causal claims transparent. Understanding
how causal modeling works when some of these assumptions fail and what kinds of errors they
may introduce is one of the most important challenges at the forefront of current causal
modeling approaches.

EXERCISES

7.21 Describe what causal modeling can be used for. What are some advantages and
limitations compared to other strategies we have seen for learning about causal
relationships?
7.22 For each of the following cases, (a) indicate the causal hypothesis, explicitly distin-
guishing the cause from the effect; (b) offer another plausible cause for the effect;
and (c) draw a simple causal model to help you assess whether the reasoning
described in the case is good or bad.
1. You have eaten your birthday dinner at your favorite pizzeria in town for the
past 10 years. This year, you got sick. This was also the first time your uncle
Sam was there. You conclude you got sick because uncle Sam was there.
2. Every time Felipe goes to see Real Madrid play, they lose. Whenever he is not
there, they win. If I want Real Madrid to win, I had better not let Felipe go to
any more games.
3. Eryka normally goes to bed at midnight and gets up by 7:00 a.m. each morn-
ing. She usually runs two kilometers after having some breakfast. This morning,
however, she ran only half a kilometer and had to stop, as she was so tired.
She recalled that she had gone to sleep unusually early the night before and
concluded that too much sleep made her too tired to run.
4. In Albystown, there are two kinds of students: those who own a diary and those
who own a smartphone. A first-grade teacher in Albystown noticed that all the
students who consistently failed exams owned a smartphone. He concluded that
those students who own a smartphone are intellectually inferior to those who
own a diary, and that’s why they failed more exams.
5. Phineas Gage’s moral character changed dramatically after an explosion blew
a tamping iron through his head. Gage was leading a railroad construction
crew near Cavendish, Vermont, when the accident occurred. ‘Before the acci-
dent he had been a most capable and efficient foreman, one with a well-bal-
anced mind, and who was looked on as a shrewd smart business man.’ After
the accident, he became ‘fitful, irreverent, and grossly profane, showing little
deference for his fellows. He was also impatient and obstinate, yet capricious
and vacillating, unable to settle on any of the plans he devised for future action’.
7.23 Causal reasoning involves various types of probabilistic inferences: predictive infer-
ences (from causes to effects); diagnostic inferences (from effects to causes); and rea-
soning about interventions (what would happen if you manipulated a certain feature
of a system). For each of the following situations, (a) indicate whether you would
make a predictive or a diagnostic inference to find out about the events described;
(b) describe what intervention you would carry out to find out about the events
described; and (c) explain why you would make those inferences and interventions.
1. You are a physician working at a hospital, and you notice that some patients
have been infected with influenza.
2. You notice that you have a runny nose, body aches, and a sore throat.
3. You notice that there is an unusual smell coming from the engine of your car, while
the needle on the temperature gauge creeps up quickly past the normal limit.
4. Every morning, you notice a continuous tinkling noise coming from the kitchen
in your apartment.
5. You notice that the countryside of your town has more animals than the site
could support for a grazing season.
7.24 Describe the important elements of a causal Bayes network and what each represents.
7.25 A group of psychologists is interested in how intrinsic motivation of university students
affects their exam results. They believe that intrinsic motivation affects both class atten-
dance and home preparation (reading the textbooks, doing the assignments, and so on).
They also believe that both class attendance and home preparation affect exam results.
They do not believe that there are any further causal interactions. All relevant variables
(intrinsic motivation, class attendance, home preparation, and exam results) have two
values: high and low for intrinsic motivation, class attendance, and home preparation
and pass and fail for exam results. The psychologists observe the following frequencies:
1. 40% of all students have a high intrinsic motivation.
2. 90% of all highly motivated students attend classes regularly, as opposed to
60% of all students with low motivation.
3. 70% of all highly motivated students prepare well, as opposed to 20% of all
students with low motivation.
4. 80% of all students who prepare well and attend class regularly pass the exam.
5. 60% of all students who prepare well and do not attend class regularly pass the exam.
6. 45% of all students who do not prepare well and do attend class regularly pass
the exam.
7. 40% of all students who do not prepare well and do not attend class regularly
pass the exam.
Draw the causal Bayes net that corresponds to the story. Then, suppose that the uni-
versity implements a new policy that forces students to attend class. Assume that all
students comply with this policy. From the causal Bayes net and the frequencies given
above, determine the probability that students pass the exam after this intervention.
7.26 Construct causal Bayes nets for simple examples of causal relationships with (a) a
common cause structure, (b) a common effect structure, and (c) a chain structure.
7.27 Tillbourg College admits students who are either brainy or sporty (or both). Let C
denote the event that someone is admitted to Tillbourg College, which is made true
if they are either brainy (B) or sporty (S). Suppose in the general population, B and
S are independent. Draw a causal Bayes net to represent this situation, defining all
relevant variables and probabilities. If you learn that all students at Tillbourg College
are sporty, what can you infer about the value of S? Explain your reasoning.
7.28 Give an example of explaining away, a situation in which discovering one causal
relationship diminishes the probability of some presumed cause.
7.29 Suppose that we measure the variables storm (S), barometer reading (B), and atmo-
spheric pressure (A). You find that storm and barometer reading are probabilisti-
cally dependent, as are barometer reading and atmospheric pressure, and storm
and atmospheric pressure. Furthermore, you find that storm and barometer reading
given atmospheric pressure are independent. From these constraints alone (assum-
ing the causal Markov condition and faithfulness hold), what underlying causal
structures can you infer? For each, provide a causal Bayes net.

FURTHER READING

For more on the psychology of causal reasoning, see Sloman, S., & Lagnado, D. (2015).
Causality in thought. Annual Review of Psychology, 66, 223–247.
Pasteur’s influence on the history and sociology of medicine is described in more detail in
B. Latour’s (1993). The pasteurization of France. Cambridge: Harvard University Press.
For an account of the difference-making view of causation and its importance in scientific
explanation, see Woodward, J. (2003). Making things happen: A theory of causal expla-
nation. Oxford: Oxford University Press.
For a pluralist view of the nature of causation and discussion of causal analysis, includ-
ing causal Bayes nets, see Cartwright, N. (2007). Hunting causes and using them:
Approaches in philosophy and economics. Cambridge: Cambridge University Press.
For advanced treatments of causal modeling, see Pearl, J. (2009). Causality: Models, rea-
soning, and inference, 2nd edition. New York: Cambridge University Press. Also see
Spirtes, P., Glymour, C., & Scheines, R. (2001). Causation, prediction, and search, 2nd
edition. Cambridge: MIT Press.

CHAPTER 8

Explaining, Theorizing,
and Values

8.1 UNDERSTANDING THE WORLD

After reading this section, you should be able to do the following:

• Articulate the roles played by explanation in science
• Describe the nomological, pattern-based, and causal conceptions of explanation
• List advantages and problems of each of these three conceptions of explanation

The Workdays of Taxi Drivers


Many taxi drivers in New York City are independent workers. They lease their cabs for
a fixed fee, or they own them outright. They keep the fares they earn. And they can call
it a day at any time during their shift. Some days are good for business: during weekdays
when Wall Street is open, when it’s rainy, or when some big event is in town. On a good
day, taxi drivers spend less time cruising around searching for customers, and they earn
a relatively high hourly wage. Other days are bad, however, and taxi drivers may have
a hard time finding customers. So, although taxi fares are set by law, taxi drivers' daily
wages can fluctuate significantly. Given this variation in their hourly wages from day
to day, how do taxi drivers in New York City choose the number of hours they work
each day?
Economists can answer this question by appealing to the law of supply. According
to the law of supply, there is a direct relationship between price and quantity of
goods and services: generally, as the price of an item increases, suppliers will attempt
to increase their profit by increasing the quantity of items offered for sale. So, people
will sell more of something when the price is high than when the price is low. Taxi
drivers in particular will tend to sell more of their labor hours—that is, they will
tend to work longer—when wages are higher than when they are lower. In other
words, they will work more when it really pays off and cut out early on bad days
when it doesn’t.
The law of supply—along with its counterpart, the law of demand—is one of the
most fundamental and intuitive explanatory principles in economics. Assuming people
strive to do what is in their best interest, economists invoke general principles like the
laws of supply and demand to explain how people set the prices of goods and services
and how people allocate resources like their time. When an employer pays higher over-
time hourly rates, the number of hours employees are willing to work increases. When
consumers are willing to pay more for a slice of pizza than for a cupcake, bakeries will
increase their production of pizza and reduce the production of cupcakes. The law of
supply captures the relationship between price changes and suppliers’ behavior, as in
these examples.
Psychologists give a different answer to the taxi-driver question. There is a theory about
daily income, called the ‘daily-income-targeting theory’, that appeals to two psychological
tendencies. One tendency is that, when confronted with multiple related decisions over a
period of time, people often consider the merits and weaknesses of only a single decision
at a time, instead of considering the consequences of all decisions at once. The second
psychological tendency is loss aversion: people dislike losing money or other resources
more than they enjoy gaining similar amounts. Applied to taxi drivers, these tendencies
suggest that their decisions about how much to work are made day by day instead of all at
once and that they generally will resist quitting until they reach their daily target income.
This predicts that taxi drivers will work longer hours on low-wage days and quit early on
high-wage days. This is, of course, the opposite of what economists’ law of supply predicts.
A group of economists and psychologists tested these competing predictions by car-
rying out a field study, where they analyzed data about New York taxi drivers’ behavior
from the years 1988, 1990, and 1994 (Camerer et al., 1997). Their data indicated that
less-experienced drivers tend to work more hours on bad days, when working does not
pay off, and to clock off too early on good days. The income-targeting theory explains this
apparently irrational behavior in a simple way: inexperienced taxi drivers use a simple rule
of thumb—a heuristic—that guides them to aim for a certain amount of earnings over a
certain period of time. If they are falling behind that rate, they work longer to catch up,
and if they are ahead, they quit early.
The data showed that more experienced taxi drivers don’t display this pattern of
behavior. To figure out why, the researchers evaluated their data sets with an eye to other
possible explanations. Two seemed plausible: taxi drivers may learn with experience to
resist the temptation to quit early on good days, or they may simply learn
that driving a fixed number of hours each day is more efficient than aiming for a certain
amount of money. Neither of these possible explanations appeals to general economic
principles. Taxi drivers, inexperienced or experienced, don’t seem to act in accordance
with the law of supply.

Explanation, Understanding, and Scientific Knowledge


In Chapter 1, we discussed how science aims at the production of knowledge—an aim
that is constitutive of the very meaning of the word science. There are many kinds of
knowledge, each of which can be important. But the kind of knowledge that science aims
to produce is distinctive: scientific knowledge is explanatory knowledge of why or how
the world is the way it is. In the case of taxi drivers, scientists have used both the law
of supply and the daily-income-targeting theory to attempt to explain how taxi drivers
decide how long to work each day. This can help scientists understand more generally
how humans make decisions about things like time and revenue.
Unfortunately, the explanation issuing from the law of supply doesn’t seem to ade-
quately account for the behavior of actual taxi drivers, and so it doesn’t produce knowl-
edge of how or why drivers make the decisions that they do. The explanation based on
the daily-income-targeting theory does a better job at accounting for drivers’ behavior.
This explanation seems to help us understand taxi drivers’ decisions about how long to
work each day, and it may be a promising start for explanations of other, similar human
behavior (for a nice example see Camerer, 1997).
To say that science aims to produce a special kind of knowledge is not to say that scien-
tific explanations are entirely different from ordinary, everyday explanations. The explana-
tory knowledge produced in science is a special kind of knowledge, explicitly supported by
evidence through the use of methods discussed in this book. But there’s significant overlap
between scientific and everyday forms of explanation. All of us sometimes notice things
that cry out for explanation. We routinely ask questions such as: ‘how much does drinking
corrode the liver?’, ‘why did the economic crisis happen?’, ‘why do colleges and universi-
ties have vastly more highly paid administrators than they used to, given steep declines in
public funding for higher education?’, and, of course, ‘how did the dinosaurs go extinct?’
Even children regularly engage in this pursuit of explanatory knowledge. Many have
wondered why the sky is blue. A parent might quickly answer that the sky is blue because
it looks that way to us or because that’s just the way the sky is. Such answers don’t explain
why the sky is blue; they offer no insight into why or how the phenomenon is the way it
is. A satisfying explanation of why the sky is blue relies on some sophisticated scientific
theorizing: sunlight travels in straight lines unless some obstruction either reflects it, like a
mirror; bends it, like a prism; or scatters it, like the molecules of gas in the Earth’s atmo-
sphere. Because blue light has shorter wavelengths, it is scattered more than other colors
in the spectrum. That’s why we normally see a blue sky. In contrast to most parents’ quick
answers to this question, this explanation appeals to other facts about the world and scien-
tific laws or theories in order to give a deeper understanding of the phenomenon in question.
Generating explanations serves a variety of cognitive roles. It facilitates learning and
discovery, and plays a central role in confirmation and reasoning. As we discussed in
Chapter 4 in relation to abductive reasoning—also known as ‘inference to the best expla-
nation’—explanatory considerations can be used as evidence in support of a hypothesis,
making the hypothesis more credible. With respect to learning, generating explanations
to oneself or to others facilitates the integration of new information into existing bodies
of knowledge and can lead to deeper understanding; this is called the self-explanation
effect. Performance on a variety of reasoning tasks, including logical and probabilistic tasks,
can be improved when one is asked to explain. This is why explaining the study mate-
rial and responding to explanatory questions is such a good way to learn new material
encountered in a course. Instructors and tutors learn material faster and with more depth
of insight by virtue of explaining it to others.


Perhaps most important among these cognitive roles, explanation produces under-
standing. Understanding the world around and within us is a supreme achievement that
is absolutely central to science. Understanding involves grasping why or how something
came about or is the way it is. This makes it possible for us to intervene in the world
and to anticipate what will happen next. When we understand how a system works—say,
the tidal system of the San Francisco Bay, an example from Chapter 3—we are able to
anticipate how changes in some features of the system (like the Reber Plan) will lead to
changes in some other features (tides, salinity, and so forth).
When explanations generate genuine understanding, they can satisfy our curiosity. To
satisfy our curiosity and have that experience of ‘Aha! Now I get it!’ can feel really good.


Psychologist Alison Gopnik (1998) once likened understanding to orgasm. Sex evolved to
feel good because it leads to babies, which is needed for a species to continue. Similarly,
Gopnik reasoned, understanding is enjoyable because explanations are tremendously help-
ful to people getting around in the world. And so, the desire to satisfy our curiosity has
led humans to ever more sophisticated and accurate theories about our world.
The satisfaction of curiosity is no guarantee of a good explanation, though. People
can have a sense that they understand something without genuinely understanding it—
explanations can be wrong. People also often fall prey to an illusion of explanatory
depth, believing they understand the world more clearly and in greater detail than they
actually do. We all regularly overestimate our competence and depth of knowledge; recall
our discussion in Chapter 1 of the cognitive errors, like confirmation bias, which science
is designed to correct for.
An illustration of how one can be dangerously misled by the feeling one understands
something is the public reception of climate change research. As you may recall, cli-
mate change was originally called ‘global warming’. But this terminology misled many
people about what they should expect to experience. When a season was not warmer
than usual in some particular location, some people were tempted to doubt the reality
of climate change—it seemed to them like things weren’t getting warmer after all. But
climate change does not produce warmer temperatures in every location at every point
in time. Instead, it produces a global increase in average temperatures and increasingly
extreme weather and storms along the way.
Unfortunately, some people—including some politicians who shape how nations
respond to climate change—still disregard scientific knowledge of climate change because
of apparent conflicts with the daily weather they experience. Figure 8.1 pictures Oklahoma
Senator James Inhofe speaking before the US Congress in Washington D.C. in February
2015. Inhofe brought a snowball to illustrate that it was (he claimed) unseasonably cold
outside. In fact, it was not unusually cold in D.C., and meanwhile, the West Coast of
the United States was unusually warm. The year prior, 2014, had the warmest average
temperatures in recorded history, and the Earth has continued to warm in the years since.

FIGURE 8.1 Oklahoma Senator James Inhofe speaking before the US Congress in 2015 while
brandishing a snowball. Reproduced from C-SPAN
Another example of the illusion of explanatory depth concerns public reception
of neuroscientific information. Experimental data suggest that people are often mis-
led into judging bad psychological explanations as better than they really are when
accompanied by completely irrelevant neuroscientific information. This ‘seductive
allure’ of neuroscientific explanations might interfere with people’s ability to criti-
cally evaluate the quality of an explanation (Weisberg et al., 2008). Coupled with
an illusion of explanatory depth, this interference can have negative practical effects
when, for example, it is exploited by advertisements for ‘brain training’ that promise
brain enhancement ‘proven by neuroscience’. This is the opposite of the climate change
case. Instead of scientific expertise being disregarded because of personal experience,
scientific credibility is misapplied to get people to believe something there’s not actu-
ally sufficient evidence for.
Given the centrality of explanation to the scientific enterprise and the potential for
all people, including scientists, to feel like they understand something even when they
do not, it’s an important task to clarify the nature of scientific explanation. If we can say
what features good explanations must have, then we will be better able to judge whether
something counts as an adequate explanation.
One simple idea is that explanations are just true answers to why or how questions,
such as ‘why is the sky blue?’ or ‘how do bicycles move?’ But we have suggested that
some true answers to the question of why the sky is blue, like ‘because that’s the way
it is’, don’t count as explanations. So, we need a way to determine when a true answer
to a why- or how-question is a good explanation. What features should good answers to
why- or how-questions have?
Philosophers of science and some scientists have thought long and hard about this
question. The possible answers relate to other topics we have discussed in this book. Some
have suggested that explanations should cite laws in order to account for phenomena,
either deductively or probabilistically. Another idea is that explanations should show how
phenomena fit into patterns. Others have suggested that explaining is a kind of causal
reasoning and that explanations should say what causes a phenomenon.
Nomological Explanation and Pattern Explanation


Let’s consider these conceptions of explanation in greater depth. The first is that suc-
cessful explanations appeal to scientific laws. This idea is at the heart of the nomological
conception of explanation (from the Greek nomos, meaning law). According to this con-
ception, a scientific explanation references a law that can account for the phenomenon
to be explained.
The nomological conception of explanation was developed most fully by the German
philosopher of science Carl Hempel (1905–1997). Hempel proposed that explanations
are arguments that appeal to general scientific laws to derive statements about the occur-
rence of the phenomena we want to explain. Explanations demonstrate that there are one
or more scientific laws or principles that, together with background conditions, make it
so that the phenomenon to be explained was to be expected. So, according to Hempel,
nomological explanations have a form like this:

1. L1, …, Ln
2. C1, …, Cn
∴ 3. E

In this scheme, L1, …, Ln are statements of general laws, such as the laws of supply and
demand in economics. C1, …, Cn are statements of background conditions, such as the
actual price and quantity of a good in some market at some time. And E is a statement of
the phenomenon to be explained, like a dramatic decrease in the number of people taking
taxis over the past year. Hempel believed that knowing the law and background condi-
tions would lead people to realize the phenomenon in question was to be expected. By
rendering phenomena expectable, scientific explanations reveal our world to be ordered,
proceeding in accordance with general laws.
Thus, if you want to explain why people are taking fewer taxis, you may begin by
stating the law of demand: all other factors being equal, as the price of a good increases,
the quantity of goods demanded by consumers decreases, and as the price of a good
decreases, the quantity demanded increases. Then you may point out an increase in ride-
share programs and cycling incentives and the advent of companies like Uber and Lyft.
(See Figure 8.2 for some relevant data.) From these background conditions and the law
of demand, it follows that taxi rides have gotten comparatively more expensive. And so,
as the law of demand predicts, many people who previously bought taxi rides are now

[Figure 8.2: Ridership data (trips per day) for Yellow Taxis and Uber in New York City,
2015–2017, based on data from reports by the New York City Taxi & Limousine Commission.]

doing so less often; they can instead use other, cheaper forms of transportation. That’s
why fewer people take taxis.
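To see Hempel's scheme at work, here is a minimal sketch in Python of the taxi
explanation. The linear demand function and all of its numbers are hypothetical
placeholders of our own, not real market data; the point is only the form: a law (the
demand function) plus background conditions (a change in relative price) yields the
phenomenon to be explained.

    # A toy covering-law explanation. The 'law' L is the law of demand,
    # modeled as a decreasing function of the relative price of taxi rides.
    def quantity_demanded(relative_price, baseline=450_000, slope=200_000):
        """Law of demand (toy version): higher relative price, lower demand."""
        return max(0, baseline - slope * relative_price)

    # Background conditions C: cheaper alternatives (ride-sharing, cycling)
    # raise the relative price of a taxi ride.
    relative_price_before = 1.0   # before ride-share programs
    relative_price_after = 1.8    # taxis now comparatively more expensive

    # E, the phenomenon to be explained, follows from L and C:
    print(quantity_demanded(relative_price_before))  # 250000 trips/day
    print(quantity_demanded(relative_price_after))   # 90000 trips/day
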
Hempel thought that some nomological explanations were valid deductive arguments,
while others were strong inductive arguments. (See Chapter 4.) As in the preceding
explanation scheme, the premises must include at least one statement of a scientific
law—a general pattern or regularity. The premises also must have empirical content, so
they can be tested.
Many scientific explanations fit this nomological conception of explanation. Consider
how scientists might explain the increase in the average global temperature of Earth’s
atmosphere. One can begin by pointing out that atmospheric density changes in propor-
tion to the permeability of the atmosphere to solar radiation and that the permeability
of the atmosphere to radiation is directly correlated with average surface temperature.
These are law-like generalizations that describe patterns and regularities in nature. Next,
note that the atmospheric density on Earth has increased (because of greenhouse gases).
This is a background condition, a fact about current circumstances. Together, these claims
deductively imply the conclusion that the Earth’s average temperature has increased. This
argument is deductively valid with all true premises, so we have a simple nomological
explanation of global warming.

Box 8.1 Scientific Laws


Laws play an important but controversial role in science. Historically, scientists have
taken the purpose of science to be formulating laws, which can in turn be used to
provide explanations and make predictions. Examples include Newton’s law of
gravitation in physics, Arrhenius’s equation in chemistry, and the laws of supply and
demand in economics. Laws often take the form of universal generalizations: rules
for inferring what, in general, follows from some set of conditions. The law of supply,
for example, is a way to infer the relative price of goods from the quantity of goods
supplied. Newton’s law of gravitation is a way to infer the force between two bodies
on the basis of their masses and the distance between them. However, some scien-
tific statements of regularities do not seem to qualify as laws. Genuine laws of nature
are frequently said to possess most or all of the following features (among others):

non-trivial        exceptionless          systematic
general            based on evidence      precise
true               explanatory            predictive


Although many scientific laws satisfy many of these criteria, few if any satisfy
them all. Philosophers thus debate what is required for something to count as a
law and whether all scientific inquiry involves discovering laws.

Just as phenomena can be explained by laws, scientific laws themselves can be explained
by appealing to other, more comprehensive laws. For example, consider Galileo’s law that
bodies fall toward Earth at a constant acceleration. This law can be deductively derived
from the Newtonian law of gravitation. The Newtonian force of gravity explains the
constant acceleration of bodies falling toward Earth. Newtonian laws, in turn, can be
explained by appealing to the principles of the more comprehensive general theory of
relativity developed by Einstein. The Earth’s gravity is explained as a distortion of space
caused by the Earth’s mass. Objects speed up as they fall toward Earth, just as a ball
rolling from the edge to the center of a bowl speeds up.
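The derivation can be checked numerically. Below is a minimal sketch using standard
textbook values for the gravitational constant and the Earth's mass and radius (the code
and rounding are ours): near the surface, the Newtonian acceleration GM/r² barely
changes with height, which is just Galileo's constant acceleration.

    # Galileo's law recovered as a special case of Newton's law of
    # gravitation: a = G * M_earth / r^2, evaluated near the surface.
    G = 6.674e-11        # gravitational constant, N m^2 / kg^2
    M_EARTH = 5.972e24   # mass of the Earth, kg
    R_EARTH = 6.371e6    # radius of the Earth, m

    def acceleration(height_m):
        r = R_EARTH + height_m
        return G * M_EARTH / r**2

    # Over everyday heights the value stays near 9.8 m/s^2, so falling
    # bodies accelerate (almost) uniformly, as Galileo's law states.
    for h in (0, 1_000, 10_000):
        print(h, round(acceleration(h), 3))
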
The idea of explaining scientific laws with reference to other, more general laws intro-
duces a second conception of explanation. According to the pattern conception, expla-
nations fit particular statements about phenomena into a more general framework of
laws and principles. This has been called a unification conception of explanation, since
the number of assumptions and beliefs required to explain phenomena decreases when
an explanation is provided. Phenomena, and laws as well, are unified by uncovering the
basic patterns that govern them.
One advantage of the pattern conception over the nomological conception is that
there’s no requirement of citing laws. Pattern explanations can cite regularities that may
not qualify as laws. In place of the law requirement, there’s an emphasis on fitting the
phenomenon to be explained into a wider pattern, to see it as one instance of a more
general regularity of the world that has been identified.
Earlier, we described the simple explanation of decreased taxi ridership as a nomologi-
cal explanation, but it can also be construed as a pattern explanation. The phenomenon of
decreased taxi ridership is explained as one instance of the general pattern whereby higher
prices drive decreased demand and vice versa, a pattern that also applies to purchases as
different from taxis as pizza, pomegranates, and tickets to the cinema.
Many scientific explanations fit the pattern conception of explanation rather well.
Consider evolutionary theory. This theory explains a great many phenomena involving the
traits of organisms and the relationships among them with reference to a simple pattern
that plays out in a multitude of ways. The pattern at the heart of evolutionary theory is
that natural selection acts on variation among organisms to produce cumulative change in
species. The theory of evolution is not a single, general law of nature; it recognizes many
different influences on evolution besides natural selection and random variation, which
proceed in various ways depending on various factors. Many evolutionary explanations
are thus not productively viewed as nomological explanations. But they do fit the pattern
conception rather well.
The ideas behind the nomological and pattern conceptions of explanation—that expla-
nations make phenomena less surprising by referencing laws or by showing how they fit
into a wider pattern—are undoubtedly important. These ideas describe important features
of many scientific explanations. But both also face significant objections.


A significant problem for both is that they neglect a key feature of explanation:
asymmetry. If one thing explains another, then this explanatory relation does not
seem to hold in reverse. Consider the following example. Your mobile phone sends
you a weather alert, and you explain this both with reference to the fact that a storm
is approaching and the generalization that weather alerts are sent out when storms
are approaching. But, it seems, you can’t explain the approaching storm by citing the
weather alert you received, along with the generalization that weather alerts are sent
out when storms are approaching. This mixes things up: the storm isn’t approaching
because you received the weather alert. The weather alert gives you evidence of the
storm, but it can’t explain the storm.

Yet, the nomological and pattern conceptions of explanation don’t recognize this asym-
metry. Consider the nomological account. Suppose that, in general, weather alerts are
sent out when storms are approaching. Then, from this generalization and the premise
that a storm is approaching, you could explain why you received a weather alert. This is
a valid argument as required for nomological explanation. But the generalization about
weather alerts being sent when storms are approaching and the premise that you received
a weather alert can also be used to deductively infer that a storm is approaching. There
is a valid deductive argument whether the weather alert or the approaching storm plays
the part of background information.
And yet, the storm is a good explanation for why you received a weather alert, but the
weather alert is no explanation for why the storm is approaching. You can do a lot with
your mobile phone, but you can’t usually bring about a storm. The problem is similar
for the pattern conception of explanation. There is a general pattern relating weather
alerts to approaching storms. What’s to say that this pattern can explain weather alerts
but can’t explain storms? That difference isn’t accounted for by the pattern conception.
A second problem for the nomological conception is that many good explanations
don’t appeal to any laws. We have already suggested that some evolutionary explanations
are successful without appealing to laws. Here’s another example. Why dinosaurs went
extinct some 65 million years ago is explained by one of two hypotheses: either there
was a massive bout of volcanism or an enormous asteroid hit the Earth. Either event
would have had dire consequences on Earth’s climate and on dinosaurs’ ecosystems, and
whichever occurred caused dinosaurs’ extinction because of those consequences. But
neither of these explanations involves a general law of nature. We can’t say, in general,
what to expect on the basis of a volcano or asteroid collision. This depends on numerous
circumstances related to the nature of the catastrophic event, the organisms in question,
and other factors.
There is a related second problem for the pattern conception. The pattern conception
focuses on explanations that fit a phenomenon into a wider pattern. But some explana-
tions seem to be highly specific. Consider the explanation for how the human heart pumps
blood. This explanation may not apply to the function of the hearts of other kinds of
organisms. This is because hearts and other organs vary across species, and their differences
are more significant the more distantly related organisms are. Something similar is true
for the explanation for why dinosaurs disappeared. Both the volcanism explanation and
the asteroid explanation are highly specific. They rely on particular conditions on Earth
over 65 million years ago to help account for this extinction event. Nothing guarantees
that these circumstances will ever recur; the explanation might account for only this one
phenomenon, ever. So, these explanations do not describe general patterns. Still, it seems like whichever
of these is true is a good explanation.
Here’s one final concern with the nomological and pattern conceptions. Discussion of
laws has been decreasing in science. The decline is perhaps most evident in the psychologi-
cal sciences. Psychologists are spending less and less time discovering and appealing to laws
in their research. A bibliometric study of abstracts from the PsycLit database—indexing
psychological research papers and journals—during the last century (1900–1999) looked
at over 1.4 million abstracts and found 3,093 citations of law—an average of 22 cita-
tions per 10,000 entries (Teigen, 2002). As shown in Figure 8.3, the average number of
such references significantly dwindled over time. Further, the laws psychologists are most

[Figure 8.3: Occurrences of the word 'law' in PsycLit abstracts per 10,000 entries, by
decade (Teigen, 2002). Approximate counts: 1900–1909: 286; 1910–1919: 138;
1920–1929: 119; 1930–1939: 82; 1940–1949: 40; 1950–1959: 28; 1960–1969: 32;
1970–1979: 23; 1980–1989: 13; 1990–1999: 10.]

concerned about or familiar with were discovered long ago, with the most commonly
cited laws discovered from 1834 to 1957.
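The reported rate is easy to verify with a little arithmetic (the rounding is ours):

    # Teigen's (2002) figures: 3,093 occurrences of 'law' in roughly
    # 1.4 million abstracts, expressed per 10,000 entries.
    citations = 3_093
    abstracts = 1_400_000
    print(round(citations / abstracts * 10_000, 1))  # 22.1 per 10,000 entries
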
If psychology is any guide, the nomological conception of explanation is in trouble.
And the pattern conception’s emphasis on broad patterns might be plagued with similar
difficulties. Over this same period of time, psychologists have established very few general
relationships between empirically measured variables—that is, very few general patterns.

Causal Explanation
Many laws and patterns in phenomena are also called effects. For example, in psychol-
ogy, there is the Garcia effect: an aversion to a particular taste or smell associated with
a negative reaction. This is why you might have trouble ever again eating whatever food
you had right before a bout of stomach flu. There are the primacy and recency effects,
according to which people recall more easily items at the beginning (primacy) and items
at the end (recency) of a list. And there is the self-explanation effect, described earlier
in this chapter. This is where explaining something to yourself boosts your learning and
helps you integrate new knowledge with existing knowledge.
The convention of referring to certain patterns as effects isn’t limited to psychology.
Consider the Larsen effect in acoustics. A public-address (PA) system has several major
components: microphone, mixing console and soundperson, amplifier, and loudspeaker.
If the soundperson registers that the broadcast is inaudible to the audience, she can
adjust the volume level via the mixing console to increase the microphone's input
sensitivity, so that it picks up the speaker's vocalizations more effectively; the amplifier
then feeds this stronger signal to the loudspeaker. This system is a basic homeostatic mechanism
involving feedback. But if volume levels increase beyond optimal values, the loudspeaker
can emit an unpleasant, high-pitched, runaway squeal. This feedback pattern is called the
Larsen effect.
The Larsen effect can be invoked to explain why there’s a squeal when a soundperson
adjusts the volume on a PA system. But this effect is also something that stands in need
of explanation. The explanation of the Larsen effect is that the microphone pickup locks
on to, or couples with, the natural vibration produced by the loudspeaker, which causes
them to begin resonating together. This pure tone resonance, or ‘ring’, causes the loud-
speaker to further increase in efficiency, and the microphone picks it up again and relays
it back to the loudspeaker. The coupling process is repeated at the speed of sound, and
as the set-points for minimum and maximum volume are exceeded, the resonance seizes
the system with abnormal levels of gain. This transition occurs very suddenly, temporarily
arrests the broadcast, and is dangerous to the system (including the ears of people in the
audience) if left unattended.
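Why is the squeal runaway rather than gradual? Here is a minimal sketch with purely
illustrative numbers of our own: each pass through the microphone, amplifier, and
loudspeaker loop multiplies the signal by the loop gain, so once that gain exceeds 1,
the level grows geometrically instead of settling down.

    # A toy model of the Larsen effect as a positive feedback loop.
    def feedback(initial_level, loop_gain, passes):
        level = initial_level
        for _ in range(passes):
            level *= loop_gain   # one trip around the loop
        return level

    print(feedback(0.01, 0.9, 50))   # loop gain < 1: the signal dies away
    print(feedback(0.01, 1.1, 50))   # loop gain > 1: runaway squeal
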
Consider a second pattern that also involves homeostasis, that is, a stable equilibrium
among interdependent elements. The scientific explanation of an organism’s regulation of
blood sugar appeals to homeostatic systems that use pancreatic endocrine hormones to
maintain blood sugar within a certain range (≈70–110 milligrams of glucose per 100 mil-
liliters of blood). If blood sugar decreases below this range, pancreatic alpha cells secrete
glucagon, which causes the liver to release stored glucose. If blood sugar increases above
the range, pancreatic beta cells secrete insulin, which causes adipose tissue to absorb glu-
cose from the blood. This explanation is also part of the explanation of diabetes, which
is a disorder characterized by the pancreas producing insufficient amounts of insulin.
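The negative-feedback logic of this mechanism can be sketched in a few lines of code.
The thresholds below follow the roughly 70–110 mg/dL range just described, while the
size of each hormonal correction is a made-up placeholder.

    # A toy negative-feedback controller for blood sugar.
    LOW, HIGH = 70, 110   # mg of glucose per 100 mL of blood

    def regulate(glucose):
        if glucose < LOW:
            return glucose + 10   # glucagon: liver releases stored glucose
        if glucose > HIGH:
            return glucose - 10   # insulin: tissues absorb glucose
        return glucose            # within range: no correction needed

    level = 130
    for _ in range(4):
        level = regulate(level)
        print(level)   # 120, 110, 110, 110: the system settles into range
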
These patterns, or effects, seem to be explained by describing their causes. The Larsen
effect is caused by a coupling between the microphone pickup and the loudspeaker’s
vibration, and this explains the volume feedback. According to the causal conception,
explanations appeal to causes that bring about the phenomenon to be explained. The
causal conception seems to account well for many explanations in science, including
especially in fields that do not deal with laws. As we emphasized in Chapter 7, knowledge
of causes enables prediction and manipulation of phenomena, via intervention on causal
factors. It’s plausible that explaining those causal factors is also central to understanding.
In one variety of causal explanation, the focus concerns how causal factors regularly
combine into complex systems that produce the target phenomenon. The blood sugar
regulation example exhibits this nicely. Pancreatic hormones, liver tissue, and blood sugar
ordinarily work together in complex ways to maintain blood sugar levels within a nar-
row range. Some call this variety of causal explanation mechanistic. The search for causal
mechanisms seems to play an especially important role in some parts of the social and
life sciences.
A causal conception of explanation can address the concerns raised earlier with the
nomological and pattern conceptions. First, causal explanations are automatically asym-
metric: causes explain their effects, but effects cannot explain their causes. This solves the
symmetry problem of nomological and pattern accounts. The reason why appealing to
the storm explains your mobile phone’s weather alert, but appealing to the weather alert
doesn’t explain the storm, is that the storm’s approaching is a causal factor in producing
your mobile phone’s alert; but the alert didn’t cause the storm.
Second, some causal relationships occur in very general patterns or are law-like in
nature, but others do not. If you heat ice, it will melt or evaporate. There are virtually no
exceptions to this. If you heat chocolate, it will usually melt—but if you heat it too quickly,
it gets thick and lumpy instead. This is a general pattern, but it has some exceptions. In
contrast, perhaps the volcanic or asteroid episode that led to the dinosaurs’ extinction was
an event that will happen only once in the Earth’s history. Perhaps background conditions
had to be just right for such an event to cause a major extinction. But all of these, from
the law-like to the highly particular, are still cause-effect relationships.
This resolves the second concern with the nomological and pattern accounts: causal
explanations can range from the highly general to the highly specific. The third concern
raised with the other conceptions of explanation is that laws and general patterns seem
to be of decreasing importance in science. In contrast, we suggested in Chapter 7 that
causal reasoning is absolutely central to science.
Yet the causal conception of explanation faces its own difficulties. First, as we also
surveyed in Chapter 7, there is no consensus about the nature of causation. So, there are
sometimes disagreements about whether a given explanation captures genuine causes.
For example, are we sure that the economic law of demand is the kind of thing that can
causally explain a decrease in taxi use? Some people respond to this concern by adopting
a very inclusive view of causation. Others think that some explanations cite causes, and
others cite other kinds of regularities, like mathematical regularities.
A second difficulty with the causal conception of explanation stems from the observa-
tion that phenomena often have many causes. For this reason, causal explanations may
come too easily. Causal explanations often cite only one or a few causal influences, when
we know there are many causal influences on the phenomenon that’s explained. How is
this enough to explain the phenomenon? Some respond to this challenge by saying that
the more causal information you can give, the better explanation you have. Others seek
another principle to decide what causal information belongs in an explanation.
A third difficulty for the causal conception of explanation results from simply pointing
out its difference from the nomological and pattern conceptions. If it is right that general
patterns and scientific laws help us understand the world, at least sometimes, then the
causal conception of explanation is lacking. This is because the causal conception doesn’t
give us a way to recognize the explanatory value of general laws or patterns.
So far, we have talked about these three conceptions of explanation as if one is right
and the others wrong. But it’s possible that each conception captures certain elements
of what helps us understand the world. One initial reason to think this might be so is
that each of these conceptions of explanation aptly characterizes some, but not all, of
the examples of successful explanation we have discussed. Perhaps laws, patterns, and
causes all can contribute to our understanding, and so any of these can be an ingredient
of explanation.

EXERCISES

8.1 First, rate your knowledge and familiarity with bicycles on a scale from 1 (‘I know little
or nothing about how bicycles work’) to 7 (‘I have a complete understanding of how
bicycles work’). Figure 8.4 is a partial sketch of a bike; you will notice that it’s missing
some parts. Try to finish the drawing, adding in your own sketch of the pedals, chain,
and the missing pieces of the frame.

[Figure 8.4: Partial sketch of a bicycle, with labels for the frame, pedals, and chain.]

Rate once again your knowledge and familiarity with bicycles on a scale from
1 (‘I know little or nothing about how bicycles work’) to 7 (‘I have a complete
understanding of how bicycles work’). Did your rating go up, down, or stay the
same? (Adapted from Lawson, 2006).
8.2 Describe the illusion of explanatory depth in your own words. Then, think through
possible explanations for this illusion. Describe the possible explanation that you
think is most promising, and say what might help you avoid the illusion of explana-
tory depth if your explanation is correct. Finally, describe how you might be able to
test that explanation to see whether it’s correct.
8.3 From your background knowledge and the information provided in this section, do
your best to answer each of the following questions.
a. Why is the sky blue?
b. Why is December cold in Sweden but warm in Australia?
c. How do earthquakes happen?
d. How does cancer kill an organism?
e. Why do objects fall when dropped?
What are the common features of the explanations you’ve given? What are
some differences, and what do you think accounts for them?
8.4 For each of your explanations in 8.3, identify what conception(s) of explanation fits it best
and say why. Then reflect on all of your answers together, and describe what you notice.
For instance, if you answered in the same way about all or most of the explanations, why
do you think that’s so? If you answered in different ways, what do you think accounted for
the difference? Is there any general form—any common features—to your explanations?
8.5 After looking back at the box on scientific laws, consider the following argument: if crite-
ria for lawfulness are necessary criteria, then something must satisfy them all to count as
a genuine law of nature. Research in psychology, biology, and other disciplines do not
satisfy all these criteria. So, there are no genuine laws in psychology, biology, and other
disciplines. But if scientific explanation is nomological, it requires genuine laws. Thus,
there are no explanations in psychology, biology, and perhaps other fields of science.
We’re pretty confident this conclusion is false, but the argument is deductively valid.
So, at least one of its premises must be false. Decide which premise you think is
mistaken and develop an argument defending your view.

8.6 Choose one conception of explanation: nomological, pattern, or causal. Find a
novel example of a scientific explanation that seems to conform to that conception.
Describe the example, making clear how an explanation is given. Then describe
why this example should be seen as conforming to the conception of explanation
you chose.
8.7 Construct a chart or table listing the strengths and weaknesses of each of the three
conceptions of explanation discussed in this section. Decide which conception(s) of
explanation is the most promising, and support your answer with an argument.
8.9 In your own words, describe why explanation is important to science.
8.10 Thinking broadly about the topics you’ve encountered in this book, describe what
you think is distinctive about scientific knowledge compared to other forms of knowl-
edge and everyday information. Then, describe what features of science and scien-
tific knowledge are similar to other forms of knowledge and everyday investigation.

8.2 THEORIZING AND THEORY CHANGE

After reading this section, you should be able to do the following:

• Describe the role of theorizing in science
• Define scientific breakthrough and give an example
• Outline Kuhn's view of the four stages of science
• Articulate two challenges for scientific knowledge from scientific breakthroughs and
  at least one response to those challenges

Scientific Theories
Consider the ground we've covered in the chapters of this book so far. In Chapter 1,
we considered what is distinctive about science. Chapter 2 focused on experiments and
other ways of testing hypotheses with observation. Chapter 3 looked at modeling, another
way of investigating hypotheses. Chapters 4–7 have all been about aspects of this same
process of subjecting hypotheses to empirical tests: deductive, inductive, and abductive
patterns of reasoning in scientific arguments; the role of statistics in representing data and
testing hypotheses; and the significance of causal hypotheses. All of this fits in some way
with the basic ingredients of recipes for science we laid out in Chapter 1: developing a
hypothesis, formulating expectations on the basis of the hypothesis, and testing expecta-
tions against observations. At the same time, there is also remarkable variation in how
science proceeds—recipes, not a single recipe—and we have tried to also give a sense for
that in how each of these topics has been addressed.
Still, recipes focused on hypotheses, expectations, and observations are not all there is
to science. We have already seen in this chapter that a central aim of science is explaining
our world. Scientists aren’t simply accumulating a list of confirmed hypotheses, the facts
we know about our world and ourselves. The project is bolder: scientists are charged with
helping us understand why and how things happen. And scientists are asked to furnish us
with tools for predicting and changing the world around us. Scientists also create scientific
theories, which are large-scale systems of ideas about natural phenomena, more general
and more elaborate units of knowledge than individual hypotheses typically are, and with
much more evidence to support them. Scientific theories thus provide bigger and more
powerful insights into the world.
Theories often go beyond what is readily observable. The Darwinian theory of evolu-
tion by natural selection is a grand theory about the origins of all the diverse life forms
on Earth, and Einstein’s theory of relativity is a grand theory about the very nature of
space and time. To be clear, empirical evidence has been central to testing and confirm-
ing both of these theories. But the content of these and other theories is usually taken
to be larger than their readily observable implications. Evolutionary theory, for example,
indicates what happened in the earliest years of life on Earth. Relativity theory tells us
what would happen if we travelled at the speed of light, and it also gives us a reason for
believing nothing but light will ever travel that fast.
In common usage, calling an idea a 'theory' sometimes indicates that it hasn't been
tested out. Scientific theories are not like that. Quite the opposite, they are important
human accomplishments, as both the Darwinian theory of evolution and Einstein’s the-
ory of relativity illustrate. Yet, because scientific theories have implications that are not
immediately observable, they are never taken to be true beyond a shadow of a doubt,
no matter how much empirical data support them. Scientists have excellent justification
for the theories of evolution, relativity, and, say, the atomic composition of matter. Even
so, the possibility is held open that someday one or another of these theories, or another
theory among our prized scientific achievements, will be replaced by a better theory. This
possibility is intrinsic to the open and self-correcting nature of science.
Just as scientific theories go beyond the readily observable, theories also come about
not simply by extrapolating, or generalizing, from observations. Instead, there is usually
a significant conceptual shift, some feat of imagination, that gives rise to a new way
of thinking about observations. Darwin wondered whether the similar forms of life he
observed across continents might not suggest they dispersed from some ancient, common
ancestor. And he was inspired by an economist, Thomas Robert Malthus (1766–1834),
who wrote about the pressures to survive created by population increases. Einstein was
inspired to reconsider the very nature of space and time by the puzzle of how to set
clocks that are far apart to exactly the same time and by the way observers' experiences
vary depending on whether they are in motion. In both cases, extensive observations
were subsequently obtained to empirically support the theories. But the initial idea was
a kind of spark of insight, a different way of thinking about what was already known
about the world.

Scientific Breakthroughs
No scientific theory is set in stone, and theories are sometimes replaced by successors. The
differences between a theory and its successor can be minor, or they can amount to radi-
cal shifts. The most significant scientific breakthroughs have been changes in worldview;
they involve comprehensive revision to how background or auxiliary assumptions, data,
and ideas are combined, and thus which scientific theory is supported.
Consider again theories of our universe and the bodies within it. The worldview
that arose with Aristotle (384–322 BCE) had great scope and logical coherence. The
Aristotelian theory of falling bodies claims that heavy bodies fall faster than light ones,
and its geocentric conception of the universe placed Earth at the center, which fit with
most common observations of how the world was. But over time, observations were made
that the Aristotelian worldview couldn’t easily accommodate. Eventually, it was replaced
by a Copernican conception of the universe, followed by a Newtonian conception, with
the Earth not a fixed center but a planet in motion around the Sun.
Because of the dramatic change in worldview, astronomers from the 4th century BCE
and the 17th century would have agreed about the positions of the stars in the sky, but
they would have radically different interpretations of those observations. Similar observa-
tions provided clues to constructing explanatory theories, but the differences between
those theories were vast. This is the Scientific Revolution of the 16th and 17th centuries,
discussed also in Chapter 1.
Additional radical shifts followed on the heels of the rejection of the Aristotelian
worldview, and with these changes came radical revisions to accepted ideas about the
position and movement of Earth, the shape of orbits, and the nature of universal forces. In
general, each new theory accounted for some body of evidence better than its predecessor.
Still, most of the changes were rather radical changes in perspective. The same is true of
the later replacement of Newtonian mechanics with Einstein’s theory of relativity, when
universal forces were replaced by non-Euclidean geometries of space-time.
Scientific breakthroughs have periodically occurred in other fields of science as well.
This is as you’d expect if scientists are truly open to revising or replacing any theory when
doing so is warranted by the available evidence. And breakthroughs seem rewarding and
significant; there’s a sense that, after a scientific breakthrough, we more clearly understand
our world. An initial spark of insight leads to a conceptual shift that reinterprets existing data
to support a new theory, and then more data are discovered that confirm this new theory.
From another perspective though, the idea of scientific breakthroughs is also troubling.
What happened to our scientific knowledge from before the breakthrough—were scien-
tists just altogether wrong? How do we know that our current best theories won’t suffer
the same fate and also be rejected for new and better theories? Can we trust our current
scientific theories at all then? These are deep and troubling questions that strike right at
the heart of science. But let’s postpone that discussion until later in this section, after we
have a fuller picture of what scientific breakthroughs are like and how and why they occur.

Kuhn’s Scientific Revolutions


The series of scientific breakthroughs in the 17th century suggests we might think about
scientific breakthroughs in general in terms of revolution. Revolutions are pretty dramatic;
think of political revolutions like the French Revolution at the end of the 18th century,
the fall of the Soviet Union two hundred years later, or the more recent Arab Spring. A
scientific revolution is a radical change in which a reigning theory is overturned in favor of a
new theory, often involving an alternative worldview. Scientific revolutions don't just change
which scientific theories are accepted; they also influence the fundamentals of science itself,
such as how to interpret evidence, which scientific procedures are accepted, and often the
social and institutional structure of science, such as who is accepted as a scientific authority.
Thomas Kuhn (1922–1996), an American physicist, and historian and philosopher of
science, wrote a famous book called The Structure of Scientific Revolutions, first published
in 1962. In this book, Kuhn advanced an influential model of scientific change based on
the notion of revolution. He suggested that scientific revolutions have occurred and will
continue to occur periodically as an important part of science. In his view, this would
prevent science from proceeding in a straight line by accumulating an increasing body of
knowledge and an expanding store of explanations. Kuhn thinks science instead proceeds
in phases. We’ll first describe these phases; then we’ll work through how they apply to
a specific scientific revolution.
Kuhn called the earliest phase of science pre-paradigmatic. This is characterized by
the existence of different schools of thought that debate very basic assumptions, including
research methods and the nature and significance of data. Data collection is unsystematic,
and it’s easy for theories to accommodate new observations because the theories are incho-
ate, or undeveloped. Such theories can easily be adapted in different ways to accommodate
new observations. There are many puzzles and problems but not very many solutions.
Kuhn’s second phase is the normalization of scientific research. One school of thought
begins to solve puzzles and problems in ways that seem successful enough to draw
adherents away from other approaches. Kuhn called this period normal science, because
widespread agreement about basic assumptions and procedures allows scientific research
to become stable. Scientific practices become organized. Laboratories or other workspaces
may be set up, experimental techniques and methods become widely accepted, and
agreed-upon measurement devices are improved.
During normal science, scientific developments are driven by adherence to what Kuhn
called a paradigm. Broadly conceived, a paradigm is just a way of practicing science. It
supplies scientists with a stock of assumptions about the world, concepts, and symbols
that they can use to more effectively communicate. It also involves methods for gathering
and analyzing data, as well as habits of scientific research and reasoning more generally.
Science students learn and come to tacitly accept the paradigm associated with a period
of normal science from textbooks. Containing little historical insight into the dynamics
of scientific change, textbooks encapsulate the tenets of the paradigm and provide
students with shared exemplars of good science.
Kuhn thought that, during a period of normal science, each field of science is governed
by a single paradigm. But scientists in the grip of some paradigm have often ended up
with observations that are at odds with the paradigm or that lead to worrying puzzles
called anomalies: deviations from expectations that resist explanation by the reigning
theory. Usually, anomalies are just noted and set aside for future research. But anomalies
can accumulate, and this creates a kind of increasing tension for the accepted scientific
theory. Scientists begin to worry that the theory might not be right after all.
Copyright © 2018. Taylor & Francis Group. All rights reserved.

The accumulation of anomalies sets science up for a crisis. A crisis occurs when
more and more scientists lose confidence in the reigning theory in the face of mounting
anomalies. For Kuhn, a paradigm is only rejected if a critical mass of anomalies has led
to crisis and there’s also a rival paradigm to replace it. Another theory has been developed
by some renegade scientists, and the problems with the existing paradigm mean that this
new theory—together with its auxiliary assumptions, methods, and so on—can finally get
attention. If this is so, a crisis might be followed by a scientific revolution. In this period
of science, all the elements of the accepted paradigm are up for negotiation. Data, inter-
pretations of data, auxiliary assumptions, methods, and technical apparatus—any and all
might be rejected, replaced, or reinterpreted from the perspective of the new paradigm.
This four-stage view of scientific change is summarized in Table 8.1.

TABLE 8.1 Thomas Kuhn's four-stage view of scientific change

Stage                          Features

1. Pre-paradigmatic science    Different schools of thought debate basic assumptions
2. Normal science              A paradigm is accepted, and research is devoted to
                               puzzle-solving
3. Crisis                      Scientists lose confidence in the reigning theory in
                               the face of anomalies
4. Revolution                  One paradigm is rejected in favor of a new one

The Chemical Revolution


The Scientific Revolution began when geocentrism was replaced with heliocentrism in
astronomy, that is, when the Earth was no longer seen as the central heavenly body but
instead taken to revolve around the Sun. According to Kuhn, this episode in the history
of science perfectly fits his description of scientific revolution. Another abrupt revolution-
ary change in science that Kuhn recognized as a scientific revolution involved sweeping
changes in the field of chemistry in the 18th century.
Two of the protagonists of the chemical revolution were the French chemists Antoine-
Laurent Lavoisier (1743–1794) and Marie-Anne Paulze Lavoisier (1758–1836). When
they began their work, scientific understanding of matter and its transformations was still
grounded in the Aristotelian worldview. Aristotle had believed that all earthly materials
are composed of the elements air, earth, fire, and water. This theory of the four elements
had been slowly modified by the medieval alchemists, who aimed to turn base metals into
gold and to produce an elixir of immortality. By the 18th century, alchemists believed all
things were compounds of sulfur, mercury, and salt.
In the early 18th century, one pressing scientific question was: what happens when
something burns? Alchemists thought that when materials changed into slag, rust, or ash
by heating, they lost sulfur. The German medical physician and chemist Georg Ernst Stahl
(1659–1734) modified this idea, developing the theory that every combustible material
contains a universal fire-like substance, which he named phlogiston (from Greek, meaning
flammable). Combustible materials, like wood, tend to lose weight when burned, and
Stahl explained this change in terms of the release of phlogiston from the combustible
material to the air. When the air becomes saturated with phlogiston or when a combus-
tible material releases all its phlogiston, the burning stops. Stahl believed that the residual
substance left behind after a metal burns is the true substance of the original metal,
which lost its phlogiston during combustion. This residue, which was called metal calx
(what we now know to be an oxide), has the form of a fine powder. Both metal calx and
the gases produced during combustion could be captured, measured, and experimentally
manipulated.

Unlike gases and calx though, phlogiston was an utter mystery. Nobody had isolated it,
and nobody had found a way to experimentally manipulate it. In fact, phlogiston seemed
to have properties that were inconsistent with Stahl’s theory. Stahl believed phlogiston
had a positive weight. When you burn a piece of wood, the remaining ash loses phlogiston,
and it weighs less than the original log. But in other cases, for example, when magnesium
or phosphorus burn, the residue left behind weighs more than the original material. If
phlogiston was released during the burning process, why was there a gain in weight in
these cases? This is an anomaly.
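A modern back-of-the-envelope calculation shows how stark the anomaly is. Using
standard molar masses (the sketch is ours, not Stahl's arithmetic), the residue from
burning magnesium must weigh more than the metal, since the calx is the metal plus
oxygen taken from the air:

    # Burning magnesium: 2 Mg + O2 -> 2 MgO. The calx gains the mass of the
    # absorbed oxygen, the opposite of what a released, positive-weight
    # phlogiston would predict.
    MG = 24.305   # molar mass of magnesium, g/mol
    O = 15.999    # molar mass of oxygen, g/mol

    print(MG)       # 24.305 g of metal before burning (one mole)
    print(MG + O)   # 40.304 g of calx after burning: heavier, not lighter
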
Intrigued by this, the Lavoisiers experimented with a variety of metals and gases to inves-
tigate why and how things burn. They observed that lead calx releases air when it is heated.
This suggested that combustion and air were, somehow, linked. Explaining the link was a
difficult task, however, because at that time, little was known about the composition and
chemistry of air. Meeting the English theologian and polymath Joseph Priestley (1733–1804)
helped. Priestley had discovered a gas he called dephlogisticated air, which was released by
heated mercury calx. This gas was thought to greatly facilitate combustion because, being
free from phlogiston, it could absorb a greater amount of the phlogiston released by burning
materials. Candles burning in a container with dephlogisticated air would burn for much
longer, for example. This gas, Priestley observed, facilitated respiration too: mice in contain-
ers with dephlogisticated air lived longer than mice placed in containers without this gas.
The Lavoisiers tried to replicate Priestley’s experiments, and based on their own results
and observations, they elaborated a new theory of combustion. The central idea was that
combustion was the reaction of a metal or other material with the ‘eminently respirable’
part of air. Believing (incorrectly) that this kind of air was necessary to form all sour-
tasting substances, or acids, Lavoisier called it oxygène (from the two Greek words for
acid generator). According to this new theory, combustion did not involve the removal
of phlogiston from the burning material, but rather, the addition of oxygen.
This emerging rival paradigm set the basis for the revolution from which modern
chemistry emerged. In the 1780s, the Lavoisiers and other scientists adopted the idea of
a chemical element and of chemical compounds composed of simpler elements. This new system
of chemistry was set out by Antoine-Laurent Lavoisier in a textbook in 1789. As Kuhn
would expect, this book didn’t just describe the theory but also the other elements of
a paradigm. The book explained the effects of heat on chemical reactions, the nature

[Figure 8.5: Scientists of the chemical revolution.]

of gases, and how acids and bases react to form salts. It also described the technological
instruments used to perform chemical experiments. And it contained a 'Table of Simple
Substances’—the first modern listing of chemical elements.
After the publication of this textbook, most young chemists adopted Lavoisier’s theory
and abandoned phlogiston. ‘All young chemists’—Lavoisier wrote in 1791—‘adopt the the-
ory, and from that I conclude that the revolution in chemistry has come to pass’ (Donovan,
1993, p. 185). From a Kuhnian perspective, the next phase of normal science had begun.

Non-Revolutionary Scientific Change


Kuhn’s notion of scientific revolution seems to accurately characterize some episodes in
the history of science—the times of especially radical transformation in accepted scientific
knowledge. But this is a particularly extreme form of scientific change. Other episodes of
scientific change don’t seem to be so dramatic, and there’s also a question about whether
ordinary scientific activity fits Kuhn’s characterization of normal science.
It appears that small, incremental changes in science are far more common and less
abrupt than Kuhn’s account suggests. Consider, for example, the Darwinian revolution
in the 19th century. Darwin’s theory of evolution has had a dramatic and lasting impact
on our understanding of the nature of life forms, the relationships among different spe-
cies, and how species have changed over time. The Darwinian revolution was arguably a
scientific revolution. But Darwin’s theory was not the first evolutionary theory; nor has
evolutionary theory remained exactly the same as what Darwin first described. Changes
in the field of biology, both before and after Darwin’s revolutionary breakthrough, have
been more gradual than Kuhn’s ideas would suggest.
The general idea of evolution is that whole species—not just individuals—can change
over time, and this idea is many centuries old. The nature of biological change as a
scientific research program can be traced to the work of French, English, and Scottish
naturalists over a half century before the publication of Darwin’s Origin of Species in
1859. Even Darwin’s specific ideas about evolution were significantly influenced by other
scientific work; earlier, we mentioned the influence of the political economist Thomas
Robert Malthus. And another scientist working at the same time as Darwin, Alfred Russel
Wallace, was independently developing a theory of evolution by natural selection strik-
ingly similar to Darwin’s. So, although Darwin’s ideas were a tremendous breakthrough,
they did build upon existing scientific work, and they were inspired by and related to
concurrent scientific work by others.

Furthermore, the science of biology since the Darwinian revolution has not simply
consisted in the application of Darwin’s ideas, as Kuhn would have us expect for a
period of normal science. Rather, our understanding of evolution has been in continual
development. The so-called Modern Synthesis in the early 20th century integrated the
existing knowledge of genetics and Darwinian evolution, which had previously been seen
as competing theories. Other elements of evolutionary theory have been revised since,
like the recognition of non-genetic influences on traits and how significantly organisms
shape their environment, thereby affecting how natural selection acts on themselves and
nearby organisms of other species.
Another point in support of non-revolutionary scientific change is that theory change
doesn’t always involve the rejection of existing theories. Sometimes, it comes from the
joining of theories, as in the Modern Synthesis, and other times, it can come from new
methods. American biologist James Watson and English physicist Francis Crick, for exam-
ple, reached their groundbreaking conclusion that the DNA molecule exists in the form
of a double helix by applying a new modeling approach to data that had been gathered
by Rosalind Franklin. Using cardboard cutouts to represent the chemical components of
DNA, Watson and Crick tried to make different arrangements, as though they were solv-
ing a jigsaw puzzle. Through this concrete model-building, the double-helical structure of
DNA was identified. This had enormous consequences for subsequent biological research.
Mathematics and even philosophy can drive scientific theory change too. The develop-
ment of a new kind of geometry, non-Euclidean Riemannian geometry, paved the way for
Einstein’s theory of relativity. Einstein’s theory adopted this geometry as a description of
physical space-time. One basic difference between Euclidean and non-Euclidean geom-
etry concerns the nature of parallel lines. In Euclidean geometry, there’s only one line
through a given point that is parallel to another line. In some non-Euclidean geometries,
there are infinitely many lines through a point that are parallel to another line, and in
others, there are no parallel lines. This development in mathematics made it possible
for Einstein to wonder whether the geometry of our own universe might actually be
non-Euclidean.
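The contrast can be stated compactly. For a line ℓ and a point P not on ℓ, the three
geometries disagree about how many lines through P never meet ℓ (our summary, in
standard notation):

    \text{Given a line } \ell \text{ and a point } P \notin \ell:\quad
    \#\{\text{lines through } P \text{ parallel to } \ell\} =
    \begin{cases}
    1 & \text{Euclidean geometry} \\
    \infty & \text{hyperbolic geometry} \\
    0 & \text{elliptic geometry}
    \end{cases}
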

Scientific Progress
Earlier, we raised worries about how scientific breakthroughs may undermine our con-
fidence in scientific theories. If some well-confirmed theories were eventually rejected,
who’s to say our current theories won’t also be rejected? And do such scientific break-
throughs make it so that science isn’t really making progress at all? Let’s consider these
questions in a bit more depth and isolate a few important considerations. But we’ve
entered deeper philosophical waters now, and this discussion won’t be decisive. There
are lots of interesting questions here about the nature and significance of science, even if
science is unquestionably our best way to gain knowledge about our world.
When scientific theories change, do we have reason to think that the new theory is an
improvement on the last one and that science is progressing toward truth? This question
is complicated by two features of theories and theory-change. First, theories often appeal
to phenomena that cannot be directly observed. Examples we have encountered in this
book include the Higgs boson, the first moments of the universe’s existence after the big
bang, and the original common ancestor of all life on earth. How can we ever be sure we
are right about these and other phenomena like them? Second, at least some instances of
theory-change have been radical: scientists rejected phlogiston, decided they were wrong
about the placement of Earth in the universe, and much more recently decided Pluto
wasn’t a planet after all. How can we ever be sure that our scientific findings are on a
path to truth, when the next radical revision could be right around the corner?
There’s at least one influential argument suggesting that, despite all this, we have reason
to believe that our best scientific theories are true. This argument—sometimes called the
no miracles argument—is an abductive inference, or inference to the best explanation,
from the success of science. It begins with the observation that our best scientific theories
are extraordinarily successful; they enable scientists to make empirical predictions, to
explain phenomena, to design and build powerful technologies. What could explain this
success? One possible explanation is that our best scientific theories are approximately
true. And if these theories were not approximately true, then the fact that they are so
successful would be astonishing. So, it seems, the best explanation for the success of sci-
ence is that our best theories are true—or at least on the path to truth and getting closer.
Yet some believe that this conclusion is overly optimistic. Here’s an inductive argument
for the opposite conclusion. If we examine the history of scientific theories in any given
field, we find a regular turnover of older theories rejected in favor of newer ones. So most
past theories have turned out to be false. Generalizing from these cases, most scientific
theories are false—and it seems this would include our current theories too. This suggests our
current theories stand a good chance of eventually being replaced and regarded as false.
The upshot of this argument—sometimes called the pessimistic meta-induction—is that we
do not have a strong reason to think our current best scientific theories are true.
This argument raises questions about how certain we can be about our current scientific
theories. But we want to emphasize that science is the single most successful project for
generating knowledge that humans have ever embarked on. Science as a set of methods
for investigating our world has persisted for centuries and is unlikely to be surpassed,
even if individual scientific theories are sometimes abandoned.

EXERCISES

8.11 Write a list of the primary features of scientific theories based on the discussion
from early in this section. How do theories differ from hypotheses and laws? What
features do they all share?
8.12 What do scientific theories add to science, beyond the processes of hypothesis-
testing we have mostly focused on in this book? How does theorizing relate to
hypothesis-testing?
8.13 Look back at the Higgs boson discovery discussed in Chapter 6. This discovery was
additional confirmation of the so-called Standard Model in physics. Investigate this
theory, then answer the following questions about it as best you can.
a. What is the theory a theory of—that is, what phenomena is it supposed to be about?
b. What are the central concepts featuring in the theory?
c. Are some claims made about things that we can’t directly observe? What kinds
of things?
d. What do scientists explain, predict, and describe with the theory?
e. What are some of the considerations that sparked the development of the theory?
f. Has the theory undergone any changes over time? Which one(s) and why?
8.14 Describe the features of each of Kuhn’s expected stages in your own words: pre-
paradigm science, normal science, crisis, and scientific revolution. Illustrate each
stage by describing how it applies to the chemical revolution.
8.15 Consider again the Copernican revolution, chemical revolution, and the Darwinian
revolution. Evaluate the merits of Kuhn’s view of scientific change. What do you think
are strengths of his view? Do you think there are any weaknesses or ways it is limited?
Support your points by referencing these episodes of scientific change. (You’re also
welcome to appeal to other scientific breakthroughs discussed earlier in this book.)
8.16 How does the existence of scientific breakthroughs, or revolutions, challenge the ideas
of scientific truth and scientific progress? Motivate the concern as well as you can.
Then, evaluate the merits of the concern, thinking back to all you’ve read in this book.
8.17 Think of the case of Lavoisier and the chemical revolution and the case of Darwin’s
evolutionary theory. In each of these episodes, in what sense did the breakthrough
represent progress? In what ways did chemistry and biology improve? More gener-
ally, what standards do you think we should use to identify progress and advances
in science?

8.3 SCIENCE, SOCIETY, AND VALUES

After reading this section, you should be able to do the following:

• Describe three examples of how science has been influenced by its social and his-
torical context
• Articulate how exclusion and marginalization based on race and ethnicity, national-
ity, gender, sexuality, and other factors have negatively influenced both society and
science
• Define the value-free ideal for science and give an example of when values have
influenced science in a problematic way
• List five ways in which values influence science in legitimate ways and give an ex-
ample of each
• Characterize the main contemporary challenges to science’s trust and objectivity

Science in a Social Context


Because scientific reasoning is a fundamentally human endeavor, it always occurs in socio-
historical contexts. Scientists make observations, elaborate theories, make discoveries,
and interact with one another all within specific interpersonal, institutional, and cultural
circumstances. Science is embedded in institutional structures like universities, laborato-
ries, hospitals, museums, journals, and publishing companies. As we stressed in Chapter
1, science is also a social practice, involving different people variously collaborating and
competing.
The social and historical context of scientific activity significantly influences the nature
of science. Even as science aspires to produce knowledge that is not limited by a specific
perspective, scientific theories are also creatures of the times, places, and people who cre-
ated them. Recall how Darwin’s ideas about evolution were influenced by the economist
Malthus, for instance. Some have also suggested that how Darwinian evolutionary theory
dealt with sexual reproduction and the differences between male and female animals
was strongly influenced by Victorian moral sentiment (Knight, 2002). Darwin took it for
granted that, throughout the animal kingdom, male animals are promiscuous and aggres-
sive and female animals are ‘coy’ and selective. This looks suspiciously like human gender
norms—in Darwin’s Victorian England and, to some extent, in many cultures today. While
Darwinian evolutionary theory was certainly a tremendous step forward for biology, it
was also influenced by the time and place of its creation and perhaps by features of the
individuals who created it.
So, science seems to be shaped by its social and historical context. Science is also
regularly used to promote particular social aims. The difficult truth is that, throughout
history, science has regularly been used to promote objectionable social aims and, at times,
has even been pursued in ways that incorporate morally repugnant social views. Science
has been used to expand power over others, to invent nuclear and chemical weapons for
the purpose of mass casualties, and to amass wealth for the few, as with research for the
fossil fuel industry. Science has also been used to promote misinformation, as when the
Ethyl Corporation paid Robert Kehoe to vouch for the safety of lead in gasoline (recall
from Chapter 1) or when tobacco corporations paid scientists to present cancer research
in a way calculated to mislead the public. Science has also directly abused people from
marginalized groups, as when the Nazis ran deeply cruel experiments on the prisoners of
concentration camps and when the US Public Health Service ran the Tuskegee Syphilis
Experiment. In this clinical study, researchers withheld treatment from 399 impoverished,
rural African-American men who had syphilis. They never informed the participants that
they had syphilis or that there was a cure for the disease.
In this section, we take a hard look at the relationship between science and society.
We will consider how the participants in science and the social context of science influ-
ence the development of science. We will investigate the roles moral and political values
should and should not play in science. And we will also raise some of the most pressing
challenges to science and scientific authority in the contemporary world.

Participation in Science
Let’s first explore the idea that the traits of scientists might influence the nature of the
scientific endeavor itself. Here’s one way in which this seems to be so. A negative social
influence on science is the exclusion or marginalization of individuals from the scientific
community because of their gender, race and ethnicity, sexuality, or social and cultural
background. The English polymath Alan Turing (1912–1954) did groundbreaking research
in computer science, formal logic, mathematics, cryptography, and morphogenesis. During
World War II, he helped crack the code used by the Nazis to protect their military com-
munication, an achievement that many historians believe was the single greatest contri-
bution to the Allied victory. Turing was also a visionary of artificial intelligence. You may
have heard of the Turing machine and Turing test, which he invented; he anticipated that
human intelligence would one day be matched by machines. Turing was also gay, and at
the time, this was illegal in Britain. Despite his groundbreaking scientific contributions,
Turing was arrested and chemically castrated by the British government. Humiliated and
resentful, he killed himself at the age of 41.
Being outed as gay in the mid-20th century UK was awful; matters were also dark
for women in science for most of history. British-American astronomer Cecilia Payne-
Gaposchkin’s dissertation Stellar Atmospheres in 1925 became a cornerstone of modern
astrophysics, and for this, she was rewarded with low-paying adjunct teaching work for
the next 20 years. Rosalind Franklin (1920–1958) was an English chemist and x-ray crys-
tallographer, whom we mentioned earlier for her important contributions to the under-
standing of DNA’s structure. Among other contributions, Franklin was responsible for an
x-ray diffraction image that was shared with Watson and Crick without her knowledge
(pictured in Figure 8.6b).

FIGURE 8.6 (a) Rosalind Franklin; (b) Franklin’s x-ray diffraction image that famously inspired
Watson and Crick’s double-helix model of DNA

After seeing that image, Watson and Crick developed their
physical model of DNA. They went down in history as having discovered DNA’s double
helix structure, eventually winning the Nobel Prize for this work. In contrast, Franklin’s
enormous contributions to the discovery were recognized only after her death. A similar
story is that of British neuroscientist Kathleen Montagu (est. 1877–1966), who published
her discovery of the neurochemical dopamine in the human brain in 1957. Her work
was largely overshadowed by a very similar discovery three months later by Swedish
neuropharmacologist and Nobel Prize winner Arvid Carlsson and colleagues.
This is a common enough phenomenon that it has been named. The Matilda effect
is the bias against recognizing women scientists’ achievements. These women’s work is
often uncredited or else attributed to their male colleagues instead. Societal prejudice has
made it more difficult for not only women but also racial and ethnic minorities, people
from developing nations, and other marginalized groups, to be supported in their scientific
work and even to become scientists in the first place.
When only certain kinds of people participate in science, the value of science is lower.
For one thing, science squanders or loses out entirely on the contributions of people who
couldn’t become scientists or who were marginalized in their participation in science.
A second way in which science suffers is that there are fewer role models for aspiring
scientists. When scientists like Turing or Franklin are dishonored or not acknowledged
for their research breakthroughs, they are not available for younger people to look up to.
When groups of people are systematically underrepresented and marginalized in science,
young people may not see themselves as potential participants in science. Third, who
participates in science and who is excluded can even affect the nature of scientific knowl-
edge. The variable features of scientists—nationality, gender, socioeconomic background,
race, religious belief, political affiliation—all help determine what questions scientists are
interested in investigating, what bold conjectures they come up with, and perhaps also
which experimental or modeling approaches they use. When scientists are diverse, all of
the differences among them contribute to the range of questions, richness of ideas, and
ultimately the success of science. If instead only certain kinds of people dominate science,
like middle-class and wealthy white men from developed countries, the kinds of questions
posed and ideas generated may reflect the limitations of the scientists.
As an illustration of how diversity contributes to scientific success, consider Temple
Grandin, an American researcher in animal behavior. Grandin explicitly brought her
experiences as an autistic woman to bear in ways that led to dramatic improvements
in the efficiency and ethical treatment of animals in slaughterhouses. Another example
is Barbara McClintock (1902–1992), who significantly advanced our understanding of
genetic inheritance by discovering ‘jumping genes’, or transposons. These are segments of
DNA that can move to new positions within a chromosome or between chromosomes.
McClintock’s great insight arose from a simple decision
about how to study genes, and it has been suggested that this decision was motivated
in part by her outsider status in science and her identity as a woman (Keller, 1983).
McClintock departed from the well-established practice of focusing on fruit fly genes
by instead studying the genes of corn, or maize. Fruit flies are targeted in such studies
because they are genetically simple, with only four chromosome pairs. Maize, in contrast,
has ten chromosome pairs. This added complexity makes maize more difficult to study,
and McClintock was criticized for her decision. But this also made it possible to observe
jumping genes in action.
The issue is more than just who is recognized for what discovery, how breakthroughs
are received, and who gets to be a scientist. Different people bring different styles of
reasoning to the table, and scientific progress often demands creativity and seeing things
anew. For these reasons, the inclusion of diverse people—with different personalities and
backgrounds—in science doesn’t just benefit those individuals and society; it also makes
science itself more successful.

Values and the Value-Free Ideal


In Chapter 1, we emphasized how science has been developed to control for and over-
come human limitations in reasoning. One feature of this is that science is supposed to
provide a way to subject our pre-existing ideas to rigorous testing. Wanting something
to be true isn’t good grounds for believing it is true, and science provides methods of
hypothesis-testing, reasoning, and theorizing to evaluate the grounds for our beliefs.
This suggests what has been called the value-free ideal for science, which refers to the
idea that science should be free from the influence of our values—such as moral, social,
or political beliefs. Scientific theories and hypotheses should be accepted only when
evidence and reasoning favor them, not just because we want them to be true. This is an
ideal because it’s not something we think actually always happens. Science has a spotty
track record in this regard: it’s too often been used to further racist or sexist views, in
support of one nation’s dominion over others, or to contribute to corporate profits. But
the value-free ideal suggests that this shouldn’t happen, that any science influenced by
moral or political views is bad science. Ideally, science will be governed by evidence and
reasoning and not by the values of scientists, their funding sources, or societal trends.
Because the desirability of scientific theories—whether we want them to be true—
makes no difference as to whether they are true, the desirability of a theory also shouldn’t
make a difference as to whether the theory is accepted as true. Questions about the
reality of climate change, about evolutionary changes in species, about the safety of GMO
foods, about differences between male and female brains, and so forth may evoke
emotional reactions from people, but those emotions aren’t part of answering these ques-
tions. These questions can only be answered by conducting experiments or observational
studies, constructing models, evaluating evidence, and applying statistical reasoning—by
applying the recipes for science.
Still, there are some challenges to the value-free ideal for science. First, science’s spotty
track record in achieving freedom from the influence of political and economic values
might inspire skepticism of the value-free ideal. Careful historical studies of episodes in
science reveal the influence of values. We mentioned how Darwin’s evolutionary theory
encodes Victorian morality and how science has sometimes directly abused marginalized
people. In Chapter 5, we mentioned Galton’s work on eugenics, which was just the tip
of the iceberg of racist and sexist uses of science aimed to affirm white male superiority
over others. If the ideal science is value-free, it’s unclear that science has come close to
that ideal.
Second, even if we just think about what we want science to be like—the ideal—it’s
unclear that values should be entirely absent. Is discovering a vaccine for the Zika virus,
which is easily transmitted to humans by mosquito bites and leads to serious birth defects
when pregnant women are infected, more important than discovering new facts about
quasars, pulsars, supernovas, or other astral phenomena? If you think so, this reflects
a value you hold. If you think Zika research should be prioritized over astronomical
research, then you think this value should influence science. You probably agree that the
US Public Health Service should not have withheld syphilis treatment from 399 impov-
erished African-American men without their knowledge. This too reflects a wish for people’s
shared values to influence science.
The value-free ideal suggests that science is simply a source of objective facts about
the world, immune to influence by human values. On the other extreme, some believe
science only serves pre-existing values. The right view of how values should influence
science is somewhere between these extremes.
Scientists are human beings with different moral, political, and religious values, and
we suggested just a moment ago that the social context of science and who partici-
pates in science can both influence scientific findings. The recipes for science are designed
to limit the kinds of influence social and individual values have on science. Yet science
is not, and should not be, entirely value-free: there are ways in which our values
should influence science, and there are ways in which science should limit or eliminate
the influence of values.

How Values Shape Science


Scientists, in their capacity as scientists, cannot avoid making value judgments. Values guide
scientists’ judgments about what types of research to pursue, as well as which studies to
perform and which to put on the back burner. Funding agencies also regularly employ val-
ues in determining which research to support. Guidelines for experiments on animals and
humans are grounded in ethical values, and these kinds of experiments are carefully regu-
lated. And values also influence how scientific results are communicated and employed.
None of these roles for values seems to interfere with science’s objectivity. We might
think of these, then, as legitimate influences of values on science. Some other ways in
which values sometimes have influenced science seem to be illegitimate. This includes,
for example, endorsing a scientific theory not because evidence and reasoning support it,
but just because you want it to be true. It also includes doctoring experimental results
so the data sets support a hypothesis that you want to believe is true. Reflecting on these
legitimate and illegitimate influences of values suggests a dividing line. When scientists’
values lead them to violate the recipes for science—acceptable approaches to data col-
lection and modeling, to hypothesis-testing and abductive reasoning, to statistical analysis,
to causal reasoning, and so on—this is illegitimate. When scientists use values to
supplement or guide the use of the recipes for science, this can be legitimate.
In his book A Tapestry of Values, philosopher Kevin Elliott (2017) divides the legitimate
roles values can play in science into five categories, as helping to answer five different
questions. These questions are summarized in Table 8.2. Answers to these questions are
needed in order to know which scientific methods to employ, on which phenomena, and
to which end. These roles for values thus align with our suggestion that legitimate uses
of values supplement or guide the methods of science but do not violate those methods.
To begin, scientists’ individual values and societies’ collectively held values help answer
the first question about what to study. Individually, a researcher’s interests and values
surely shape what field of science she pursues, which lab she works in, and what problems
she tackles. Collectively, we choose what kinds of scientific research to support when
funding agencies, including tax-supported governmental agencies, designate the areas of
research they fund and which specific research projects in those areas to fund.
Beyond what to study, values also inform decisions about how phenomena should be
studied. Scientists can bring different questions, methods, and auxiliary assumptions to
bear on any given topic, and these choices in how research is pursued reflect research-
ers’ and society’s values. One instance of this influence is how the initial hypothesis and
assumptions about the causal background both guide experimental design, including the
nature of the intervention and which extraneous variables to control.

TABLE 8.2 Five questions that arise when doing science that our values help us answer (Elliott, 2017)

Question                                    Example
1. What should we study?                    What kind of research is prioritized for funding
2. How should we study it?                  How the initial hypothesis and assumptions about the causal
                                            background both guide experimental design
3. What are we trying to accomplish?        Getting the most accurate information versus accurate-enough
                                            information quickly enough to guide policy
4. What if we are uncertain?                How much evidence to require before accepting or rejecting a
                                            given hypothesis
5. How should we talk about the results?    The level of certainty conveyed to the public about some
                                            scientific finding

As with the initial
choice of what to study, funding agencies can also influence how phenomena are studied.
For example, research into depression can focus on the efficacy of cognitive therapy; the
role of sleep, diet, and exercise; or the efficacy of pharmaceutical intervention. The choice
to strongly prioritize pharmaceutical intervention to the exclusion of other focuses reflects
the outsized influence drug companies have had on medical science (Elliott, 2017).
The third question that a focus on values helps to answer is what, exactly, scientists
are trying to accomplish in studying some phenomenon. This is an even more fine-
grained decision than just what to study and how to study it. In climate research, for
example, scientists might prioritize getting information about climate trends available
quickly so that it can guide policy, or they might prioritize getting information that is as
accurate as possible, no matter how long it takes. This decision about the aim of research
is influenced by values, including views about the social role the scientific research is
expected to have.
Fourth, values influence how scientists proceed in the face of uncertainty. Scientific
results are never free from uncertainty. There’s the basic problem of measurement error.
We’ve also seen that, if observations don’t match expectations, it could be the fault of the
hypothesis, or it could be the fault of some auxiliary hypothesis. In an experiment, no mat-
ter how perfectly controlled, there’s always the chance that an unexpected confounding
variable has interfered with the finding. In statistical hypothesis-testing, scientists choose
whether to reject the null hypothesis, and either choice could be wrong. These are just a
few examples of the unavoidable uncertainty in science and the need to choose how to
proceed in the face of uncertainty.
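To see how this choice plays out, consider a minimal simulation sketch (our illustration, not the book’s; the numbers and names are made up). A test of whether a coin is fair must fix an evidential threshold, and that threshold trades the two kinds of error against each other; which balance is acceptable is exactly the kind of value-laden decision at issue:

```python
import random

random.seed(42)

def sample_mean(p, n):
    """Mean of n coin flips that each land heads with probability p."""
    return sum(random.random() < p for _ in range(n)) / n

def reject_fair(mean, n, threshold):
    """Reject the null ('the coin is fair') when the observed mean is more
    than `threshold` standard errors away from 0.5."""
    se = 0.5 / n ** 0.5  # standard error under the null hypothesis p = 0.5
    return abs(mean - 0.5) > threshold * se

def error_rates(threshold, n=100, trials=10_000):
    # Type I error: the coin is fair, but we reject the null anyway.
    type1 = sum(reject_fair(sample_mean(0.5, n), n, threshold)
                for _ in range(trials)) / trials
    # Type II error: the coin is biased (p = 0.6), but we fail to reject.
    type2 = sum(not reject_fair(sample_mean(0.6, n), n, threshold)
                for _ in range(trials)) / trials
    return type1, type2

for t in (1.0, 1.96, 3.0):  # lax, conventional (roughly 5%), and strict bars
    t1, t2 = error_rates(t)
    print(f"threshold {t:4}: type I ~ {t1:.3f}, type II ~ {t2:.3f}")
```

Raising the bar makes false alarms rarer but missed effects more common; no statistical rule dictates where to set it, since that depends on the relative costs of the two errors.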
These kinds of uncertainty are all forms of underdetermination. Recall from Chapter 2
that underdetermination is when the evidence available to scientists is insufficient to
determine which of multiple theories or hypotheses is true. Some believe that there is
even permanent underdetermination in science: that there will never be enough evidence
to conclusively decide in favor of one hypothesis or theory and against all possible alter-
natives. When scientists face underdetermination, they must choose what to believe or
whether to suspend judgment.
Because of this unavoidable uncertainty, scientists must decide how much evidence to
require before endorsing a theory or hypothesis (or before rejecting a theory or hypoth-
esis). Safety is very important to us, whether for drinking water or medications, so toxi-
cology tests must have a high bar for success. There is a lower bar for deciding whether
a new drug is more effective than an already available drug. Scientists also must decide
how to represent scientific uncertainty to the public. In 1988, climate scientist James
Hansen declared that climate change, in the form of global warming, was occurring. He described that
as a decision based on weighing ‘the costs of being wrong to the costs of not talking’
(Weart, 2014, referenced in Elliott, 2017). There was already enough evidence for Hansen
to be relatively confident in his choice to speak up. Decades later, of course, there is now
incontrovertible evidence of climate change.
This introduces a fifth category of values’ legitimate influence on science regarding how
scientists—and journalists and others who communicate scientific findings to the broader
public—should talk about those findings. As Elliott stresses, this isn’t just a decision to be
accurate. Scientific findings also can be discussed in their relationship to previous findings,
their potential social effects, or—picking up on the previous point—their level of certainty.
This framing influences whether and how the public will engage with research, and this
is a choice not dictated by scientific methods but by scientists’ and society’s priorities.
So, what scientists should study, how they should study it, what they aim to accomplish,
how much evidential support should be required, and how scientists should communicate
their results all depend on moral considerations—on values. These are legitimate influ-
ences of values on science. Recognizing these roles for values in science is crucial. This
enables us, as a society, to critically assess what values are employed at each of these
junctures. The influence of values on science can be problematic or even nefarious if the
wrong values are employed at any one of these stages. Figuring out the right and wrong
values is tricky, and it is not a matter for scientists alone to decide. Instead, this is an issue
that needs to be engaged with broadly in our society.
Examples of problematic values influencing science are, unfortunately, very easy to
come by. Here’s one. In 2017, US President Donald Trump proposed that NASA resources
should be dedicated to exploring the solar system instead of to climate change research.
This research priority—a decision about what to study—is a reflection of values endorsed
by a small but vocal contingent of the Republican Party. Choosing not to fund climate
change research amounts to deciding that knowledge about the rate and impact of climate
change is relatively unimportant. But because climate change is already having disastrous
effects on populations, the environment, and economies across the world, and because
the long-term costs of ignoring it will be disastrous, this decision was arguably the wrong
decision on moral grounds. Pulling funding from NASA’s Earth science division in order
to avoid investigating climate change and its effects upheld the wrong values. (Notice this
doesn’t mean that space exploration is unimportant! It too should be funded.)
Other examples of problematic values influencing science through the proper chan-
nels include the outsized influence the pharmaceutical industry has on medical research,
the continuing exploitation of at-risk communities due to the approaches used to study
them, and powerful corporations controlling what messages the public gets about climate
change and the risks of fossil fuel extraction. We’ll engage with some of these problems
in the next section.
We have suggested that science doesn’t have to be free from values to be trustworthy
and objective. What matters is that values influence science in the right ways and that
science effectively resists the problematic influence of values. Values, even good values,
shouldn’t play the wrong roles in science; we should never decide a theory is true simply
because we wish it were true. Further, the wrong values shouldn’t influence science, even
through the proper channels. To better understand how science earns its trust and objec-
tivity, it’s important to acknowledge the many roles of sociopolitical and moral values
in scientific reasoning and to critically examine the values that influence our science. By
doing these things, we can clarify what values should influence science and in what ways.

Trust and Objectivity: Challenges Facing Science


Science achieves objectivity and is worthy of trust based on its characteristics outlined
in Chapter 1—especially its capacity for self-correction—and the ways in which these
characteristics play out in the methods described in the rest of the book. The capacity
for self-correction requires scientists’ openness to criticism and dissent, their sincere and
transparent communication of their results and uncertainties, and scientific communities’
welcoming of diverse perspectives. Objectivity in science can occur when scientists’ judg-
ments are critically and openly assessed in light of other data and investigations, as well
as competing interpretations and alternative possibilities.
However, this intersubjective process, and thus science’s capacity for self-correction,
faces significant challenges. Some of the most significant challenges relate to the incentive
structure in science and how it shapes scientific findings in ways that undermine trust
and objectivity. Facing up to these challenges requires us to think carefully about the
scientific process, the role of incentives in shaping it, and what values are thus finding
their way into science.
Let’s back up. What is science’s incentive structure, and how does it create challenges
for trust and objectivity? As we have seen, science is a social practice that occurs in insti-
tutions like universities and national research centers. Scientists are professionals who get
paid for teaching and for doing research. But university salaries are, in most cases, not
enough to fund scientific research. Scientists need extra money to pay for scientific instru-
ments and lab equipment, for experimental participants, and for their assistants. This extra
money is generally awarded by public agencies like the ERC (European Research Council)
in Europe and the NSF (National Science Foundation) and NIH (National Institutes of
Health) in the US. The competition is fierce; every year, the number of applicants for
funding grows, while, partly due to budget cuts, the number of available awards shrinks.
Scientists’ ability to secure grants determines their career prospects. And their chance
of securing grants depends on the quantity and quality of their publications, the frequency
of citations of their work, previously awarded grants, and the public attention they are able to attract.
‘Publish or perish’ is the phrase coined to capture the increasing pressure in science to
rapidly and continually publish work in order to sustain one’s career. The competition for
space in prestigious journals is also fierce; many have rejection rates greater than 90%.
Because scientific production has increased dramatically over the years, journal editors
usually prefer to publish novel results that support an exciting hypothesis rather than
very robust and well-documented negative results.
Consequently, scientists have a strong incentive to produce surprising, positive results.
Other types of scientific research are harder to place in top journals. These include the
negative result that a hypothesis wasn’t supported by the evidence, studies that replicate
or assess previously published results, and preliminary, exploratory investigations that
are not decisive. The tendency to reward only one form of scientific finding is called
publication bias. This is common across all scientific journals but especially strong in
the most prestigious journals. Publication bias, coupled with the scarcity of resources and
employment opportunities, generates a challenge to science’s capacity for self-correction
(Ioannidis, 2005).
For one thing, openness to criticism depends on researchers attempting to replicate
existing studies to see if the evidence holds up. Replication of previously published
studies can increase the credibility of scientific claims when the supporting evidence for
these claims is reproduced. When the supporting evidence is not reproduced, replica-
tion can instead foster innovation and can improve previous experimental designs and
data analysis. Since publication bias works against replication studies, scientists have
little incentive to perform them. This undermines self-correction and the accumulation
of trustworthy knowledge across science and can have negative social consequences in
some fields.
The literature on associations between specific foods and cancer risk, for example,
may be seriously distorted. Statistically significant associations with cancer risk have been
claimed for most food ingredients, from beef to tea. Careful analysis of this literature
highlights that many published studies report implausibly large effects, even when the
actual evidence is weak and effect sizes small (Schoenfeld & Ioannidis, 2013). Dissent
and work toward replicability would improve the reliability and validity of claims about
the role of food in cancer risk.
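The inflation mechanism is easy to demonstrate in a few lines. Here is a toy simulation sketch (ours, with made-up numbers; it is not an analysis from Schoenfeld and Ioannidis): when only statistically significant estimates are published, the published record overstates a small true effect several times over:

```python
import random
import statistics

random.seed(0)

def study(true_effect, n=20, sigma=1.0):
    """Run one study: estimate an effect from n noisy observations.
    Returns (estimate, significant), using a crude |z| > 1.96 cutoff."""
    observations = [random.gauss(true_effect, sigma) for _ in range(n)]
    estimate = statistics.mean(observations)
    standard_error = sigma / n ** 0.5
    return estimate, abs(estimate / standard_error) > 1.96

true_effect = 0.1  # a small but real effect
results = [study(true_effect) for _ in range(20_000)]

all_estimates = [est for est, _ in results]
published = [est for est, significant in results if significant]

print(f"true effect:                 {true_effect}")
print(f"mean of all estimates:       {statistics.mean(all_estimates):.3f}")
print(f"mean of 'published' studies: {statistics.mean(published):.3f}")
```

Averaging every study recovers something close to the true effect, but averaging only the significant ones yields an estimate inflated several-fold, one way a literature can come to report implausibly large effects even when the underlying evidence is weak.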
The incentive structure of current science also negatively impacts scientists’ communi-
cation of their results. The emphasis on producing exciting, publishable results may lead
scientists to cut corners in how experiments are designed and how data are analyzed and
presented. Whether or not these are conscious decisions, scientists may fail to randomize
their experiments or to control for some known confounding factors. Another common
shortcut is data dredging, where data mining techniques are used to uncover patterns
in data sets that support a hypothesis not under investigation. This makes it more likely
that a claimed pattern is actually a type I error (see Chapter 6) and the supported
hypothesis is false. Relatively few studies report effect sizes and measures of uncertainty
in a transparent way, so it’s often hard for others to assess the quality of a study and the
soundness of the methods.
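A toy sketch (ours, not an example from the text) makes the danger vivid: test enough unrelated variables against a single outcome, and some associations will clear the conventional significance bar by chance alone:

```python
import random

random.seed(1)

def noise_correlation_significant(n=50):
    """Correlate two independent noise variables and report whether the
    sample correlation exceeds a rough 5% critical value."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [random.gauss(0, 1) for _ in range(n)]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    r = cov / (var_x * var_y) ** 0.5
    return abs(r) > 1.96 / n ** 0.5  # approximate large-sample cutoff

num_variables = 60  # say, 60 food ingredients tested against one outcome
hits = sum(noise_correlation_significant() for _ in range(num_variables))
print(f"{hits} of {num_variables} pure-noise variables look 'significant'")
# Around 5% of pure-noise tests pass -- each a type I error (see Chapter 6).
```

Dredging a data set for whatever ‘sticks’ and reporting only those hits guarantees a steady stream of such false positives unless the number of tests performed is reported and corrected for.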
Fierce competition in science also leads more and more scientists to abandon academia
for jobs in industry. IT, AI, and pharmaceutical, chemical, and agricultural industries
have been hiring more and more scientists. This raises another worry about sincere and
transparent communication of scientific results. Being funded by a private company to
carry out research may pose conflicts of interest, which introduce funding bias: scientific
studies are more likely to produce findings that support the interests of the study’s financial
sponsor. This can happen because of how values influence science—what to study and
how, what the aim is, how to handle uncertainty, and how to present the findings. It
can also happen via intended or unintentional improper influence on data or methods.
Regardless, this leads to corporations having outsized influences on the nature of our
scientific knowledge and, in some cases, unknowingly—or even knowingly—misleading
the public with bad information.
Another challenge concerns communication: too much science is inaccessible to the
general public and even to many other scientists. Scientific studies get published by for-
profit journals, and these journals typically put articles behind pricey paywalls. Academic
institutions can pay for their faculty and students to have journal access, but not all
academic institutions can afford subscriptions to these journals. By the time science is
reported in newspapers and popular magazines, it is often characterized inaccurately and
misleadingly, full of hype and exaggeration. This too is due to an incentive structure,
this one for journalism: media outlets are rewarded for splash and clicks, not for careful
accuracy. Bad science journalism, along with the seductive allure of celebrities’ (often
misinformed) opinions on issues like nutrition and vaccinations, can fuel serious misun-
derstanding of scientific findings.
Diversity in scientific approaches fosters science’s capacity for self-correction. But
the institutional apparatus of contemporary scientific inquiry has reduced incentives for
undertaking and freedom to pursue research that challenges existing scientific ideas or
develops wholly new theories (Stanford, 2015). This has fostered specialization in the
sciences, and it has also shielded popular theories and methods from being challenged
by competing, and perhaps better, theories and methods. Demographic diversity and
diversity in political views are also important for science’s capacity for self-correction
(Duarte et al., 2015), and yet science has historically been, and remains, limited in both
of these kinds of diversity.
In this book, we have painted a picture of science as fallible but with the tools to
reliably generate knowledge. Some scientific knowledge has had dramatic practical
importance—just think about the outstanding progress of the medical sciences and of
computer science and AI. Other knowledge regards fascinating aspects of the faraway
universe and the strange behavior of microparticles. The value of science in producing
knowledge requires openness to criticism and dissent, the pursuit of meaningful questions,
and the communication of results in a sincere and transparent way. Only then can science
live up to its self-correcting ideal, generate objective knowledge, and thus deserve our
trust. It’s looking like some features of science, including its current incentive structure,
need to be changed to promote these ends.

EXERCISES

8.18 Define diversity in your own words. Choose three characteristics of people (for
example, gender, nationality, and political views), and, for each, describe how sci-
entists’ diversity in that characteristic might contribute positively to science. You can
think about any field(s) of science that will help in answering this question.
8.19 Describe two or three steps that you think could be taken to increase diversity in
science. Mention also any concerns or downsides you can think of for each of these
steps.
8.20 Describe in detail an example of when values have influenced science in an illegiti-
mate way. Then diagnose what went wrong. What was wrong about the values or
the nature of their influence, and what was the detrimental effect to science?
8.21 State the value-free ideal of science. Then, summarize the view of how values can
legitimately factor into science outlined by Kevin Elliott. In your view, does that view
of values’ influence violate the value-free ideal, or is it consistent with that ideal?
Give an argument in support of your answer.
8.22 Suppose you are working for an NGO (non-governmental organization) on the task
of measuring poverty levels across countries. For each of the following decisions,
describe at least two ways to proceed, and say how values are relevant to making
the decision.
a. Which countries will you include in the study?
b. How will you define and measure poverty?
c. What extraneous variables will you take into account?
d. How will you make comparisons across countries?
e. How will your results be publicized?
8.23 List several potential ethical problems arising from scientific research funded by the
pharmaceutical industry. For at least three of these problems, describe a concrete
action to address that ethical problem that could be taken by governments, pharma-
ceutical companies, or some other party.
8.24 Describe a real example of when scientists need to act in the face of uncertainty.
Describe the nature of the uncertainty and explain how social, economic, and moral
considerations might factor into the decision of how to proceed.
8.25 Choose three of the main contemporary challenges to science’s objectivity described
in this section, and rank them in importance from 1 to 3, where 1 is the most important.
For each, describe why it is a problem, including some considerations not provided
in the text; then suggest one step that you think could help address the challenge. You
should also assess how practical it is to implement each of your suggestions.
8.26 In recent years, there have been several retractions of published scientific articles that
have captured the world’s attention. In 2015, it was the retraction of a paper about
gay marriage that was initially published in the prominent scientific journal Science.
Read the description of this case on Retraction Watch (https://retractionwatch.com), and then answer the following questions.
a. What risks do people who report misconduct in science (whistleblowers) face?
b. Were human subjects ‘harmed’ in this case, and if so, how?
c. Describe how data management issues influenced this case.
d. Describe how authorship issues influenced this case.
e. Does this case raise any conflict of interest?
f. What issues does the case raise about collaborating with others?
g. Describe how replication issues influenced this case.
8.27 Look back at the case of climate change discussed in Chapter 1. Identify at least five
ways in which values are likely to have affected that research, and describe how
those values have impeded or promoted scientific knowledge of climate change.

FURTHER READING

For an influential overview of philosophical conceptions of scientific explanation, see
Salmon, W. (1989). Four decades of scientific explanation. Minneapolis: University of
Minnesota Press. See also Psillos, S. (2006). Past and contemporary perspectives on
explanation. In T. Kuipers (Ed.), Handbook of the philosophy of science: Focal issues.
(pp. 121–196). Amsterdam: Elsevier.
For more on explanatory reasoning, see Lombrozo, T. (2012). Explanation and abductive
inference. In K. J. Holyoak and R. G. Morrison (eds.), Oxford handbook of thinking and
reasoning (pp. 260–276). Oxford: Oxford University Press.
For Kuhn’s view on theory change, see Kuhn, T. (1962). The structure of scientific revolu-
tions. Chicago: University of Chicago Press.
For more on the relationship between social institutions, values, and objectivity, see Paul
Feyerabend’s Against method (1975) and Science in a free society (1978). London:
Verso.
For an introduction to the roles values play in science, see Elliott, K. (2017). A tapestry of
values: An introduction to values in science. Oxford: Oxford University Press.
For an account of how values factor into science, see Longino, H. E. (1990). Science as
social knowledge: Values and objectivity in scientific inquiry. Princeton: Princeton Uni-
versity Press.
For an exploration of how social conditions influence science, see Merton, R. K. (1973).
The sociology of science: Theoretical and empirical investigations. Chicago: University of
Chicago Press.
For an account of values in science focused especially on underdetermination, see Douglas,
H. (2000). Inductive risk and values in science. Philosophy of Science, 67, 559–579.
For an overview of objectivity in science, see Reiss, J., & Sprenger, J. (2014). Scientific
objectivity. In Stanford encyclopedia of philosophy. Retrieved from https://plato.stanford.edu/archives/win2017/entries/scientific-objectivity/.
Glossary

abductive inference: a commonly used type of ‘backward’ scientific inference that attributes special
status to explanatory considerations; also called inference to the best explanation
abstraction: leaving out or ignoring known features of a system from a representation or account of it
accuracy: the extent to which a model correctly represents the true value of a target system
addition rule: the probability that one of a number of mutually exclusive outcomes will occur is the sum
of their individual probabilities
affirming the antecedent: using the truth of a conditional statement and its antecedent as grounds for
concluding the consequent is also true; a deductively valid form of inference
affirming the consequent: using the truth of a conditional statement and its consequent as grounds for
concluding the antecedent is also true; a deductively invalid form of inference
algorithm: step-by-step procedure for obtaining some outcome
alternative hypothesis: in statistical hypothesis-testing, a bold and risky conjecture that, contrary
to the null hypothesis, the variables in question are statistically dependent
ampliative inferences: when conclusions express content that, in some sense, goes beyond what is
present in the premises
analogical models: physical or abstract objects with features analogous to focal features of a target
phenomenon used to model the phenomenon
anomaly: a deviation from expectations that resists explanation by the reigning scientific theory; (Kuhnian)
motivation for scientific revolution and paradigm shifts
antecedent: the left side of a conditional claim; a condition that guarantees some consequence; logically
prior
appeal to ignorance: an informal fallacy; concluding that a certain statement is true because there is
no evidence proving that it is not true
appeal to irrelevant authority: an informal fallacy; appealing to the views of an individual who has
no expertise in a field as evidence for some claim
applied research: scientific knowledge used to develop some product, like techniques, software, pat-
ents, pharmaceutical drugs, or new materials; often, a central motivation is to generate products for
profit
argument: a set of statements in which some of the statements, the premises, are intended to provide
rational support or empirical evidence in favor of another statement, the conclusion
assumption: a specification that a target system must satisfy for a given model to be similar to it in the
expected way
auxiliary assumptions: a set of assumptions about how the world works that often go unnoticed but
are needed for a hypothesis or theory to have the expected implications; also called background
assumptions
average: see mean
axioms: statements that are accepted as self-evident truths about some domain, used as a basis for deduc-
tively inferring other truths (theorems) about the domain
bar chart: visual representation of statistical outcomes in which bars of different heights are used to show
the frequency of different values for some discrete variable
basic research: scientific research that aims at knowledge for its own sake; also called pure research
Bayes factor: a compact, numerical way of measuring the statistical evidence for a hypothesis H0 with
respect to alternative H1. It is defined by the formula: B01(E) = [Pr(H0|E) x Pr(H1)] / [Pr(H1|E) x Pr(H0)] =
Pr(E|H0) / Pr(E|H1)
Bayes nets: causal Bayes networks, or nets; a kind of probabilistic model that provides a compact, visual
representation of the causal relationships in a system and the strength of those relationships by using
joint probability distributions
Bayes’s theorem: a mathematical formula used for calculating conditional probabilities. It is defined by
the formula: Pr(H|O) = [Pr(O|H) x Pr(H)] / Pr(O). Another form of Bayes’s theorem that is generally
encountered when comparing two competing hypotheses H and not-H is: Pr(H|O) = [Pr(O|H) x Pr(H)] /
[Pr(O|H) x Pr(H) + Pr(O|not-H) x Pr(not-H)]; the heart of Bayesian statistics
Bayesian conditionalization: a probabilistic rule of inference. It says that, upon observing new evi-
dence O, the new degree of belief in a hypothesis H ought to be equal to the posterior probability
of H: Prnew(H) = Pr(H|O)
bell curve: see normal distribution
biased variable: a random variable that is not fair, that is, for which some outcomes are more likely
than others
big data: very large data sets that cannot be easily stored, processed, analyzed, and visualized with
standard statistical methods
bimodal distribution: two values in a range are the most common; in a histogram, there are two peaks
blind experiment: an experiment or study designed so that the scientists recording or taking measure-
ments don’t know which subjects are in the control group and which are in the experimental group
calibration: comparing the measurements of one instrument with those of another to check the instru-
ment’s accuracy so it can be adjusted if needed
case study: a detailed examination of a single individual or system in a real-life context
causal background: the other factors that in fact do or in principle might causally influence two events,
thereby also potentially affecting the causal relationship between the two events
causal conception of explanation: the view that explanation involves appealing to causes that
brought about the phenomenon to be explained
central limit theorem: a statistical theorem stating that samples of a large enough size will have a mean
approximating the mean of the population
central tendency: a distribution with one peak at the center, corresponding to the most common group
of values of a variable
cluster indicators: identify several markers of some trait in order to more precisely define the trait while
not oversimplifying it
cohort study: a study in which researchers select a group of subjects and track them over time, at set
intervals, to observe the effects of some condition they experience
collecting data: gathering and measuring information about variables
collectively exhaustive outcomes: when at least one outcome of a set of outcomes must occur at
any given time
common cause: when neither of two covarying types of events causes the other but a third event causes both
computer simulation: a program run on a computer using algorithms to explore the dynamic behavior
of a target system; also called computer model
conclusion: in an argument, a statement that is supported by the premises
conditional probability: the probability of an event’s occurrence given that some other event has
occurred; expressed Pr(X|Y) where X is conditional on Y
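For example, with one roll of a fair die, where X = 'a six is rolled' and Y = 'an even number is rolled' (a minimal sketch, not from the text):

```python
from fractions import Fraction

# Conditional probability: Pr(X|Y) = Pr(X and Y) / Pr(Y).
p_X_and_Y = Fraction(1, 6)  # only the roll 6 is both a six and even
p_Y = Fraction(3, 6)        # the even rolls are 2, 4, and 6

print(p_X_and_Y / p_Y)  # 1/3 -- the probability of a six, given an even roll
```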
conditional statements: statements in which one circumstance, the antecedent, is given as a condition
for another circumstance, the consequent; the antecedent guarantees the occurrence of the consequent
confederate: in an experiment, an actor who pretends to be a subject
confirmation: the observation matches the expectation based on the hypothesis, providing evidence in
favor of the hypothesis
confirmation bias: the tendency we all have to look for, interpret, and recall evidence in ways that
confirm and do not challenge our existing beliefs
conflicts of interest: financial or personal gains that may inappropriately influence scientific research,
results, or publications; scientists are obligated to disclose any potential conflicts of interest
confounding variables: extraneous variables that have varied in an uncontrolled way and influenced
the dependent variable under investigation
consequent: the right side of a conditional claim; the condition that arises from, or is guaranteed by,
the antecedent
contributing cause: a cause that is neither necessary nor sufficient to bring about an effect; also called
a partial cause
control group: a group that is similar to the experimental group but experiences other value(s) of the
independent variable, i.e., does not receive the intervention
correlated variables: the value of one variable raises or lowers the probability of the other variable
taking on some value
correlation coefficient: describes the direction and strength of correlation; a positive or negative sign
indicates positive or negative correlation, and its absolute value, a number between 0 and 1, indicates
the strength of the correlation
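A minimal sketch of the (Pearson) calculation on two made-up data sets:

```python
from math import sqrt

# Correlation coefficient for two small, made-up variables x and y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))

r = cov / (sd_x * sd_y)
print(round(r, 2))  # 0.85 -- a fairly strong positive (direct) correlation
```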
correlation strength: how predictable the values of one variable are based on the values of the other
variable
counterexamples: situations you can describe, whether real or imagined, in which the premises of an
argument are true but the conclusion false; they show that a deductive argument is invalid
crisis: a period of widespread failure of confidence in the ability of a (Kuhnian) paradigm to fulfill
its scientific function
cross-sectional study: a study in which different individuals are measured for some property or con-
dition at a single, given time; helpful in investigating relationships among a number of different
variables
crucial experiment: an experiment that decisively adjudicates between two hypotheses, settling once
and for all which is true
curve fitting: extrapolating from a data set to the expected data for measurements that weren’t actually
taken by fitting a continuous line through a data plot; there are always multiple different lines consistent
with the data
data: public records produced by observation or by some measuring device; allow observations to be
recorded and compared
data cleansing: identifying and correcting errors in a data set by deciding which data are questionable
and should be eliminated
data dredging: using data mining techniques to uncover patterns in one’s data that support some hypoth-
esis that one did not set out to test in advance
data model: a regimented representation of some data set, often with the aim of highlighting whether or
not the data count as evidence for a given hypothesis
deception: when researchers actively misinform participants about some aspect of an experiment or
study
deductive inference: an inference in which the relationship between premises and conclusion is pur-
ported to be one of necessitation; in a valid deductive argument, the truth of the premises necessitates
the conclusion; in an invalid deductive argument, it does not
denying the antecedent: using the falsity of an antecedent and the truth of a conditional as grounds
for concluding the consequent is false; a deductively invalid form of inference
denying the consequent: using the falsity of a consequent and the truth of a conditional as grounds for
concluding the antecedent is false; a deductively valid form of inference
dependent variable: a variable that is expected to depend on, or be the effect of, the independent
variable
descriptive claim: a statement about how things are without making any value judgments
descriptive statistics: tools for summarizing, describing, and displaying data in a meaningful way
difference-making: the idea that if the occurrence of one event makes a difference to the occurrence of
a second event, the first is a cause of the second
direct correlation: greater values for one variable increase the probability of greater values for a sec-
ond variable; also referred to as positively correlated
direct variable control: when all extraneous variables are held at constant values during an
intervention
directed acyclic graphs: graphs in which all the causal relationships are one-directional (none of a
cause’s effects are also among its causes) and do not move in a circle (following a series of cause-
effect relationships will not lead you back to an earlier cause as a later effect)
distal causes: causes that occurred further back in time from the effect and perhaps further away as well
double-blind experiment: an experiment or study in which both scientists and subjects are unaware of
which subjects are in which group (control or experimental) because of randomization
Duhem-Quine problem: the idea that scientific hypotheses can never be tested in isolation; instead, scien-
tific hypotheses are tested only against the background of auxiliary assumptions
ecological validity: the degree to which experiment circumstances are representative of real-world
circumstances
effect size: a quantitative, scale-free measure of the strength of a phenomenon
empirical evidence: information gathered through the senses, including with the use of technology to
extend the reach of the senses, that weighs in favor of or against some hypothesis
estimation: predicting properties of a population on the basis of a sample
eugenics: the idea that a human population can be improved by controlling breeding; historically linked
to racist and classist science that threatened human liberties and human dignity
evidence: fact or information that makes a difference to what one is justified in believing
evidentialism: the idea that a belief’s justification is determined by how well the belief is supported by
evidence
exemplar: a model that is one of the target systems it is used to represent
expectations: conjectural claims about observable phenomena based on some hypothesis; expectations
should be true if the hypothesis is true, false if the hypothesis is false
experiment: a method of testing hypotheses that involves intervening on one or more variables of interest
and observing what effects this has
experimental group: a group that receives the intervention to the independent variable or otherwise
experiences the intended value of the independent variable
explanatory knowledge: generating answers to questions about how things work and why things are
the way they are
exploratory experiment: an experiment that does not rely on existing theory and may not aim to
test a specific hypothesis; used to suggest novel hypotheses or to assess whether a poorly understood
phenomenon actually exists
external experimental validity: the extent to which experimental results generalize from the experi-
mental conditions to other conditions—especially to the phenomena the experiment is supposed to
yield knowledge about
extraneous variables: other variables besides the independent variable that influence the value of the
dependent variable; if uncontrolled, these may become confounding variables
fair variable: a random variable that has independent outcomes and is unbiased, that is, its outcomes
are all equally likely
faithfulness: the requirement that probabilistically independent variables are not directly causally
related; an assumption of causal Bayes nets
falsifiable: evidence can be described that, if found, would show the claim to be false; a key feature of
scientific claims
falsificationism: the idea, due to Karl Popper, that scientific reasoning proceeds by attempting to dis-
prove ideas rather than to prove them right
field experiment: an experiment conducted outside of a laboratory, in the experimental subjects’ every-
day environment
frequency distribution: how often a variable takes on each range of values in a data set
frequentist interpretation: the idea that the probability of an outcome is the limit of its relative fre-
quency; an element of classical statistics
full control: creating the conditions such that no variables other than the target independent variable and
the dependent variable change as a result of an intervention
funding bias: when a scientific study is more likely to support the interests of its financial sponsor(s)
gambler’s fallacy: fallaciously reasoning from a past variation from the expected frequency of outcomes
to the conclusion that there will be a future variation from the expected frequency in the opposite
direction; errantly supposing statistical dependence of outcomes
Gaussian distribution: see normal distribution
generality: a desirable feature of models; a model’s ability to apply to a greater number of target systems
Hawthorne effect: a confounding variable in experiments involving human participants, where experimen-
tal participants change their behavior, perhaps unconsciously, in response merely to being observed;
see also observer bias
histogram: visual representation of statistical outcomes in which bars of different heights are used to
represent the frequency of different values of a continuous variable
hypothesis: a conjectural statement based on limited data; a guess about what the world is like, which
is not (yet) backed by sufficient, or perhaps any, evidence
hypothetico-deductive method: a method of hypothesis-testing; an expectation is deductively inferred
from a hypothesis and compared with an observation; violation of the expectation deductively refutes
the hypothesis, while a match with the expectation non-deductively boosts support for the hypothesis
idealization: assumption made without regard for whether it is true, often with full knowledge that it is
false
illusion of explanatory depth: believing that one understands the world more clearly and in greater
detail than actually is the case
illusion of understanding: a lack of genuine understanding of some topic linked to a lack of apprecia-
tion for the depth of one’s ignorance about the topic
independent outcomes: the probability of the outcome of one trial is not conditional on the outcomes
of any other trials; e.g., numbers rolled on two different dice rolls are independent from one another
independent variable: a variable that is changed or observed at different values in order to investigate
the effect of the change
indirect correlation: greater values for one variable increase the probability of smaller values for a
second variable; negatively correlated
indirect variable control: causing the influence of extraneous variables to vary in a way that is inde-
pendent from the value of the independent variable
inductive: an inferential relationship from premises to conclusion that is one of probability not necessity
inductive generalization: inference to a general conclusion about the properties of a class of objects
based on the observation of some number of objects in the same class
inductive projection: inference to a conclusion about the feature of some object that has not been
observed based on the observation that some objects of the same kind have that feature
inductive strength: the probabilistic extent to which the conclusion of an inductive inference is true
given that its premises are all true.
inference: a logical transition from one thought to another that can be characterized in terms of abstract
rules
inference to the best explanation: see abductive inference
inferential statistics: using statistical reasoning to draw broader conclusions on the basis of limited data
informal fallacies: inference patterns that involve a problem with the content of an inference; a deduc-
tive argument that commits an informal fallacy may be valid, but it will not be sound
instruments: technological tools or other kinds of apparatus used in experiments
intelligent design: the idea that life forms are so complex that they couldn’t possibly have come about
without the help of an intelligent designer (such as the Judeo-Christian God)
internal experimental validity: the degree to which scientists can draw accurate conclusions about
the relationship between the independent and dependent variables
intervention: a direct manipulation of the value of the independent variable
isomorphism: one idea of the relationship a model bears to its target system(s); a one-to-one correspon-
dence between each part or feature of the model and of the target
joint method of agreement and difference: one of Mill’s methods; considering cases where the
suspected effect occurs to see what they have in common (method of agreement), as well as consider-
ing cases where the suspected effect does not occur to see what those have in common (method of
difference)
joint probability distribution: the probability distribution for each of a set of variables, taking into
account the probability of the other variables in the set
justification: reasons for belief; one requirement for a belief to qualify as knowledge
knowledge: traditionally, a belief that is at least both true and sufficiently justified
laboratory experiments: experiments conducted in a laboratory, giving scientists control over inter-
ventions performed and direct and indirect control of many extraneous variables
likelihood: often used as a synonym for ‘probability’, or to refer to the probability of observed data given
the truth of a specific hypothesis. More precisely, a likelihood is a function of the parameters of a
statistical model given observed data
logic: the study of the rules and patterns of good and bad inference
longitudinal study: a study in which the same subjects are measured (for some property or condition)
repeatedly over a period of time, sometimes many years, allowing the researchers to track a subject’s
change
Markov condition: the requirement that causal variables, conditional on their parent causes, are
probabilistically independent of all their other ancestors; an assumption of causal Bayes nets
Matilda effect: the bias against recognizing the achievements of women scientists, whose work is often
uncredited or else attributed to their male colleagues instead
material conditional: a conditional statement (with an antecedent and consequent) that is false only if
the antecedent is true while the consequent is false
mathematical models: mathematical formulas that relate variables, parameters, and constants to one
another to represent one or more target systems
mean: a measure of the central tendency of a data set; the sum of all values in the data set divided by the
number of instances; also called average
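A minimal illustration with a made-up data set (the median and mode entries below use the same numbers):

```python
import statistics

data = [2, 3, 3, 5, 7]

print(sum(data) / len(data))    # 4.0 -- the mean (average)
print(statistics.median(data))  # 3 -- the middle value
print(statistics.mode(data))    # 3 -- the most frequent value
```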
measurement error: the difference between the measured value of a quantity and its true value
mechanisms: complex hierarchical systems consisting of component parts and operations that are orga-
nized so as to causally produce a phenomenon
mechanistic conception of explanation: the view that phenomena are explained by showing how
they are produced by mechanisms, that is, by the organized operations of those mechanisms’ component
parts
median: a measure of the central tendency of a data set; the middle value in a distribution when the
values are arranged from the lowest to the highest
method of agreement: one of Mill’s methods; considering cases where the suspected effect occurs to
see what they have in common
method of concomitant variations: one of Mill’s methods; using the observation that the value of
one variable changes in tandem with changes to the value of a second variable to infer that the two
are causally related
method of difference: one of Mill’s methods; considering cases where the suspected effect does not
occur to see what those have in common
method of residues: one of Mill’s methods; comparing cases in which a set of causes brings about a
set of effects to cases in which some of those causes bring about some of those effects and inferring,
on that basis, that the absent cause(s) are responsible for the absent effect(s)
methodological naturalism: the idea that scientific theories shouldn’t postulate supernatural or other
spooky kinds of entities
mode: a measure of the central tendency of a data set; the most frequent value in the data set
modularity: the assumption that interventions on some causal relationship will not change other causal
relationships in the system
modus ponens: see affirming the antecedent
modus tollens: see denying the consequent
monotonic: the addition of new information never invalidates the inference
multiplication rule: the probability that two independent events both occur is the result of multiplying
their individual probabilities
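For example, the probability of rolling a six on each of two fair dice (a minimal sketch):

```python
from fractions import Fraction

# Multiplication rule for independent events.
p_six = Fraction(1, 6)
print(p_six * p_six)  # 1/36 -- the probability of two sixes
```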
mutually exclusive outcomes: a set of outcomes, only one of which can occur in a given trial; e.g.,
rolling a one and a three are mutually exclusive outcomes
natural experiments: interventions on independent variables occur naturally without experimenters
influencing the system
natural explanations: explanations that invoke features of the world to account for the phenomena
under investigation
natural phenomena: objects, events, regularities, and processes that are sufficiently uniform to make
them susceptible to systematic study
necessary cause: a cause that must occur in order for the effect to come about
necessary condition: a condition that must be satisfied in order for a specified outcome to occur
negatively correlated: greater values for one variable decrease the probability of greater values for a
second variable; also known as indirect correlation
nodes: used to represent variables in causal Bayes nets
nomological conception of explanation: the idea that a phenomenon is explained by deductively
inferring it from a scientific law and some initial conditions
non-ampliative: an inference in which the conclusion doesn’t add any new content beyond what’s
explicitly or implicitly contained in the premises; a property of valid deductive inference
non-monotonic: the addition of new information can invalidate the inference
normal distribution: a symmetric, unimodal distribution with the most common values at the middle
and decreasingly common outcomes as the values get higher and lower; also called a bell curve or
Gaussian distribution
normal science: the most common (Kuhnian) phase of science, within which scientific research is stable
and based on widespread agreement about basic assumptions; this follows either pre-paradigm sci-
ence or scientific revolution
normative claim: a statement about how things ought to be, which might or might not correspond to
how they in fact are
null hypothesis: a reasonable default assumption about how the world is, which is not a bold and risky
conjecture; in statistical hypothesis-testing, the null hypothesis generally states that the variables in
question are statistically independent
observable: capable of being perceived or detected with the use of one’s senses under appropriate
circumstances; observability is relative to specific epistemic communities, their scientific theories, and
technical apparatus
observation: any information gained from your senses—not only what you see but also what you hear,
smell, touch, and sense in any other way you can experience the world
observational study: collecting and analyzing data without performing interventions or, often, aiming
to control extraneous variables
observer bias: See Hawthorne effect
observer-expectancy effect: when a scientist’s expectations lead her to unconsciously influence the
behavior of experimental subjects
ontological naturalism: the idea that no supernatural entities exist
openness to falsification: the willingness to abandon any claim or theory when the preponderance of
evidence suggests it’s wrong; a key feature of science
operational definition: a specification of the conditions when some concept applies, enabling mea-
surement or other kinds of precision
outcome space: the set of all values a random variable can take on, also called sample space
outliers: measured values for a variable that are notably different from the other values in the data set
p-value: the probability of the observed data assuming the null hypothesis is true
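A worked sketch, not from the text: the one-sided p-value for observing 9 heads in 10 flips of a coin, under the null hypothesis that the coin is fair:

```python
from math import comb

# Probability, under the null hypothesis of a fair coin, of a result
# at least as extreme as the one observed (9 or 10 heads in 10 flips).
p_value = sum(comb(10, k) for k in (9, 10)) / 2 ** 10
print(round(p_value, 4))  # 0.0107 -- below a 0.05 significance level
```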
paradigm: according to Kuhn, a way of practicing science; provides scientists with a stock of assump-
tions about the world, concepts and symbols for effective communication, methods for gathering and
analyzing data, and other habits of research and reasoning
parameter: a quantity whose value can change in different applications of a mathematical equation but
that only has a single value in any one application of the equation
pattern conception of explanation: the idea that a phenomenon is explained by fitting it into a more
general framework of laws and principles
perfectly controlled experiment: an experiment in which all variables are controlled except for the
independent variable, an intervention is performed on the independent variable, and the effects on
the dependent variable are measured; no confounding variables are possible
Persian Golden Age: period of rapid intellectual achievements in science, philosophy, literature, and
art spanning from Central Asia to the Arabian Peninsula between the 8th and 13th centuries, which
was the core part of the so-called Islamic Golden Age more generally; arguably the most important
period in the development of science prior to the Scientific Revolution
phenomena: things or processes as we experience them; appearances of objects, events, regularities,
or processes that exist or occur
philosophy of science: the investigation of science, focused especially on questions of what science
should be like in order to be a trustworthy route to knowledge and to achieve the other ends we want
it to have, such as usefulness to society
physical constants: quantities that are universal and unchanging over time
physical process: an account of causation in which causation consists in some continuous physical
process, such as energy transfer
pie chart: visual representation of statistical outcomes in which a circle is divided into slices used to
show the relative frequency of the different values in the outcome space for some variable
placebo effect: when an experimental subject’s expectations lead to the outcome the subject expects;
this can be an extraneous variable
plagiarism: stealing somebody else’s ideas, data, or words by presenting them as one’s own work and
failing to give appropriate credit
population: a collection of entities that are grouped together, often in virtue of exhibiting common
features
population validity: the degree to which experimental entities are representative of the broader class of
entities of interest; for experiments with human subjects, this is the broader population
positively correlated: greater values for one variable increase the probability of greater values for a
second variable; also known as direct correlation
post hoc, ergo propter hoc: the mistaken conclusion that one event causes another simply because
the events occur in succession close to each other; translated from Latin, ‘after this, therefore because
of this’
posterior probability: the probability of a hypothesis conditional on an observation that has been
made; Bayes’s theorem can be used to calculate this
power: the probability that the test will reject a false null hypothesis
precision: the extent to which a model finely specifies features of a target system
premises: statements that provide support for some conclusion; the starting points for an inference
pre-paradigmatic: the earliest phase of science according to Kuhn; characterized by the existence of
different schools of thought that debate very basic assumptions, including research methods and the
nature and significance of data
prior probability: the rational degree of belief in a hypothesis before making a given observation
probability distribution: how often a variable is expected to take on each of a range of values
probability theory: a mathematical theory developed to deal with random variables, or outcomes that
are individually unpredictable but that behave in predictable ways over many occurrences
problem of induction: the idea that inductive inference cannot be logically justified, since any possible
justification would need to employ inductive reasoning and would thus be circular
prospective study: a study in which researchers identify a group of subjects with some property or
condition and track their development forward in time
proximate causes: causes that occur closely in time and perhaps in space to their effect
pseudoscience: a non-scientific activity that masquerades as science; often designed to
deceive people into believing it has scientific legitimacy
publication bias: the tendency to publish surprising, new results more often than negative results, repli-
cation studies, and exploratory work
qualitative data: information that is non-numerical and without some other standard that makes it easily
comparable, such as diary accounts, unstructured interviews, and observations of animal behavior
qualitative variables: variables with values that are not numerical but descriptive, such as the variable
sport, with the values basketball, hockey, and so on.
quantitative analysis: the use of mathematical techniques to measure or investigate phenomena
quantitative data: data that is easily comparable, often in numerical form, such as numbers, vectors,
or indices
quantitative variables: variables with numerical values, such as height or percent correct on an exam
random sampling: the individuals composing the sample are selected randomly from the population
random variables: variables that take on different values that are individually unpredictable but predict-
able in the aggregate
randomization: randomly assigning experimental entities to experimental and control groups
rational degree of belief: the interpretation of posterior probability in Bayesian statistics; believing a
hypothesis to the same degree as the probability it is true given the observations that have been made
range: a measure of variability; the difference between the smallest and largest values in a data set
reasoning: psychological processes leading to beliefs; could be inferential or not
refutation: one possible outcome of the H-D method; the observation contradicts the expectation
deductively inferred from the hypothesis, and the hypothesis is deductively proven to be false
regression analysis: finding the best-fitting line through the points on a scatterplot
regression to the mean: the tendency for outlier values to be followed, or preceded, by less extreme
values
relative frequency distributions: frequency distributions that record proportions of occurrences of
each value of a variable rather than absolute numbers of occurrences
replication: performing an experiment again—often with some modification to its design—in order to
check whether the result remains the same
representative: the experimental entities studied do not vary in any systematic way from the general
population
retrospective study: a study in which researchers first identify a group of subjects who have the
target property or condition, and then investigate their past in an attempt to isolate the cause of
the condition
robustness: a desirable feature of models; a measure of insensitivity to features that differ from the target
in a given model
robustness analysis: analyzing multiple models or different versions of a model to determine whether
and to what extent their results are consistent
sample: a subset of a population about which data are gathered
sample data: data about individuals in a sample
sample mean: the most likely average value of the trait of interest in a population
sample size: the number of individual sources of data in a study, often the number of experimental enti-
ties or subjects
sample space: see outcome space
sample standard deviation: an estimate of the spread of the probability distribution for the random
variable; s = √[∑(value − mean)² / (n − 1)]
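A minimal sketch of the formula on a made-up sample:

```python
from math import sqrt

sample = [4, 8, 6, 5, 7]
n = len(sample)
mean = sum(sample) / n
s = sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
print(round(s, 3))  # 1.581 -- statistics.stdev(sample) gives the same value
```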
sampling error: incorrect conclusion due to a non-representative sample
scale model: a concrete physical object that serves as a representation of one or more target
systems
scatterplot: visual representation of statistical outcomes in which the values of one variable are plotted
against the values of another variable
science: an inclusive social project of developing natural explanations for natural phenomena; these
explanations are tested using empirical evidence and should be subject to additional open criticism,
testing, refinement, or even rejection; science regularly, but not always, employs mathematics in both
the formulation and testing of its explanations
scientific breakthrough: a radical shift in the theories of some field of science
scientific revolution: a radical change in which a reigning theory is overturned in favor of a new theory,
often involving an alternative worldview; Kuhn’s view of the nature of scientific change
Scientific Revolution (the): beginning with the work of Copernicus and ending with the work of New-
ton; a fundamental transformation in ideas about how knowledge claims ought to be justified, which
led to the development of the scientific method
scientific theory: a large-scale system of ideas about a natural phenomenon supported by a variety of
evidence
self-explanation effect: the observation that generating explanations to oneself or to others can facili-
tate the integration of new information into existing bodies of knowledge and can lead to deeper
understanding
set: a grouping of objects (called elements)
significance level: how improbable, given the null hypothesis, an experimental result must be to warrant
rejecting the null hypothesis
Simpson’s paradox: a correlation between two events that disappears, or is reversed, when data are
grouped in a different way
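A worked sketch with made-up recovery counts, in which a treatment looks better than a control within each severity group but worse overall:

```python
# Each pair is (recovered, total); the numbers are hypothetical.
mild = {"treatment": (81, 87), "control": (234, 270)}
severe = {"treatment": (192, 263), "control": (55, 80)}

def pct(recovered, total):
    """Recovery rate as a whole-number percentage (rounded down)."""
    return int(100 * recovered / total)

for name, group in (("mild", mild), ("severe", severe)):
    print(name, pct(*group["treatment"]), "vs", pct(*group["control"]))
# mild 93 vs 86; severe 73 vs 68 -- the treatment wins in each group

t_rec, t_tot = 81 + 192, 87 + 263  # treatment overall: 273 of 350
c_rec, c_tot = 234 + 55, 270 + 80  # control overall: 289 of 350
print("overall", pct(t_rec, t_tot), "vs", pct(c_rec, c_tot))
# overall 78 vs 82 -- grouped the other way, the control wins
```

The reversal arises because, in these made-up numbers, severe cases (which recover less often) received the treatment far more often than the control.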
68–95–99.7 rule: the percentages of values that lie within one, two, and three standard deviations around
the mean of a normal distribution
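For instance, applied to a normal distribution with a hypothetical mean of 100 and standard deviation of 15:

```python
mean, sd = 100, 15
for k, pct in ((1, 68), (2, 95), (3, 99.7)):
    print(f"about {pct}% of values lie between {mean - k * sd} and {mean + k * sd}")
# about 68% of values lie between 85 and 115, and so on
```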
soundness: a property that deductive arguments have when they are both valid and have all true
premises
spurious correlations: two events are correlated but aren’t causally related in any obvious way
standard deviation: the square root of the variance; for a population, s = √[∑(value − mean)² / n]
standard error: the standard deviation of the sampling distribution of the mean; SE = s/√(sample size)
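Continuing the made-up sample used under ‘sample standard deviation’ above:

```python
from math import sqrt
import statistics

sample = [4, 8, 6, 5, 7]
s = statistics.stdev(sample)  # sample standard deviation, ~1.581
se = s / sqrt(len(sample))    # SE = s / sqrt(sample size)
print(round(se, 3))  # 0.707
```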
statistical description: summarizing, describing, and displaying data in a meaningful way
statistically independent: two events for which the occurrence of one does not increase or decrease
the probability of the other; that is, when Pr(Y|X) = Pr(Y) and Pr(X|Y) = Pr(X)
statistically significant: data with a p-value below the chosen significance level; grounds for rejecting
the null hypothesis
strawman fallacy: an informal fallacy; caricaturing an argument in order to criticize the caricature
rather than the actual view
subjects: humans, non-human animals, or inanimate objects in an experiment or non-experimental study;
also called experimental entities
subtraction rule: the probability that some outcome doesn’t occur is the result of subtracting the prob-
ability of that outcome from the total probability (Pr = 1)
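For example, the probability of not rolling a six on one roll of a fair die (a minimal sketch):

```python
from fractions import Fraction

# Subtraction rule: Pr(not X) = 1 - Pr(X).
print(1 - Fraction(1, 6))  # 5/6
```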
sufficient causes: causes that always bring about the effect
sufficient condition: a condition that, if met, guarantees a specified outcome will occur
super-observational: enhancement of our powers of observation far beyond what they ordinarily
include through the use of tools or other implements
target system: a selected part of the structure of the world, about which scientists aim to gain knowledge;
the phenomenon intended to be represented by a model
theorems: statements deductively inferred from a set of axioms
theoretical claims: claims made about entities, properties, or occurrences that are not directly observable
thought experiments: devices of the imagination that scientists can use to learn about possible effects
of an intervention; may supplement or replace empirical evidence
total probability: the probability of the whole set of values in an outcome space for some random variable; always Pr = 1
tractability: the degree of ease in developing or using a model
type I error: a false positive; the erroneous rejection of the null hypothesis when it is true
type II error: a false negative; the erroneous acceptance of the null hypothesis when it is false
underdetermination: when evidence is insufficient to determine which of multiple theories or hypoth-
eses is true
understanding: grasping why or how something came about or is the way it is
uniform distribution: all values in a range are equally likely; a histogram shows a flat line
unimodal distribution: one value in a range is the most common; in a histogram, there is one peak
valid: a property of deductive inference in which the truth of the premises logically guarantees or neces-
sitates the truth of the conclusion
value of a variable: the particular state or quantity that a variable has taken on in some instance
value-free ideal: the idea that good science should not rely on moral and political beliefs in assessing
the evidence for scientific models, theories, or hypotheses
variability: the distribution of values in a data set; measures of variability like standard deviation and
variance indicate how spread out the data set is; also called spread
variable: anything that can vary, change, or occur in different states and that can be measured
variance: a measure of how far a set of data is spread out from the average value of the data set; the
average of the squared differences of the values of a random variable from its mean
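A minimal illustration with the same made-up data set used for the standard deviation entries:

```python
data = [4, 8, 6, 5, 7]
mean = sum(data) / len(data)
variance = sum((v - mean) ** 2 for v in data) / len(data)
print(variance)  # 2.0 -- the square root of this is the standard deviation
```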
References

Ahmed, M., Anchukaitis, K., Asrat, A., Borgaonkar, H., Braida, M., Buckley, B., …, & Curran, M. (2013).
Continental-scale temperature variability during the past two millennia. Nature Geoscience, 6, 339–346.
Al-Khalili, J. (2015). In retrospect: Book of optics. Nature, 518(7538), 164–165.
American Association for the Advancement of Science. (2001). Designs for science literacy. New York:
Oxford University Press.
Anderegg, W. R. L., Prall, J. W., Harold, J., & Schneider, S. H. (2010). Expert credibility in climate change.
Proceedings of the National Academy of Sciences, 107, 12107–12110.
Arrhenius, S. (1908). Worlds in the making: The evolution of the universe. London: Harper & Brothers.
Axelrod, R. (1984). The evolution of cooperation. New York: Basic Books.
Bao, X., & Eaton, D. W. (2016). Fault activation by hydraulic fracturing in western Canada. Science, 354,
1406–1409.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., .  .  . &
Cesarini, D. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6–10.
Blackawton, P. S., Airzee, S., Allen, A., Baker, S., Berrow, A., Blair, C., .  .  . & Hackford, C. (2011).
Blackawton bees. Biology Letters, 7, 168–172.
Broca, P. (1861). Remarques sur le siège de la faculté du langage articulé, suivies d’une observation
d’aphémie (perte de la parole). Bulletins de la Société d’anatomie, 2e serie, 6, 330–357.
Callaway, E. (2017). Oldest Homo sapiens fossil claim rewrites our species’ history. Nature News, 8 June
2017.
Callendar, G. S. (1939). The composition of the atmosphere through the ages. Meteorological Magazine,
74(878), 33–39.
Camerer, C. F. (1997). Taxi drivers and beauty contests. Engineering and Science, 60(1), 10–19.
Camerer, C. F., Babcock, L., Loewenstein, G., & Thaler, R. (1997). Labor supply of New York City cabdriv-
ers: One day at a time. Quarterly Journal of Economics, 112, 407–441.
Capra, F. (1975). The Tao of physics. Boston: Shambhala Publications.
Cartwright, N. (1989). Nature’s capacities and their measurement. Oxford: Oxford University Press.
Chatrchyan, S., Khachatryan, V., Sirunyan, A. M., Tumasyan, A., Adam, W., Aguilo, E., . . . & Friedl,
M. (2012). Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC.
Physics Letters B, 716(1), 30–61.
Chattopadhyay, R., & Duflo, E. (2004). Women as policy makers: Evidence from a randomized policy
experiment in India. Econometrica, 72(5), 1409–1443.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-
analysis. New York: Routledge.
Darley, J. M., & Latane, B. (1968). Bystander intervention in emergencies: Diffusion of responsibility.
Journal of Personality and Social Psychology, 8, 377–383.
Darwin, C. (1872). On the origin of species by means of natural selection, or the preservation of favoured
races in the struggle for life, 6th edition. London: John Murray.
Dockery, D. W., Pope, C. A., Xu, X., Spengler, J. D., Ware, J. H., Fay, M. E., . . . & Speizer, F. E. (1993).
An association between air pollution and mortality in six US cities. New England Journal of Medicine,
329(24), 1753–1759.
Donovan, A. (1993). Antoine Lavoisier: Science, administration, and revolution. Oxford: Blackwell.
Duarte, J. L., Crawford, J. T., Stern, C., Haidt, J., Jussim, L., & Tetlock, P. E. (2015). Political diversity will
improve social psychological science. Behavioral and Brain Sciences, 38, 1–13.
Dyson, F. W., Eddington, A. S., & Davidson, C. R. (1920). A determination of the deflection of light by the
sun’s gravitational field, from observations made at the solar eclipse of May 29, 1919. Philosophical
Transactions of the Royal Society A, 220, 571–581.
Eberhardt, F. (2009). Introduction to the epistemology of causation. The Philosophy Compass, 4(6),
913–925.
Eddington, Sir Arthur. (1935/2012). New pathways in science: messenger lectures (1934). Cambridge:
Cambridge University Press.
Elliott, K. C. (2017). A tapestry of values: An introduction to values in science. Oxford: Oxford University
Press.
Enten, H. (2017). What Harry got wrong in 2016. FiveThirtyEight. Retrieved from http://fivethirtyeight.
com/features/what-harry-got-wrong-in-2016/
Fisher, R. A. (1956). Mathematics of a lady tasting tea. In J. R. Newman (Ed.), The world of mathematics
(pp. 1512–1521). New York: Simon & Schuster. (Original work published in Fisher, R. A. (1935). The
design of experiments. Edinburgh: Oliver & Boyd).
Fizeau, H. (1849). Sur une expérience relative à la vitesse de propagation de la lumière. Comptes rendus,
29, 90–92.
Floridi, L. (2012). Big data and their epistemological challenge. Philosophy and Technology, 25, 435–437.
Galton, F. (1889). Natural inheritance. London: Macmillan.
Gelman, A., & Hennig, C. (2017). Beyond subjective and objective in statistics (with discussion). Journal
of the Royal Statistical Society, 180(4), 967–1033.
Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself
statistically significant. The American Statistician, 60(4), 328–331.
Gillham, N. W. (2001). Sir Francis Galton and the birth of eugenics. Annual Review of Genetics, 35,
83–101.
Glymour, C. (2007). When is a brain like the planet? Philosophy of Science, 74(3), 330–347.
Gopnik, A. (1998). Explanation as orgasm. Minds and Machines, 8(1), 101–118.
Guéguen, N., Jacob, C., Le Guellec, H., Morineau, T., & Lourel, M. (2008). Sound level of environ-
mental music and drinking behavior: A field experiment with beer drinkers. Alcoholism: Clinical and
Experimental Research, 32(10), 1795–1798.
Güth, W., Schmittberger, R., & Schwarze, B. (1982). An experimental analysis of ultimatum bargaining.
Journal of Economic Behavior and Organization, 3, 367–388.
Haddad, D., Seifert, F., Chao, L. S., Possolo, A., Newell, D. B., Pratt, J. R., . . . & Schlamminger, S. (2017).
Measurement of the Planck constant at the National Institute of Standards and Technology from 2015
to 2017. Metrologia, 54, 633–641 (arXiv: 1708.02473).
Harlow, J. M. (1848). Passage of an iron rod through the head. Boston Medical and Surgical Journal, 39,
389–393.
Harlow, J. M. (1868). Recovery from the passage of an iron bar through the head. Publications of the
Massachusetts Medical Society, 2, 327–347.
Hempel, C. G. (1966). Philosophy of natural science. Englewood Cliffs: Prentice-Hall.
Herschel, W. (1801). Observations tending to investigate the nature of the sun, in order to find the causes
or symptoms of its variable emission of light and heat: With remarks on the use that may possibly be
drawn from solar observations. Philosophical Transactions of the Royal Society of London, 91, 265–318.
Hesslow, G. (1976). Two notes on the probabilistic approach to causality. Philosophy of Science, 43(2),
290–292.
Hodges, J., & Tizard, B. (1989). Social and family relationships of ex-institutional adolescents. Journal of
Child Psychology and Psychiatry, 30, 77–97.
Hubble, E. (1929). A relation between distance and radial velocity among extra-galactic nebulae.
Proceedings of the National Academy of Sciences, 15(3), 168–173.
Hublin, J. J., Ben-Ncer, A., Bailey, S. E., Freidline, S. E., Neubauer, S., Skinner, M. M., . . . & Gunz, P.
(2017). New fossils from Jebel Irhoud, Morocco and the pan-African origin of Homo sapiens. Nature,
546(7657), 289–292.
Hume, D. (1738/2007). A treatise of human nature (D. F. Norton & M. J. Norton, eds.). Oxford: Clarendon
Press.
Hume, D. (1748/1999). An enquiry concerning human understanding (T. L. Beauchamp, ed.). Oxford and
New York, NY: Oxford University Press.
Huygens, C. (1690/1962). Treatise on light (S. P. Thompson, trans.). New York: Dover Publications.
Intergovernmental Panel on Climate Change (IPCC). (2014). Climate change 2014: Synthesis report.
Retrieved from www.ipcc.ch/news_and_events/docs/ar5/ar5_syr_headlines_en.pdf
Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Med, 2(8), e124.
Kahneman, D. (2011). Thinking, fast and slow. New York: Farrar, Straus, & Giroux.
Keller, E. F. (1983). A feeling for the organism: The life and work of Barbara McClintock. San Francisco:
W.H. Freeman and Co.
Khang, Y.-H. (2013). Two Koreas, war and health. International Journal of Epidemiology, 42, 925–929.
Knight, J. (2002). Sexual stereotypes. Nature, 415, 254–256.
Korb, K., & Nicholson, A. (2010). Bayesian artificial intelligence (2nd ed.). Boca Raton: Chapman & Hall/
CRC Press.
Kragh, H., & Smith, R. W. (2003). Who discovered the expanding universe? History of Science, 41(2), 141–162.
Kuhn, T. (1962/1970). The structure of scientific revolutions. Chicago: University of Chicago Press (1970,
2nd ed., with postscript).
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., . . . Zwaan, R. A. (2018).
Justify your alpha. Nature Human Behavior, 2, 168–171.
Lawson, R. (2006). The science of cycology: Failures to understand how everyday objects work. Memory
& Cognition, 34(8), 1667–1675.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data
analysis. Science, 343(6176), 1203–1205.
Le Cam, L. (1986). The central limit theorem around 1935. Statistical Science, 78–91.
Lee, T. M., Markowitz, E. M., Howe, P. D., Ko, C. Y., & Leiserowitz, A. A. (2015). Predictors of public
climate change awareness and risk perception around the world. Nature Climate Change, 5(11),
1014–1020.
Levins, R. (1966). The strategy of model building in population biology. American Scientist, 54, 421–431.
Levitt, S., & Dubner, S. J. (2005). Freakonomics: A rogue economist explores the hidden side of everything.
New York: William Morrow.
Lindley, D. V. (1993). The analysis of experimental data: The appreciation of tea and wine. Teaching
Statistics, 15, 22–25.
Lord, C. G., Ross, L., & Lepper, M. R. (1979). Biased assimilation and attitude polarization: The effects
of prior theories on subsequently considered evidence. Journal of Personality and Social Psychology,
37(11), 2098–2109.
Manson, M. (1893). Geological and solar climates: Their causes and variations. San Francisco: G.
Spaulding & Co.
McMullin, E. (1985). Galilean idealization. Studies in the History and Philosophy of Science, 16, 247–273.
Mendel, G. (1865/1996). Experiments in plant hybridization (W. Bateson, Trans.). Electronic scholarly
publishing project. (Original work published as Versuche über Plflanzenhybriden. Verhandlungen des
naturforschenden Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3–47). Retrieved from
www.esp.org/foundations/genetics/classical/gm-65.pdf
Michotte, A. (1962). The perception of causality. Andover, MA: Methuen.
Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67(4),
371–378.
Mill, J. S. (1893). A system of logic, ratiocinative and inductive: Being a connected view of the principles
of evidence and the methods of scientific investigation. New York: Harper & Brothers.
Morgan, M., & Boumans, M. J. (2004). Secrets hidden by two-dimensionality: The economy as a hydraulic
machine. In S. de Chadarevian & N. Hopwood (eds.), Model: The third dimension of science
(pp. 369–401). Stanford: Stanford University Press.
National Research Council. (1979). Carbon dioxide and climate: A scientific assessment. Washington DC:
National Academies Press.
Newton, I. (1671/1672). A letter of Mr. Isaac Newton, Professor of the Mathematicks in the University
of Cambridge; containing his new theory about light and colors: Sent by the author to the publisher
from Cambridge, Febr. 6. 1671/72; In order to be communicated to the R. Society. Philosophical
Transactions, 6, 3075–3087.
Newton, I. (1704/1998). Opticks: Or, a treatise of the reflexions, refractions, inflexions and colours of
light: Also two treatises of the species and magnitude of curvilinear figures. Commentary by Nicholas
Humez (Octavo ed.). Palo Alto: Octavo.
Oreskes, N. (2004). The scientific consensus on climate change. Science, 306(5702), 1686.
Oreskes, N., & Conway, E. (2010). Merchants of doubt. New York: Bloomsbury.
Parsons, H. M. (1974). What happened at Hawthorne? Science, 183(4128), 922–932.
Pashler, H., & Wagenmakers, E. J. (2012). Editors’ introduction to the special section on replicability in
psychological science: A crisis of confidence?. Perspectives on Psychological Science, 7(6), 528–530.
Peirce, C. S. (1903/1904) (1931–1936). The collected papers (Vols. 1–6, C. Hartshorne & P. Weiss,
eds.). Cambridge: Harvard University Press.
Pfungst, O. (1911). Clever Hans (The horse of Mr. von Osten): A contribution to experimental animal and
human psychology (C. L. Rahn, Trans.). New York: Henry Holt (Originally published in German, 1907).
Popper, K. (1963). Conjectures and refutations: The growth of scientific knowledge. London: Routledge
and Kegan Paul.
Pukelsheim, F. (1994). The three sigma rule. The American Statistician, 48(2), 88–91.
Rapoport, A., Seale, D. A., & Colman, A. M. (2015). Is tit-for-tat the answer? On the conclusions drawn
from Axelrod’s tournaments. PLoS One, 10(7), e0134128.
Reichenbach, H. (1938). Experience and prediction. Chicago: University of Chicago Press.
Retraction Watch. Tracking retractions as a window into the scientific process. Retrieved from http://
retractionwatch.com/
Rozin, P., Fischler, C., & Shields-Argelès, C, (2012). European and American perspectives on the meaning
of natural. Appetite, 59, 448–455.
Rudder, C. (2014). Dataclysm: Who we are when we think no one’s looking. New York: Crown Publishers.
Schaffer, S. (1989). Glass works: Newton’s prisms and the uses of experiment. In D. Gooding, T. Pinch, &
S. Schaffer (eds.), The uses of experiment: Studies in the natural sciences (pp. 67–104). Cambridge:
Cambridge University Press.
Schelling, T. C. (1969). Models of segregation. American Economic Review, 59, 488–493.
Schelling, T. C. (1971). Dynamic models of segregation. Journal of Mathematical Sociology, 1, 143–186.
Schoenfeld, J. D., & Ioannidis, J. P. (2013). Is everything we eat associated with cancer? A systematic
cookbook review. The American Journal of Clinical Nutrition, 97(1), 127–134.
Semmelweis, I. (1861/1983). The etiology, the concept and the prophylaxis of childbed fever (K. C.
Carter, Trans.). Madison: University of Wisconsin Press.
Simon, V. (2005). Wanted: Women in clinical trials. Science, 308(5728), 1517–1517.
Snow, J. (1855). On the mode of communication of cholera. London: John Churchill.
Squire, P. (1988). Why the 1936 Literary Digest poll failed. Public Opinion Quarterly, 52(1), 125–133.
Stanford, P. K. (2015 online first). Unconceived alternatives and conservatism in science: The impact of
professionalization, peer-review, and big science. Synthese, 1–18.
Stanziani, A. (2008). Defining natural product between public health and business, 17th to 21st centuries.
Appetite, 51, 15–17.
Teigen, K. H. (2002). One hundred years of laws in psychology. The American Journal of Psychology,
115, 103–118.
Thorgeirsson, T. E., Gudbjartsson, D. F., Surakka, I., Vink, J. M., Amin, N., Geller, F., .  .  . & Gieger,
C. (2010). Sequence variants at CHRNB3–CHRNA6 and CYP2A6 affect smoking behavior. Nature
Genetics, 42(5), 448–453.
Ullman, A. (2007). Pasteur-Koch. Distinctive ways of thinking about infectious diseases. Microbe, 2,
383–387.
United States Environmental Protection Agency. (2015). High lead levels in flint, Michigan. Retrieved from
www.epa.gov/sites/production/files/2015-11/documents/transmittal_of_final_redacted_report_to_
mdeq.pdf
Volterra, V. (1928). Variations and fluctuations of the number of individuals in animal species living
together. Journal du Conseil. Conseil Permanent International pour l’Exploration de la Mer, 3, 3–51.
Walton, D. (1989/2008). Informal logic: A pragmatic approach. Cambridge: Cambridge University Press.
Watson, J. D. (1968). The double helix. New York: Atheneum Press.
Weart, S. (2014). The public and climate change (since 1980). Retrieved from https://history.aip.org/
climate/public2.htm
Wegener, A. (1929/1966). The origin of continents and oceans. New York: Dover Publications.
Weisberg, D. S., Keil, F. C., Goodstein, J., Rawson, E., & Gray, J. R. (2008). The seductive allure of neu-
roscience explanations. Journal of Cognitive Neuroscience, 20(3), 470–477.
Weisberg, M. (2013). Simulation and similarity: Using models to understand the world. Oxford: Oxford
University Press.
Woodruff, G., & Premack, D. (1979). Intentional communication in the chimpanzee: The development of
deception. Cognition, 7(4), 333–362.
Woodward, J. (2016). The problem of variable choice. Synthese, 193(4), 1047–1072.
Index

Page numbers in italics indicate figures and in bold indicate tables on the corresponding pages.

abductive inference 156–159, 157; distinctive characteristics of 159–161, 162
abductive reasoning 156–161
abstraction 118
accuracy 119–120, 121
addition rule 174–175, 176
affirming the antecedent 133
affirming the consequent 135
al-Bīrūnī, ibn Aḥmad 19
alchemy 292
algorithms 113
al-Khwārizmī, ibn Mūsā 19
allergies, peanut 2
American Association for the Advancement of Science (AAAS) 31
ampliative inferences 153
analogical models 108–109, 109
Andromeda Nebula 127
anomaly 80
antecedents 130, 130; affirming of 133; denying the 135
anti-vaccination advocacy 28
appeal to ignorance 136–137
appeal to irrelevant authority 136
applied research 14
archaeology 161, 162
arguments 129; characteristics of inductive 153–155, 154; uncovering bad 134–137
Aristotle 19, 86–87, 125, 129, 289–290, 292
Arrhenius, Svante August 8, 9, 281
'Artificial Production of Carbon Dioxide and Its Influence on Temperature, The' 8
assumptions 99; auxiliary 58–59, 146–147
astrology 16, 28
asymmetric distribution 187, 188
atmospheric CO2 11, 11
autism 265–266
auxiliary assumptions 58–59, 146–147
average 191
Axelrod, Robert 113–114, 116, 122
axiomatic methods 147–148
axioms 147

background conditions 56
Bacon, Francis 93
bacteria: streptococcus 168, 170
bad arguments, uncovering 134–137
bar charts 184, 185, 186–187
basic research 14
Bayes, Thomas 234, 235
Bayes factor 236
Bayesian inference: Bayesian belief updating in 236–238; Bayesian conditionalization in 236–238; Bayes's theorem and 234–235; comparing support for different hypotheses using 235–236; problems with 238–239
Bayes nets 266, 266–271, 267–269, 269
Bay Model 89–90, 90, 94, 97, 102, 117, 128; analysis of 100; construction of 98; as scale model 106–107, 108
bell curve 187
bias: confirmation 33, 38; controlling for 68–70; funding 306; observer 50; publication 305; random variables and 173
bibliometric study 283
big data 84–85, 104
bimodal distribution 186, 186, 186–187
Blackawton Bees project 36
black holes 39, 40–41
blind experiments 69
brain: activity 42, 270; areas or regions 82, 270; damage 66, 82; metabolism 60
Broca, Paul 82

calibration of instruments 57, 60
Callendar, Guy 8–9, 9
California 89–91, 253
Cal Tech 35
calx 292
cancer 218, 236–238, 246–247, 251, 258, 269–272, 269
cannon thought experiment 86, 87
carbon dioxide, atmospheric see climate change
Carlsson, Arvid 14, 299
Cartwright, Nancy 248
case studies 80–82
causal background 251
causal Bayes nets 266, 266–271, 267–269, 269
causal conception 285
causal conception of explanation 284–286
causal hypotheses: germ theory of disease and 259–260; intervention and difference-making 255–257; Mill's methods of testing 257–258, 259; testing 255–260
causal modeling 262–272; approaches to 263–266, 264; assumptions of 271–272; causal Bayes nets in 266, 266–271, 267–269, 269
causation 242–253; correlation as guide to 247–249, 248; fracking and 242–245, 243; nature of 249–250; necessary and sufficient causes 250–251; probability and 251–253; scientific reasoning about 245; skepticism about 246; spatiotemporal contiguity as guide to 246–247
Centers for Disease Control and Prevention (CDC) 153, 248
central limit theorem 210–212
central tendency 187–191, 188–189, 190, 192
Cepheid variable 127
CERN (European Organization for Nuclear Research) 55–56, 67, 84, 162, 221, 224; developing a probability distribution 225–227, 227; using statistics to test hypotheses 221–230
Chattopadhyay, Raghabendra 76
chemical revolution 292–294, 293
cholera 78–80, 79
classical statistics 233–234
cleansing, data 104
Clever Hans 33–34, 34
climate change 7–11, 9–11, 28–29, 278–279; laboratory experiments on 74–75
Clinton, Hillary 207, 215, 217
cluster indicators 65
cohort studies 83
collaborative experiments 55
collecting data 56–58
collectively exhaustive outcomes 173
color 36, 47–52, 47, 54, 56, 58–61, 94–95, 168, 208–209, 209
common cause 248
computer models 113–114
computer simulations 85
conclusions of arguments 129
conditional probability 177–180, 179
conditionals 131
conditional statements 130, 130–132
confederates 70
confirmation bias 33, 38
conflicts of interest 35
confounding variables 49
consequent 130, 130; affirming the 135; denying the 133
continental drift 158
contributing cause 252
control groups 68, 78
Copernicus, Nicolaus 18, 19–20, 93, 290
correlated variables 196
correlation 195–197, 196; coefficient of 200; as guide to causation 247–249, 248; measures of 197, 197–201, 198–199; spurious 248; strength of 198
counterexamples 135
counterfactual statements 249
Craik, Kenneth 93
creationism and intelligent design 16–17, 28–29
Crick, Francis 102, 107, 107, 108, 295, 299
crisis 291, 292
cross-sectional studies 84
crucial experiments 58–60
curve-fitting 104, 105

Darwin, Charles 137, 197, 289, 294, 297, 301
data 41; big 84–85, 104; collection and analysis of 56–58; curve-fitting 104, 105; models of 103–105, 104; overfitting 104, 106; qualitative 57; quantitative 57, 183–184; questionnaire 57–58; sample 169; visualization of 84
data cleansing 104
data dredging 306
deception 69–70
deductive arguments 129
deductive reasoning: on age of the universe 125–128, 126; in case of puerperal fever 142–146, 143–144, 145; conditional statements in 130, 130–132; Flint, Michigan, water crisis and 150–151, 151; in hypothesis-testing 141–148; hypothetico-deductive (H-D) method 141–142; inference, argument and 128–129
defining science: by its history 18–21, 19; by its methods 23–26, 31–32; by its subject matter 21–23; tricky work of 16–17
denying the antecedent 135
denying the consequent 133
dependent variables 49–50, 66
Descartes, René 24
descriptive statistics 169–170; correlation in 195–201, 196, 197, 198–199; generalizing from 207–217; measures of central tendency in 187–191, 188–189, 190, 192; measures of variability in 191–195, 192, 193, 195; variables and their values in 182–184; visual representation of values of variables in 184–187, 185–187
de Vlamingh, Willem 154
Dianetics 136
difference-making 249–250; intervention and 255–257
Digges, Thomas 93
direct correlation 196
directed acyclic graphs 268–269
direct variable control 67
disease: germ theory of 259–260, 265; heart 77, 248, 253, 258, 272; hereditary 48, 95, 200; Parkinson's 14, 116; sexually transmitted 1; syphilis 66, 298, 301
distal causes 247
DNA (deoxyribonucleic acid) 11, 21, 94, 102, 295, 298–299, 299; analogical models of 108; scale model of 107, 107
Doppler, Christian 127
Doppler effect 127
double-blind experiments 69
drinking water 8, 76, 91, 150, 303
Dubner, Stephen 83
Du Châtelet, Émilie 53, 55
Duflo, Esther 76
Duhem, Pierre 147
Duhem-Quine problem 147, 156
DuPont 36
dyspnoea 268–269, 269

Early Childhood Longitudinal Study 83
earthquakes 158, 242, 244–246, 251, 287; and fracking 242, 244–247, 251, 260
ecological validity 75
economics 17, 20, 30, 71, 76, 266, 275, 280–281
Eddington, Arthur 64, 65, 146
Edwards, Marc 151, 162
Edwards v. Aguillard 151, 162
effect size 230
Einstein, Albert 64, 65, 145–146, 289, 290
electromagnetic radiation 61
Elements of Geometry 147–148
Elliott, Kevin 302–304
empirical evidence 23–25
empiricism 24
Environmental Protection Agency (EPA) 151
errors, sampling 216–217
estimating from samples 212–215, 213, 214
Ethyl Corporation 35–36, 298
Euclid 147–148, 290, 295
eugenics 201, 301
Europe 18–19, 21, 28, 154, 305
European Organization for Nuclear Research see CERN (European Organization for Nuclear Research)
evidence: definition of 25; empirical 23–25; falsification of 25–26
evidentialism 23–25
evolution, theory of 289, 294, 297, 301
exemplar 95
exemplification 95
expectancy bias 38
expectations 40–41; in perfectly controlled experiments 63–66, 65
experimental groups 68; choices in 77–78
experimentation, modeling as 115–116
experiments: blind 69; case studies and natural 80–83; collaborative 55; contributing to science 46–48, 47; crucial 58–60; double-blind 69; experimental setup of 55–56; exploratory 61; field 75–76; intervention 49, 66–67; laboratory 74–75; on light 51–54, 52, 54–55; other roles for 60–61; perfectly controlled 63–70; replication of 37–38, 59–60; thought 85–86; variables in 48–51, 51
explanation: causal 284–286; natural 22–23; nomological 279–284, 280; as pattern-fitting 282–284, 284; and understanding 275–286
explanatory knowledge 14, 277
exploratory experimentation 61
external experimental validity 75
extraneous variables 49–50

faithfulness 272
false positives 236–238
falsifiable claims 25
falsification 25, 26, 25–26, 154–155; openness to 26
field experiments 75–76
'final theory of everything' 25
Fisher, Ronald 225, 225
FiveThirtyEight 215, 216–217
Fizeau, Hippolyte 61
Flint, Michigan, water crisis 150–151, 151, 153, 162, 163
food allergies 2
fracking 242–245, 243
Franklin, Rosalind 295, 298–299, 299
Freon 36
frequency distributions 208–212, 209, 211
frequentist interpretation 233
Freud, Sigmund 63
fruit flies (Drosophila melanogaster) 95–96
functional magnetic resonance imaging (fMRI) 60
funding bias 306

Gage, Phineas 80–82, 81
Galilei, Galileo 20, 86–87
Gallup polls 216
Galton, Francis 197–201, 199, 200, 301
gambler's fallacy 180
game theory 64
Gauss, Carl Friedrich 210
Gaussian distribution 187, 210, 211
generality 119, 120
generalizations, inductive 152
General Motors 36
genetically modified organisms (GMOs) 22, 301
geocentrism 43
geometry 147–148, 295
Gianotti, Fabiola 222
glaciers 8, 29, 55–56
global warming 8, 13, 15, 23, 138, 163, 245, 278, 281, 303
Google Flu Trends 84
Gopnik, Alison 278
Grandin, Temple 300
greenhouse gases 7–8, 27, 29, 128, 281

Hansen, James 9
Harlow, John 81
Harvard Six Cities Study 78, 82, 83
Harvard University 35
Hauser, Marc 35
Hawthorne effect 50–51
Heezen, Bruce 157, 158
heliocentrism 20–21, 42, 43, 93
Hempel, Carl 142, 279–280
heredity 200
Herschel, William 53, 54, 55, 61
Hesse, Mary 94
Higgs boson 222, 222–223
Hindu-Arabic numeral system 18–19
histograms 186, 186, 186–187, 186–187, 193
history: of modeling 93–94; of science 18–21, 19
Homo sapiens 161, 162
Hubbard, L. Ron 136
Hubble, Edwin 126, 126–128, 131, 159
human reasoning, flaws in 33–34, 34
Hume, David 24, 155, 246, 249, 255
Huygens, Christian 159
hypotheses 39–40; alternative 223; deductive reasoning in testing 141–148; null 223–224, 226, 228, 229, 260; testing causal 255–260; underdetermination of 58, 59; using statistics to test 221–230
hypothetico-deductive (H-D) method 141–142, 159–160, 223; auxiliary assumptions in 146–147; in case of puerperal fever 142–146, 143–144, 145

Ibn al-Haytham 19, 51–52, 53
Ibn Rushd 19
Ibn Sina 19
ice cores 9, 10
idealizations, model 99, 118
illusion of explanatory depth 278, 279–280
illusion of understanding 12–13
importance of science 13–14
incentive structure in science 305
independent outcomes 174
independent variables 49–50, 66, 77, 256–257
indirect correlation 196
indirect variable control 67–68, 78
induction, problem of 155–156
inductive arguments 153–155, 154
inductive generalizations 152
inductive inference 151–152, 170
inductive projections 152
Industrial Revolution 11, 11
inferences: abductive 156–161; ampliative 153; bad reasons to reject 137–138; deductive reasoning 125–148; definition of 129; evaluating 132–134; inductive 151–152, 170; non-ampliative 150; problem of induction and 155–156; sound 134; strength of 153; testimony and 162–163
inference to the best explanation 158
inferential statistics 169–170; Bayesian inference 234–239; classical statistics and its problems in 233–234; considerations in designing statistical tests in 229–230; definition of 208; estimating from samples in 212–215, 213, 214; frequency distributions and probability distributions in 208–212, 209, 211; generalizing from descriptive statistics 207–217; representative samples in 215–216; used to test hypotheses 221–230
informal fallacy 136
ingenuity 36
Inhofe, James 278, 279
institutional care for children 83
instruments 55, 56–57; calibration of 57, 60
Intergovernmental Panel on Climate Change (IPCC) 12, 29
internal experimental validity 74, 76
intervention 49, 66–67; computer simulations 85; difference-making and 255–257; thought experiments 85–86
investigators, norms of 35–36
isomorphism 117

James, LeBron 168, 174, 179–180, 183–184, 208
Jebel Irhoud (Morocco) 161, 162
Jenner, Edward 265
joint method of agreement and difference 258, 259
joint probability distributions 266
justification 13

Kahlo, Frida 1
Kahneman, Daniel 32, 128
Keeling, C. David 9, 9
Keeling Curve 9, 10
Kehoe, Robert A. 36, 298
Kekulé, Friedrich August 31–32, 128
Kepler, Johannes 20
Kibble balance 57
Kitab al-Manazir (Book of Optics) 51–52
Kitzmiller v. Dover Area School District 29
knowledge 13; explanatory 14, 277; pure 13–14; scientific 13–14, 276–279
Koch, Robert 259–260
Kolletschka, Jakob 144–145
Korea 82
Kuhn, Thomas 290–291, 292, 293–294
Kyoto Protocol 7

laboratory experiments 74–75
La Divina Commedia 28
Landon, Alfred 216
Large Hadron Collider 42, 55, 67, 84, 221–222
Larsen effect 284–285
Lavoisier, Antoine-Laurent 292–294, 293
Lavoisier, Marie-Anne Paulze 292–294, 293
lead (Pb) 35–36, 150–151, 293, 298
Leborgne, Louis 82
Leibniz, Gottfried Wilhelm 24
Levitt, Steven 83
life sciences 17
light 51–54, 52, 54–55, 61; speed of 127
limitations of science 14
Literary Digest 216
Locke, John 24
logic 132, 134, 138, 147, 186, 245, 298, 315
longitudinal studies 83
Lotka, Alfred 98
Lotka-Volterra Model 98–99, 99, 102, 108; abstraction in 118; analysis of 100; as mathematical model 110–111; as theoretical use of modeling 116
lung cancer 269, 269–270

McClintock, Barbara 300
Malthus, Thomas Richard 289, 294, 297
mammography 236–237
Manson, Marsden 8
Markov condition 271–272
material conditionals 131
mathematical models 110–113, 112
Matilda effect 299
Mauna Loa Observatory (Hawai'i) 9
Maxwell, James 61
mean 191–192, 195; regression to the 200
measles, mumps, and rubella (MMR) vaccine 265
measurement error 57
mechanistic explanation 285
mechanistic models 109–110, 110
median 191
Mendel, Gregor 47–48, 95
mental models 93–94
meteorology 200
methodological naturalism 22
methods: axiomatic 147–148; defining science by its 23–26; explanation 40–41; hypotheses 39–40; myth of the scientific method and 31–32; observation 41–42; in science 38–39
Michotte, Albert 246
Milgram, Stanley 69–70, 75
Mill, John Stuart 257–258, 259
Mill's methods: of agreement 257, 259; of concomitant variations 257, 259; of difference 257, 259; joint 258, 259; of residues 258, 259
mode 186, 190
models: accuracy in 119–120, 121; analogical 108–109, 109; analysis of 100–101; assumptions in 99; Bay Model 89–90, 90; causal 262–272; characteristics of good 118–122, 121; computer 113–114; construction of 97–100, 99; of data 103–105, 104; as experimentation and theorizing 115–116; generality of 119, 120; history of 93–94; idealizations in 99, 118; mathematical 110–113, 112; mechanistic 109–110, 110; of phenomena 105–106; precision of 119, 120, 121; robustness of 119, 122; role of 90–93, 92; scale 106–107, 107; similarity and difference 93–96, 96; specification of target system(s) 96–97; three features shared by all 117–118; tractability of 119, 121–122; trade-offs in building 122; types of 102–103
Modern Synthesis 294–295
modularity 271
modus ponens 133
modus tollens 133
Monetary National Income Analogue Computer (MONIAC) 108, 109, 110
monotonicity 132–133
Montagu, Kathleen 14, 299
Möstlin, Michael 20
Mount Wilson Observatory (California) 126, 127
multiplication rule 175–176, 176
mutually exclusive outcomes 173

NASA (National Aeronautics and Space Administration) 73, 162, 304
National Institutes of Health 153
National Institute of Standards and Technology (NIST) 57
National Research Council 9
natural experiments 82–83
natural explanations 22–23
naturalism 22–23
naturalistic inquiry 22–23
natural phenomena 21–22
natural selection 294
nature of science 26–29, 27
Nazi Germany 298
necessary and sufficient causes 250–251
necessary condition 130, 130
negative correlation 196–197
Neptune 11
Newton, Isaac 18, 52, 55, 56, 60, 145–146, 290; cannon thought experiment 86, 87; controlling variables and 67; light experiments 52, 52–53, 55, 61, 67; scientific laws and 281–282; on space 64
nodes 263, 264
nomological conception of explanation 279–284, 280
non-ampliative inferences 150
non-experimental studies: case studies and natural experiments 80–83; cholera outbreak of 1854 78–80, 79; extending over time 83–84; variation from the perfect experiment 73
non-monotonic arguments 153
non-revolutionary scientific change 294–295
normal distribution 187, 194–195, 195
normal science 291, 292
normative versus descriptive claims in science 32
norms: of investigators 35–36; social 37–38, 38

objectivity 238, 297, 302, 304–305, 308
observable phenomena 21–22
observational studies: case studies and natural experiments 80–83; cholera outbreak of 1854 78–80, 79; definition of 79; extending over time 83–84
observations 41–42; statistical significance of 227–229
observer bias 50
observer-expectancy effect 33–34
OKCupid 84, 85
Oklahoma 242, 243, 244–246, 251, 278, 278
ontological naturalism 22
openness to falsification 26
operational definitions 65
Opticks 60, 61
Oreskes, Naomi 12
Origin of Species 294
outcome space 172
outliers 191
overfitting 104, 106
oxygen 60, 250, 293; as dephlogisticated air 293

Pangaea 156–158, 157
paradigms 291; pre-paradigmatic phase of science 291, 292
paradox of inquiry 97
parameters 98
Paris Agreement 7
partial cause 252
participation in science, women's 298–300
Pasteur, Louis 259, 265
pattern conception of explanation 282–284, 284
Patterson, Clair 35–36
Payne-Gaposchkin, Cecilia 298
payoff matrix 112, 112
Peano, Giuseppe 148
peanut allergies 2
Peirce, Charles Sanders 159
perfectly controlled experiments 63–70; controlling for bias in 68–70; controlling variables in 67–68; defining expectations in 63–66, 65; intervention in 66–67; variation from 73
Persian Golden Age 18, 18, 18–19, 20, 51
Pfungst, Oskar 34
phenomena 14; models of 105–106; natural 21–23; observable 21–22
Phillips, William 108
Phillips machine (MONIAC) 108, 109, 110, 117
Philosophiæ Naturalis Principia Mathematica 18
philosophy of science 3
phlogiston 292–293
phrenology 33
physical constants 60–61
physical processes 249–250
physical sciences 17
pie charts 184, 185
placebo effect 69
plagiarism 35
Planck, Max 57
plant fertilization 47, 47–48
plate tectonics 158
polio 1–2
pollution 78, 104, 268–271, 268–269
Popper, Karl 24, 26
populations and samples 169
population validity 75
positive correlation 196
posterior probability 234–235
post hoc, ergo propter hoc fallacy 247
power 230
precision of models 119, 120, 121
predictions 168
premises of arguments 129
Priestley, Joseph 293, 293
prior probability 234–235
prisoner's dilemma 111–113
probability and causation 251–253
probability distributions 208–212, 209, 211, 225–227, 227
probability theory 170–171; addition rule 174–175, 176; conditional probability 177–180, 179; multiplication rule 175–176, 176; random variables in 172–174; subtraction rule 176, 176–177
problem of induction 155–156
projections, inductive 152
prospective studies 83
proximate causes 247
pseudoscience 16–17, 28–29
psychoanalytic theory 63, 64
psychology 17, 20, 28, 32, 60, 93, 134, 284, 287
PsycLit database 283–284, 284
Ptolemy 19–20
publication bias 305
puerperal fever 142–146, 143–144, 145
pure knowledge 13–14
p-value 228–229, 233

qualitative data 57
qualitative variables 183
quantitative analysis 26
quantitative data 57, 183–184
quantitative variables 183
questionnaires 57–58
Quine, Willard van Orman 147

randomization 68, 69, 77–78
random sampling 216
random variables 172–174
range 192
Rapoport, Anatol 114
rational degree of belief 234
rationalism 24
reasoning: abductive 156–159; causal 242–272; deductive 125–148; definition of 128–129; statistical (see statistics)
Reber, John 91, 92, 93–95, 97, 100, 116–117, 123, 128, 277
Reber Plan 91–93, 92, 116, 117, 128; analysis of 100
recipes for science 3–4, 31–32, 39
reciprocal altruism 114
redshift 127
regression analysis 197–201, 198
regression to the mean 200
relative frequency distributions 208
relativity, theory of general 64, 65, 289, 295
Renaissance 17, 125
replication 37–38, 59–60
representative samples 215–216
Retraction Watch 35
retrograde motion 19, 19, 20
retrospective studies 83
robustness analysis 101
robustness of models 119, 122
role of science 11–13
Rømer, Ole 61
Roosevelt, Franklin D. 1, 216
Royal Society 56, 60

Safe Drinking Water Act 150
Salk, Jonas 1
sample data 169
samples: estimating from 212–215, 213, 214; populations and 169; representative 215–216
sample size 68; choices in 77
sample space 172
sample standard deviation 213–214, 214
sampling, random 216
sampling distribution 215
sampling errors 216–217
San Francisco Bay Model see Bay Model
scale models 106–107, 107
scatterplots 196, 196, 196–198, 198, 199, 200
science: climate change 7–11, 9–11; contributions of experiments to 46–48, 47; defined by its history 18–21, 19; defined by its methods 23–26, 31–32; defined by its subject matter 21–23; effects on daily life 1–2; expectations in 40–41; flaws in human reasoning and 33–34, 34; hypotheses in 39–40; illusion of understanding in 12–13; importance of 13–14; limitations of 14; methods in 38–39; models in (see models); nature of 26–29, 27; normative versus descriptive claims in 32; norms of investigators in 35–36; observations in 41–42; philosophy of 3; recipes for 3–4, 31–32, 39; role of 11–13; self-correction in 306–307; in social context 297–298; tricky work of defining 16–17; trust and objectivity challenges facing 304–307; value-free ideal in 300–301; values shaping 301–304, 302; why learn about 2–3; women in 298–300, 299
scientific breakthroughs 289–290
scientific law 281
scientific method 31–32
scientific progress 295–296
Scientific Revolution 18, 20–21, 290; chemical revolution and 292–294, 293; data collection during 56; Kuhn and 290–291, 292; scientific methods during 23
scientific theories 288–289
self-correction 306–307
self-explanation effect 277
self-interest 64
Semmelweis, Ignaz 142–145, 145
Seoul National University 35
significance level 227
similarity and difference 93–96, 96
Simpson's paradox (Edward Simpson) 253
68–95–99.7 rule 214, 214
skepticism 37–38; about causation 246
Snow, John 79–80, 80
Snyder, Rick 162
social context, science in 297–298
social norms 37–38, 38
social sciences 17
sound inferences 134
space exploration 73, 74
spatiotemporal contiguity as guide to causation 246–247
spurious correlations 248
Stahl, Georg Ernst 292–293, 293
standard deviation 194–195, 195, 227; sample 213–214, 214
standard error 215
Stapel, Diederik 35
State Research Centre of Virology and Biotechnology 153
statistical description 217
statistical evidence 168
statistically independent variables 177
statistical significance 227–229, 230
statistical thinking, importance of 167–169
statistics: descriptive (see descriptive statistics); importance of 167–169; inferential (see inferential statistics); populations and samples in 169; probability theory in 170–180
Stellar Atmospheres 298
strawman fallacy 136
strength, inference 153
string theory 25
Structure of Scientific Revolutions, The 290
Stumpf, Carl 34
subject matter, defining science by its 21–23
subjects, experimental 55
subtraction rule 176, 176–177
sufficient causes 250–251
sufficient condition 130, 130
supernatural entities and occurrences 22
super-observational access 42
surgical intervention 66–67, 76, 78
survey data 57–58

Tapestry of Values, A 302
target systems 93–96, 96, 103; specification of 96–97
taxi drivers 275–276, 280, 280–281, 282
Tertullian 93
testimony 162–163
Tharp, Marie 157, 158
theology/religion 14, 19–21, 82
theorems 147
theoretical claims 39
theories, scientific 288–289
theorizing and theory change: chemical revolution and 292–294; Kuhn's scientific revolutions and 290–291, 292; non-revolutionary scientific change and 294–295; scientific breakthroughs and 289–290; scientific progress and 295–296; scientific theories and 288–289
thought experiments 85–86
time, studies extending over 83–84
Tit-for-Tat 114, 116
Tolman, Edward 93
total probability 173
total trihalomethane (TTHM) 150
tractability of models 119, 121–122
Trinity College, Cambridge 56
Trump, Donald 207, 215–216, 304
trust 11–12, 15–16, 37–38, 38, 42, 60, 100, 117, 162, 290, 297, 304–305, 307
truth 28, 40–41, 58, 132–135, 137, 141–142, 150, 152–154, 158, 161, 163, 233, 249, 295, 297–298
Turing, Alan 298–299
Tuskegee Syphilis Experiment 298, 301
Tversky, Amos 32
type I error 229, 233
type II error 229, 233

underdetermination 58, 59
understanding 276–279; definition of 277; illusion of 12–13; illusion of explanatory depth and 279–280
unification conception of explanation 282
uniform distribution 187, 188–189
uniformity of nature 155–156
UC Berkeley 253
US Army Corps of Engineers 91, 93
US Dairy Association (USDA) 248
US Public Health Service 298, 301

vaccinations 1–2, 28; causal modeling of immunity and 263–266, 264
validity: deductive reasoning and 132; ecological 75; population 75
value-free ideal 300–301
value of a variable 49, 182–184; visual representation of 184–187, 185–187
values: shaping science 301–304, 302; trust and objectivity 304–307; value-free ideal and 300–301
variability 188; measures of 191–195, 192, 193, 195
variables 48–51, 51, 66–67; choices in 76–78; controlling 67–68; correlated 196; definition of 183; in descriptive statistics 182–202; qualitative 183; quantitative 183; random 172–174; value of 49, 182–187, 185–187
variance 192–194
variation 167
virus: cowpox 265; ebola 83, 229; human immunodeficiency virus (HIV) 11; human papilloma virus (HPV) 1–2; influenza 273; smallpox (variola) 153–154, 265; Zika 301
visualization, data 84
visual representation of values of variables 184–187, 185–187
Vitruvius 93
Volterra, Vito 98
von Osten, Wilhelm 33–34, 34

Wallace, Alfred Russel 32, 294
water crisis, Flint, Michigan 150–151, 151, 153, 162, 163
Watson, James 102, 107, 107, 108, 295, 299
Wegener, Alfred 156–158
Wells, Herbert George 167
Western Electric Hawthorne Factory 50, 51, 75
'Women as policy makers' 76
women in science 298–300, 299
Woo-suk, Hwang 35
World Health Organization (WHO) 153
World War 8, 113, 298

Zakariyya al-Razi, Bakr Muhammad ibn 19