
Design and Analysis in Educational Research Using jamovi

Design and Analysis in Educational Research Using jamovi is an integrated approach to learning about research design alongside statistical analysis concepts. Strunk and Mwavita maintain a focus on applied educational research throughout the text, with practical tips and advice on how to do high-quality quantitative research.
Based on their successful SPSS version of the book, the authors focus on jamovi in this version because of its accessibility as open-source software and its ease of use. The book teaches research design (including epistemology, research ethics, forming research questions, quantitative design, sampling methodologies, and design assumptions), introductory statistical concepts (including descriptive statistics, probability theory, and sampling distributions), basic statistical tests (like Z and t), and ANOVA designs, including more advanced designs like the factorial ANOVA and the mixed ANOVA.
This textbook is tailor-made for first-level doctoral courses in research design and
analysis. It will also be of interest to graduate students in education and educational
research. The book includes Support Material with downloadable data sets, and new case
study material from the authors for teaching on race, racism, and Black Lives Matter,
available at www.routledge.com/9780367723088.

Kamden K. Strunk is an Associate Professor of Educational Research at Auburn University, where he primarily teaches quantitative methods. His research focuses on intersections of racial, sexual, and gender identities, especially in higher education. He is also a faculty affiliate of the Critical Studies Working Group at Auburn University.
Mwarumba Mwavita is an Associate Professor of Research, Evaluation, Measurement, and Statistics at Oklahoma State University, where he teaches quantitative methods. He is also the founding Director of the Center for Educational Research and Evaluation (CERE) at Oklahoma State University.
“It is clear the authors have worked to write in a way that learners of all levels can understand
and benefit from the content. Notations are commonly recognized, clear, and easy to follow.
Figures and tables are appropriate and useful. I especially appreciate that the authors took
the time not only to address important topics and steps for conducting NHST and various
ANOVA designs but also to address social justice and equity issues in quantitative research
as well as epistemologies and how they connect to research methods. These are important
considerations and ones that are not included in many design/analysis textbooks.
This text seems to capture the elements often found in multiple, separate sources (e.g.,
epistemology, research design, analysis, use of statistical software, and considerations for
social justice/equity) and combines them in one text.
This is so helpful, useful, and needed!”
— Sara R. Gordon, Ph.D.,
Associate Professor
Center for Leadership and Learning
Arkansas Tech University, USA

“The ability to analyze data has never been more important given the volume of informa-
tion available today. A challenge is ensuring that individuals understand the connectedness
between research design and statistical analysis. Strunk and Mwavita introduce fundamental
elements of the research process and illustrate statistical analyses in the context of research
design. This provides readers with tangible examples of how these elements are related and
can affect the interpretation of results.
Many statistical analysis and research design textbooks provide depth but may not
situate scenarios in an applied context. Strunk and Mwavita provide illustrative examples
that are realistic and accessible to those seeking a strong foundation in good research
practices.”
— Forrest C. Lane, Ph.D.,
Associate Professor and Chair
Department of Educational Leadership
Sam Houston State University, USA

“Strunk and Mwavita provide a sound introductory text that is easily accessible to readers
learning applied analysis for the first time.
The chapters flow easily through traditional topics of null hypothesis testing and p-
values. The chapters include hand calculations that assist students in understanding
where the variance is and case studies at the end to develop writing skills related to each
analysis. In addition, software is integrated toward the end of the chapters after readers
have seen and learned to interpret the techniques by hand. Finally, the length of the book
is more manageable for readers as a first introduction to educational statistics.”
— James Schreiber, Ph.D., Professor
School of Nursing, Duquesne University, USA
Design and Analysis in Educational Research Using jamovi
ANOVA Designs

Kamden K. Strunk and Mwarumba Mwavita

First published 2022
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

and by Routledge
605 Third Avenue, New York, NY 10158

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2022 Kamden K. Strunk and Mwarumba Mwavita

The right of Kamden K. Strunk and Mwarumba Mwavita to be identified as authors of this work has been asserted by them in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in
any form or by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying and recording, or in any information storage or
retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
Names: Strunk, Kamden K., author. | Mwavita, Mwarumba, author.
Title: Design and analysis in educational research using Jamovi : ANOVA designs / Kamden K. Strunk & Mwarumba Mwavita.
Identifiers: LCCN 2021002977 (print) | LCCN 2021002978 (ebook) | ISBN 9780367723064 (hardback) | ISBN 9780367723088 (paperback) | ISBN 9781003154297 (ebook)
Subjects: LCSH: Education–Research–Methodology. | Educational statistics. | Quantitative research.
Classification: LCC LB1028 .S8456 2021 (print) | LCC LB1028 (ebook) | DDC 370.72–dc23
LC record available at https://lccn.loc.gov/2021002977
LC ebook record available at https://lccn.loc.gov/2021002978

ISBN: 978-0-367-72306-4 (hbk)
ISBN: 978-0-367-72308-8 (pbk)
ISBN: 978-1-003-15429-7 (ebk)

Typeset in Minion
by Straive, India
Access the Support Material: www.routledge.com/9780367723088
Contents

Acknowledgments

Part I: Basic issues
1 Basic issues in quantitative educational research
2 Sampling and basic issues in research design
3 Basic educational statistics

Part II: Null hypothesis significance testing
4 Introducing the null hypothesis significance test
5 Comparing a single sample to the population using the one-sample Z-test and one-sample t-test

Part III: Between-subjects designs
6 Comparing two sample means: The independent samples t-test
7 Independent samples t-test case studies
8 Comparing more than two sample means: The one-way ANOVA
9 One-way ANOVA case studies
10 Comparing means across two independent variables: The factorial ANOVA
11 Factorial ANOVA case studies

Part IV: Within-subjects designs
12 Comparing two within-subjects scores using the paired samples t-test
13 Paired samples t-test case studies
14 Comparing more than two points from within the same sample: The within-subjects ANOVA
15 Within-subjects ANOVA case studies
16 Mixed between- and within-subjects designs using the mixed ANOVA
17 Mixed ANOVA case studies

Part V: Considering equity in quantitative research
18 Quantitative methods for social justice and equity: Theoretical and practical considerations

Appendices
References
Index
Acknowledgments

We wish to thank Wilson Lester and Payton Hoover, doctoral students working with
Kamden Strunk, for their help in locating and sorting through potential manuscripts
for the case studies included in this book. We also thank Dr. William “Hank” Murrah
for his help in locating, verifying, and utilizing code for producing simulated data that
replicate the results of those published manuscripts for the case studies. Finally, thank
you to Payton Hoover and Hyun Sung Jang for their assistance in compiling the index.
We also wish to thank the many graduate students who provided feedback on various
pieces of this text as it came together. Specifically, Auburn University graduate students
who provided invaluable feedback on how to make this text more useful for students
included: Jasmine Betties, Jessica Broussard, Katharine Brown, Wendy Cash, Haven
Cashwell, Jacoba Durrell, Jennifer Gibson, Sherrie Gilbert, Jonathan Hallford, Elizabeth
Haynes, Jennifer Hillis, Ann Johnson, Minsu Kim, Ami Landers, Rae Leach, Jessica Mil-
ton, Alexcia Moore, Allison Moran, Jamilah Page, Kurt Reesman, Tajuan Sellers, Daniel
Sullen, Anne Timyan, Micah Toles, LaToya Webb, and Tessie Williams. For their help
and support throughout the process, we thank Hannah Shakespeare and Matt Bickerton
of Routledge/Taylor & Francis Group.
Kamden Strunk wishes to thank his husband, Cyrus Bronock, for support throughout
the writing process, for listening to drafts of chapters, and for providing much needed
distractions from the work. He also wishes to thank Auburn University and, in par-
ticular, the Department of Educational Foundations, Leadership, and Technology for
supporting the writing process and for providing travel funding that facilitated the col-
laboration that resulted in this book.
Mwarumba Mwavita wishes to thank his wife, Njoki Mwarumba, for her encouragement and support throughout the writing process and for reminding him that he could do it. He also thanks his sons, Tuzo Mwarumba and Tuli Mwarumba, for cheering him along the way. This book is dedicated to you, family.

Part I
Basic issues

1
Basic issues in quantitative educational research

Research problems and questions
Finding and defining a research problem
Defining and narrowing research questions
Reviewing the literature relevant to a research question
Finding published research
Reading published research and finding gaps
Types of research methods
Epistemologies, theoretical perspectives, and research methods
Epistemology and the nature of knowledge
Connecting epistemologies to perspectives and methods
Overview of ethical issues in human research
Historical considerations
The Belmont Report
The common federal rule
Conclusion

The purpose of this text is to introduce quantitative educational research and explore
methods for making comparisons between and within groups. In this first chapter, we
provide an overview of basic issues in quantitative educational research. While we can-
not provide comprehensive coverage of any of these topics, we briefly touch on several
issues related to the practice of educational research. We begin by discussing research
problems and research questions.

RESEARCH PROBLEMS AND QUESTIONS


Most educational research begins with the identification of a research problem. A
research problem is usually a broad issue or challenge. For example, a research problem
might be the racial achievement gap in K-12 schools, low retention rates in science, tech-
nology, engineering, and mathematics (STEM) majors, or lower educational attainment
of LGBTQ students. The research problem is broad in that it might lead to many different
questions, have many different facets, and give rise to many different interpretations. In
research careers, it is often the case that a researcher spends years or even decades pur-
suing a single research problem via many different questions and studies.


Finding and defining a research problem


Often, people arrive at a research problem based on their observations or experiences
with educational systems. There may be some issue that sticks with an individual that
they ultimately pursue as a research problem. They may arrive at that problem because of
noticing a certain pattern that keeps repeating in classrooms, or by experiencing some-
thing that does/does not work in education, or by reading about current issues in educa-
tion. They may feel personally invested in a particular problem or simply be bothered by
the persistence of a particular issue. Sometimes, students might find a research problem
through working with their advisor or major professor. They might also find a problem
by working with one or more faculty on research projects and finding some component
of that work fascinating. However individuals end up being drawn toward a particular topic or issue, turning that interest into a research problem and creating related research questions requires some refinement and definition. We will next explore the major features of a good research problem.

Broad
First, research problems should be broad and have room for multiple research questions
and approaches. For example, racialized achievement gaps are a broad research prob-
lem from which multiple questions and projects might emerge. On the other hand, the
question of whether summer reading programs can reduce racialized gaps in reading
achievement is quite narrow—and, in fact, is a research question. The problem of how
to increase the number of women who earn advanced degrees in STEM fields is a broad
research problem. On the other hand, asking if holding science nights at a local elemen-
tary school might increase girls’ interest in science is quite narrow and is also likely a
research question. While we will work to narrow down to a specific research question,
it is usually best to begin by identifying the broad research problem. Not only does that
help position the specific study as part of a line of inquiry, but it also helps contextualize
the individual question or study within the broader field of literature and prior research.

Meaningful
Research problems should be meaningful. In other words, “solving” the research prob-
lem should result in some real impact or change. While research problems are often
too big to be “solved” in any one study (or perhaps even one lifetime), they should be
meaningful for practical purposes. Closing racialized achievement gaps is a meaningful
problem because those gaps are associated with many negative outcomes for people of
color. Increasing the number of women with advanced degrees in STEM is meaningful
because of its impact on gender equity and transforming cultures of STEM fields.

Theoretically driven
Research problems should also be theoretically driven. These problems do not exist in
a vacuum. As we will emphasize throughout the text, one important part of being a
good researcher is to be aware of, and in conversation with, other researchers and other
published research. One way in which this happens is that other researchers will have
produced and proliferated theoretical models that aim to explain what is driving the
problems they study. Those theories can then inform the refinement and specification
of the problem. Our selection of research problems must be informed by this theoretical
landscape, and the most generative research problems researchers select tend to be those
that address gaps in existing theory and knowledge.

Defining and narrowing research questions


Having defined a broad, meaningful, and theoretically driven research problem, the next
step is to identify a specific research question. Research questions narrow the scope of a
research problem to that which can be answered in a single study or series of studies. Good
research questions will be answerable, specific, operationally defined, and meaningful.

Answerable
Good research questions are answerable within the scope of a single study. Part of this
has to do with the narrowness of the question. Questions that are too broad (have not
been sufficiently narrowed down from the bigger research problem) will be impossible to
answer within a study. Moreover, an answerable question will be one that existing research
methods are capable of addressing. Taking one of our earlier examples of a research prob-
lem, the persistence of racialized achievement gaps in schools, we will give examples of
research questions that are answerable and those that aren’t. An example: What is the best
school-based program for reducing racialized gaps in fifth-grade reading achievement?
This is not an answerable question because no research design can determine the “best”
school-based program. Instead, we might ask whether a summer reading program or an
afterschool reading program was associated with lower racialized reading achievement
gaps in the fifth grade. This is a better question because we can compare these two pro-
grams and determine which of those two was associated with better outcomes.

Specific and operationally defined


Part of the difference between these two questions is the level of specificity. Narrowing down
to testing specific programs creates a better question than an overly broad “what is best” kind
of question. Both examples are specific in another way: they specify the kind of achievement
gap (reading) and the point in time at which we intend to measure that gap (fifth grade). It is
important to be as specific as possible in research questions to make them more answerable
and to define the scope of a study. Researchers should specify the outcome being studied
in terms of specific content, specific timeframe, and specific population. They should also
specify the nature of the program, intervention, or input variable they intend to study—in
our case specifying the kinds of programs that might close reading achievement gaps.
Importantly, each of those elements of the research question will need to be operation-
ally defined. How do we plan to define and measure reading achievement? How will we
define and measure race? What is our definition of a racialized achievement gap? What are
the specific mechanisms involved in the after-school and the summer reading programs?
Each of these elements must be carefully defined. That operational definition (operational
because it is not necessarily a universal definition, but is how we have defined the idea or
term for this particular “operation” or study) shapes the design of the study and makes it
more possible for others to understand and potentially reproduce our study and its results.

Meaningful
Finally, as with broad research problems, research questions should be meaningful. In
what ways would it be helpful, lead to change, or make some kind of difference if the
answer to the research question was known? In our example, is it meaningful to know
if one of the two programs might be associated with reductions in racial achievement
gaps in reading at the fifth-grade level? In this case, we might point to this as an impor-
tant moment to minimize reading gaps as students progress from elementary to middle
school environments. We might also point to the important societal ramifications of
persistent racialized achievement gaps and their contribution to systemic racism. There
are several claims we might make about how this question is meaningful. If we have a
meaningful research problem, it is likely that most questions arising from that problem
will also be meaningful. However, researchers should evaluate the meaningfulness of
their questions before beginning a study to ensure that participants’ time is well spent,
and the resources involved in conducting a study are invested wisely.

REVIEWING THE LITERATURE RELEVANT TO A RESEARCH QUESTION


As we explained earlier, one important way to decide on research problems and ques-
tions is through reading and understanding what has already been published. Knowing
what has already been published allows us to situate our work in the literature, and also
ensures our work is contributing to ongoing scholarly conversations, not simply repeat-
ing what has already been written. In order to do so, researchers must first know the pub-
lished literature and be able to review and synthesize prior work. That means reading—a
lot. However, first, how can we find existing literature?

Finding published research


If you are a university student, it is likely your university library has many resources
for finding published research. Many universities employ subject-area librarians who
can help you get acquainted with your university’s databases and search programs. Your
faculty may also be able to guide you in this area. There are many different databases and
systems for accessing them across universities. Some common systems include Pro-
Quest and EBSCOhost, but there are many others. Typically, these systems connect with
specific databases of published work. These might include databases such as Academic Search Premier, which has a wide range of journals from various topic areas; ERIC, which has a lot of educational research but includes many non-peer-reviewed sources as well; PsycINFO and PsycARTICLES, which have psychological research and many educational research journals; and SPORTDiscus, which has many articles related to physical
activity and kinesiology. There are several other databases your library might have access
to that might prove relevant depending on the topic of your research. An important
feature common to most of these databases is the ability to narrow the search to only
peer-reviewed journal articles. For the most part, peer-reviewed journal articles are what you will want to read.
Peer-reviewed journal articles are articles published in a journal (typically very disci-
pline and subject-area specific publications) that have undergone peer review. Peer review
is a process wherein a manuscript is sent to multiple acknowledged experts in the area for
review. Those reviewers, typically between two and four of them, provide written com-
ments on a manuscript that the author(s) have to address before the manuscript can be
published as an article. In other cases, the reviewers might recommend rejecting the manu-
script altogether (in fact, most journals reject upwards of 80% of manuscripts they receive).
Often, manuscripts go through multiple rounds of peer review before publication. This is
a sort of quality control for published research. When something is published in a peer-re-
viewed journal, that means it has undergone extensive review by experts in the field who
have determined it is fit to be published. Ideally, this helps weed out questionable studies
or papers using inadequate methods. It is not a perfect system, and some papers are still
published that have major problems with their design or analysis. However, it is an impor-
tant quality check that helps us feel more confident in what we are reading and, later, citing.
Other ways of finding published research exist too. Many public libraries have access
to at least some of the same databases as universities, for example. There are also sources
like Google Scholar (as opposed to the more generic Google search) that pull from a
wide variety of published sources. Google Scholar includes mostly peer-reviewed prod-
ucts, though it will also find results from patent filings, conference proceedings, books,
and reports. While journal databases, like those listed above, will require journals to
meet certain criteria to be included (or “indexed”), Google Scholar has no such inclusion
criteria. So, when using that system, it is worth double-checking the quality of journals in
which articles are published. Another way to keep up to date with journals in your spe-
cific area of emphasis is to join professional associations (like the American Educational
Research Association [AERA], the American Educational Studies Association [AESA],
and the American Psychological Association [APA]) relevant to your discipline. Many of
them include journal subscriptions with membership dues, and reading the most recent issues can help you stay up to date.

Reading published research and finding gaps


As you read more and more recently published research, you will begin to notice what’s
missing. Sometimes it will be a specific idea, variable, or concept that starts to feel stuck
in your head but isn't present in the published research. Other times, you will have some idea you hope to learn what the research has to say about, but you will be unable to find much on the topic. Of course, sometimes what we think is missing in the literature is really just
stated in another way. So, it is important to search for synonyms, different phrasings, and
alternative names for things. However, over time, as you read, you will start to notice the
areas that have not yet been addressed in the published research. Those gaps could be a
population that has not been adequately studied, a component of theoretical models that
seems to be missing, a variable researchers have not adequately included in prior work,
or a reframing of the existing questions. Those gaps are often a great way to identify
research problems and questions for future research. Finding gaps is not the only way to
identify necessary new research, but it is a common method for doing so.
So, how much should new researchers read? The answer is, simply, a lot. Specific guide-
lines on how much to read in the published research vary by discipline and career goal.
For our advisees who are in Ph.D. programs with research-oriented career goals, we rec-
ommend a pace of three to five journal articles per week during the Ph.D. program. That
pace helps students read enough before beginning the dissertation to be able to construct a
comprehensive literature review. However, again, the specific rate will vary by person, field,
and career goals. The important part is to continue reading. Note, too, that this reading
is likely to be in addition to required reading for things like coursework, as this reading is
helping you develop specific and advanced expertise in the area of your research problem.
One key point in reading the published research is that authors write from a variety
of different perspectives. Some are engaging different theoretical models that seek to
explain the same problem using different tools. Other times, the differences reflect qual-
itative, quantitative, and mixed-method differences in how research is conceptualized
and presented. The differences also might relate to theories of knowledge or epistemol-
ogies. Those epistemologies shape vastly different ways of writing and different ways
of presenting data and findings or results. In the next section, we briefly describe the
methodological approaches that are common in educational research: qualitative and
quantitative methods. We then turn to questions of epistemology.

TYPES OF RESEARCH METHODS


There are two main approaches to educational research: qualitative and quantitative.
Importantly, the distinctions between these are not as cut and dried as they might appear
in our description of them. While we provide some ways of distinguishing between
these kinds of research, they do not exist in opposition to one another. Many researchers
make use of elements of qualitative and quantitative methods (multimethod research)
and others blend qualitative and quantitative approaches and analyses (mixed-method
research). In practice, the lines between methods can become blurry, but the purpose of
this section is to provide some basic sense that there are different kinds of approaches that
answer different kinds of questions with different sorts of data. In general, quantitative
research deals with numbers and things that can be quantified (turned into numbers).
This textbook focuses on quantitative research. Qualitative research deals with things
that are not numbers or that cannot be quantified (like textual or interview data), though
some qualitative research also includes numbers, especially frequencies or counts. These
two kinds of research also ask different kinds of questions. We will briefly explain both
using the questions: What kinds of questions can be asked? What kinds of data can be
analyzed? How are the data analyzed? What kinds of inferences are possible?

• What kinds of questions can be asked? In quantitative research, questions typically center around group differences, changes in scores over time, or the rela-
tionship among variables. Usually, these questions are focused on explaining
or predicting some kind of quantifiable outcome. How are test scores different
between groups getting treatment A versus treatment B? How does attention
change across three kinds of tasks? What is the relationship between attention and
test score? These questions are all quantitative sorts of questions, and all involve
specifying a hypothesis beforehand and testing if that hypothesis was correct.
Qualitative research answers very different kinds of questions. They usually do not
involve pre-formulated hypotheses that are subjected to some kind of verification
test. Instead, qualitative research usually seeks deep description and understand-
ing of some idea, concept, discourse, phenomenon, or situation. How do students
think about the purpose of testing? How do teachers think about attention in plan-
ning lessons? Qualitative work will normally not test group differences or evaluate
the association between variables but will instead seek to provide a deeper under-
standing of a specific moment, situation, concept, person, or idea.
• What kinds of data can be analyzed? In quantitative research, the data must
be numeric. These data might be scores from survey items or scales, test scores,
counts of observable phenomena, demographic information, group member-
ship, self-report scores, and other types of numeric information. Data that are not
inherently numeric must be converted to numeric data through some kind of coding, measurement, or labeling process (a brief sketch of this idea follows this list). In qualitative research, data are normally
non-numeric. They might include interviews, focus groups, documents, observations (including participant observation), or texts. In most cases, none of the data
under analysis will be numeric, though there are times that some qualitative stud-
ies include some information (usually to describe participants) that is numeric or
categorical.
• How are the data analyzed? In quantitative research, the analysis is almost always
done via one or more statistical tests. In most cases, the researcher will specify a
hypothesis ahead of time and test whether the data support that hypothesis using
statistical analysis. One study might involve multiple hypotheses and tests, as well.
Most of this textbook is devoted to a set of those statistical analyses. Qualitative
data analysis can take multiple forms. The data are not numeric, so the kind of
statistical testing mentioned above simply does not fit with this kind of research.
In one way of approaching qualitative data analysis, sometimes called deductive
analysis, researchers approach the data through an existing theory and look for
how that theory might make sense of the data. Sometimes this is done through a
process called coding, where specific kinds of information are marked, or coded, in
the data to look for commonalities. In another kind of analysis, sometimes called
inductive analysis, the researchers work through the data to see what ideas come
up in multiple data sources (e.g., multiple interviews, several documents) that
can be used to understand common threads in the data. In most cases, qualitative
manuscripts will present themes from the data, whether derived from a theoretical
model or inductive analysis, that help summarize the data and make sense of pat-
terns that emerged. Qualitative analysis is also usually iterative, meaning research-
ers work through the data multiple times in an attempt to identify key themes or
codes.
• Finally, what kinds of inferences are possible? In quantitative work, researchers
often attempt to make claims about causation (i.e., cause–effect relationships) and
generalization (i.e., that those relationships would be present in people outside
the study as well). Both of these are subject to limitations and align with particu-
lar views about what is possible in research (more on that later in this chapter).
However, those are relatively common claims in quantitative research. Quantita-
tive researchers also make claims about differences between groups and associa-
tions between variables. In qualitative research, there is usually no attempt to claim
causation or generalization. The inferences are more focused on how a theoretical
model helps make sense of patterns in data, or how those data might offer a deeper
understanding of some idea, concept, or situation. It can initially seem like qual-
itative work makes narrower inferences than quantitative work. In actuality, they
make different kinds of inferential claims that are meant to serve different aims.
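To make the idea of "coding" non-numeric information concrete, here is a minimal sketch of how a group label might be converted to numbers so that a quantitative question (a difference in mean scores between two groups) can be asked of the data. The scores, labels, and variable names below are invented for illustration, and the snippet is written in R (the language underlying jamovi's analyses); in jamovi itself, this kind of coding and comparison is handled through the point-and-click interface rather than by writing code.

    # Hypothetical data: a numeric outcome and a non-numeric group label
    scores <- c(78, 85, 92, 70, 88, 95)        # test scores (invented values)
    group  <- c("A", "A", "A", "B", "B", "B")  # treatment group labels (invented)

    # Coding step: convert the labels into numbers (treatment A = 0, treatment B = 1)
    group_coded <- ifelse(group == "A", 0, 1)

    # A quantitative question: how do mean test scores differ between the two groups?
    tapply(scores, group_coded, mean)

Running the last line prints the mean score for each coded group, which is exactly the kind of group comparison that later chapters formalize with t-tests and ANOVA.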

When forming research questions, as discussed earlier in this chapter, it is very impor-
tant to realize that some kinds of research questions are well-matched with quantita-
tive methods, and others are well-matched with qualitative methods. We cannot stress
enough that neither kind of research is “better” than the other—they simply answer
different kinds of questions, and both are valuable. We also strongly recommend that
anyone planning to do mostly quantitative research should still learn at least the basics
of qualitative research (and vice versa). There are several well-written introductory texts
on qualitative research that might help build a foundation for understanding qualita-
tive work (Bhattacharya, 2017; Creswell & Poth, 2017; Denzin & Lincoln, 2012), and
students should also consider adding one or more qualitative methods courses to gain
a deeper understanding. If you find yourself asking research questions that are not well
matched with quantitative methods, but might be matched with qualitative methods, do
not change your question. Methods should be selected to match questions, and if your
questions are not quantitative kinds of questions, then qualitative methods will provide
a more satisfying answer. The remainder of this text focuses on quantitative methods,
however. Next, we turn to the question of epistemologies and their potential alignments
with research methods.

EPISTEMOLOGIES, THEORETICAL PERSPECTIVES, AND RESEARCH METHODS
It might occur to you while reading the previous sections that there are many differ-
ent research approaches and strategies that people engage. Those approaches are often
related to underlying beliefs about knowledge, the production of knowledge, truth, and
what constitutes data. In this section, we briefly overview the issues of epistemologies,
theoretical perspectives, and their connection to research methods. There are resources available to learn more about these epistemological perspectives, and Crotty (1998) is an
excellent resource for going deeper with these ideas. We also suggest Lather (2006) and
Guba and Lincoln (1994) as additional resources for learning more about epistemologies
and their relationship to research methods and knowledge production.

Epistemology and the nature of knowledge


One way in which various research approaches differ is in their epistemology. While
many quantitative methods courses and books avoid this topic entirely, knowing about
the major epistemological perspectives can help clarify how research approaches differ.
Epistemology refers to an individual’s beliefs about truth and knowledge. For our pur-
poses, we focus on some key questions: What can we know? How do we generate and
validate knowledge? What is the purpose of research? We will briefly overview several
major perspectives. We do want to be clear that our brief treatment in this section cannot
adequately capture the nuance, diversity, or depth of any of these perspectives, but we
intend to highlight the basics of each.

Positivism
In positivism, there is an absolute truth that is unchanging, universal, and without
exception. Not only does such an absolute truth exist, but it is possible for us to know
that truth with some degree of certainty. The only limitation in our ability to know the
absolute truth is the tools we use to collect and analyze data. This perspective holds there
is absolute truth to everything—physics, sure, but also human behavior, social relations,
and cultural phenomena. As a point of clarification, sometimes students hear about pos-
itivism and associate it with religious or spiritual notions of universal moral law or uni-
versal spiritual truths. While there are some philosophical connections, positivism is not
about religious or spiritual beliefs, but about the “truth” of how the natural and social
worlds work.

To turn to our guiding questions: What can we know? In positivism, we could know
just about anything and with a high degree of certainty. The only barrier is the adequacy
of our data collection and analysis tools. How do we generate and validate knowledge?
Through empirical observations, verifiable and falsifiable hypotheses, and replication.
While much work in positivistic frames is quantitative, some qualitative approaches and
scholars subscribe to this epistemology as well. So, the tools don’t necessarily have to be
quantitative, but a positivist would be concerned with the error in their tools and analy-
sis, with sampling adequacy, and other issues that might limit claims to absolute truth. In
verifying knowledge, there is an emphasis on reproducibility and replication, subjecting
hypotheses to multiple tests to determine their limits, and testing the generalizability of
a claim. Finally, what is the purpose of research? In positivistic work, the purposes of
research are explanation, prediction, and control.

Post-positivism
In post-positivism, many of the same beliefs and ideas from positivism are present. The
main difference is in the degree of confidence or certainty. While post-positivists believe
in the existence of absolute truth, they are less certain about our ability to know it. It
might never be possible, in this perspective, to fully account for all the variation,
nuance, detail, and interdependence in things like human social interaction. While this
perspective suggests there is an absolute truth underlying all human interaction, it might
be so complex that we will never fully know that truth.
What can we know? In theory, anything. However, in reality, our knowledge is very
limited by our perspectives, our tools, our available models and theories, and the finite
nature of human thought. How do we generate and validate knowledge? In all of the same
ways as in positivism. One difference is that in the validation of knowledge, post-posi-
tivistic work tends to emphasize (perhaps even obsess over) error, the reduction of and
accounting for error, and improving statistical models to handle error better. That trend
makes sense because, in this perspective, error is the difference between our knowledge
claims and absolute truth. Finally, what is the purpose of research? As with positivism, it
is explanation, prediction, and control.

Interpretivism
Interpretivism marks a departure from post-positivism in a more dramatic way. In inter-
pretivism, there is no belief in an absolute truth. Instead, truths (plural) are situated in
particular moments, relationships, contexts, and environments. Although this perspec-
tive posits multiple truths, it is worth noting that it does not hold all truth claims as
equal. There is still an emphasis on evidence, but without the idea of a universal or abso-
lute truth. There are a variety of interpretivist perspectives (like constructivism, symbolic
interactionism, etc.), but they all hold that in understanding social dynamics, truths are
situated in dynamic social and personal interactions.
What can we know? We can know conditional and relational truths. Though there is
no absolute universal truth underlying those truth claims, there is no reason to doubt
these conditional or relational truths in interpretivism. How do we generate and validate
knowledge? We can generate knowledge by examining and understanding social rela-
tions, subjectivities, and positional knowledges. In validating knowledge, interpretivist
researchers might emphasize factors like authenticity, trustworthiness, and resonance.
Finally, what is the purpose of research in interpretivism? To understand and provide thick, rich description.

Critical approaches
Critical approaches are diverse and varied, so this term creates a broad umbrella. How-
ever, in general, these approaches hold that reality and truth are subjective (as does
interpretivism) and that prevailing notions of reality and truth are constructed on the
basis of power. Critical approaches tend to emphasize the importance of power, and that
knowledge (and knowledge generation and validation systems) often serve to reinforce
existing power relations. A range of approaches might fall into this umbrella, such as
critical theory, feminist research, queer studies, critical race theory, and (dis)ability stud-
ies. Importantly, each of those perspectives also has substantial variability, with some
work in those perspectives falling more into deconstructivism. Because in reality, there
is wide variability in how people go about doing research, the lines between these rough
categories are often blurred.
What can we know? We can know what realities have been constructed, and we can
critically examine how they were constructed and what interests they serve. How do we
generate and validate knowledge? Through tracing the ways that power and domination
have shaped social realities. There is often an emphasis on locating and interrogating
contradictions or ruptures in social realities that might provide insight into their role in
power relations. There is also often an emphasis on advocacy, activism, and interrupting
oppressive dynamics. What is the purpose of research? To create change in social reali-
ties and interrupt the dynamics of power and oppression.

Deconstructivism
Deconstructivism is another large umbrella term with a lot of diverse perspectives under
it. These might be variously referred to as postmodernism, poststructuralism, decon-
structivism, and many other perspectives. These perspectives generally hold that reality
is unknowable, and that claims to such knowledge are self-destructive. Although truths
might exist (or at least, truth claims exist), they are social constructions that consist of
signs (not material realities) and are self-contradictory. Work in this perspective might
question notions of reality and knowledge or might critique (or deconstruct) the ways
that knowledges and truth claims have been assembled. There is some overlap with criti-
cal perspectives in that many deconstructivist perspectives also hold that the assemblages
of signs and symbols that construe a social reality are shaped by power and domination.
What can we know? In this perspective, we cannot claim to know, because the very existence of truth is in question. We can, however, interrogate and deconstruct truth
claims, their origins, and their investment with power. How do we generate and validate
knowledge? In deconstructivist perspectives, researchers often critique or deconstruct
existing knowledge claims rather than generating knowledge claims. This is because of
the view that truth/knowledge claims are inherently contradictory and self-defeating.
What is the purpose of research? To critique the world, knowledge, and knowability. One
of the purposes of deconstructivist research is to challenge those notions, pushing others
to rethink the systems of knowledge that they have accumulated.

Connecting epistemologies to perspectives and methods


In briefly reviewing major epistemological frames, we want to emphasize that episte-
mologies often do not fit neatly into these categories, nor are these the only kinds of
epistemologies. These paradigms are quite expansive, and many researchers identify
somewhere between these categories or with parts of more than one. In other words,
the neatness with which we present these frames in this text is deceiving in that the
reality of research and researchers is much messier, richer, and more diverse. One dis-
tinction that is common between qualitative and quantitative work is the openness with
which researchers discuss their epistemological positions. Many qualitative researchers
describe in some depth their epistemological and ideological positions in their published
work. By contrast, the inclusion of that discussion is quite rare in published quantita-
tive work. However, the ideological and epistemological stakes very much matter to the
kinds of research a researcher does and the kinds of questions they ask.
One way that this happens is in the selection and mobilization of a theoretical per-
spective. As we described earlier in this chapter, good research questions are theoretically
driven. Those theories have ideological and epistemological stakes. In other words, the
selection of a theory or theoretical model for research is not a neutral or detached deci-
sion. Theories and their use emerge from particular epistemological stances and attempts
to engage theories apart from their epistemological foundations are often frustrated. A
key issue for this text, which focuses on quantitative analysis, is that most quantitative
methods come from positivist and post-positivist epistemologies. One reason quantita-
tive manuscripts often do not discuss epistemology is that there is a strong assumption
of post-positivism in quantitative work. In fact, as we will discover in later chapters, the
statistical models we have available are embedded with assumptions of positivism. That
is not to say that all quantitative work must proceed from a post-positivist epistemology.
However, being mindful of the foundations of quantitative methods in post-positivism,
researchers who wish to engage these methods from other epistemological foundations
will need to work with and in the tension that creates.
There is often some natural alignment between epistemology, theoretical perspective,
and research method. Each method was created in response to a specific set of theoret-
ical and epistemological beliefs. As a result, some methods more easily fit with certain
theoretical perspectives which more easily fit with a particular epistemology. We have
hinted at the fact that quantitative methods were designed for post-positivist work and
thus fit more easily with that epistemology. There is also an array of theoretical per-
spectives that emerge from post-positivist work that are thus more easily integrated in
quantitative work. But to reiterate, it is possible to do interpretivist or critical work using
quantitative methods. In future chapters, we will highlight some case studies that do so.
Any such work requires careful reflection and thought, especially about the assumptions of quantitative work. Regardless of your position, we
strongly urge students and researchers to consider their own epistemological beliefs and
how they influence and shape the directions of their research.

OVERVIEW OF ETHICAL ISSUES IN HUMAN RESEARCH


In this final section, we overview the landscape in the United States for research ethics.
Many of these principles are common to other contexts, but the language, specific reg-
ulations, and processes will vary. If you are in a context other than the United States, be
sure to consult your ethical regulations. Ethics comprises a broad field of philosophy,
but this section is much more narrowly defined. Research ethics with human research
specifically refers to the norms, traditions, and legal requirements of doing research with
human participants. In most locations, these regulations are referred to as Human Sub-
jects Research (HSR) regulations. As a note for writing about research, the convention is
to refer to humans as research participants (not subjects). Humans willingly participate
in research; they are not subjected to research. Animals are often referred to as subjects,
though, and the regulations date from a time when “subjects” was the common term for
humans as well.

Historical considerations
Entire books have been written on the historical context for modern research ethics
regulations. Here, we briefly describe a few key events that led to the system of regula-
tion currently in place in the United States. Of course, other nations have a history that
overlaps with and diverges from that of the United States, but many of the same events
shaped thinking about ethics regulations in many places. One moment often identified
as a key historical marker in research ethics is the Nuremberg Trials that followed World
War II. While these trials are best known as the trials in which Nazi leaders were con-
victed of war crimes, the tribunal also took up the question of research. In Nazi Germany
and occupied territories, doctors and researchers employed by the government carried
out gruesome and inhumane experiments on unwilling subjects, many of whom were
also in marginalized groups (such as Jewish people, LGBTQ people, and Romani peo-
ple). What emerged from the tribunals was a general condemnation of such work but not
much in the way of specific research regulations.
In the United States, the key moment in driving the current systems of regulation
was the U.S. Public Health Service (PHS) Tuskegee Syphilis Study. In the current U.S.
government, the PHS includes agencies like the National Institutes of Health (NIH)
and Centers for Disease Control and Prevention (CDC), plus multiple other parts of
the Department of Health and Human Services. Beginning in 1932, the PHS began a
study of Black men in Tuskegee, Alabama, who were infected with syphilis (Centers for
Disease Control and Prevention, n.d.). At the time, there was no known cure for syph-
ilis and few protective measures. The PHS set out to observe the course of the disease in these men through to death. An important note is that all men in the study were infected
with syphilis before being enrolled in the study (the PHS did not actively infect men in
Tuskegee with syphilis, though the PHS did actively infect people in Guatemala in the 1940s in studies that only became known to the public decades later). Tuskegee was selected as a site
for the study because it was very remote, very poor, and, in segregated Alabama, entirely
Black. PHS officials believed the site was isolated enough both physically and socially to
allow the study to go on without being discovered or interrupted. Shortly after the study
began, penicillin became available as a treatment, and it was extremely effective in treat-
ing syphilis. By 1943, it was widely available. However, the men enrolled in the Tuskegee
study were neither informed of the existence of penicillin nor treated with the antibiotic.
The PHS Tuskegee Syphilis Study continued for 40 years, finally ending in 1972 after
a whistleblower brought the study to light. The study had long-term ramifications for
medical mistrust among Black populations in the United States, especially in the South
(Hagen, 2005). Those continued effects of the study are associated with lower treatment
seeking and treatment adherence among Black patients in Alabama, for example (Ken-
nedy, Mathis, & Woods, 2007). In 1979, the Belmont Report was issued, leading directly
to the current system of ethical regulations in place in the United States, and we will see
clearly how that study is directly tied to the elements of the Belmont Report.

The Belmont Report


The Belmont Report was issued by the National Commission for the Protection of
Human Subjects of Biomedical and Behavioral Research in 1979 (U.S. Department of
Health and Human Services, n.d.). The commission was created by the U.S. Congress in
1974 in large part as a response to the PHS Tuskegee Syphilis Study. It outlined the broad
principles for conducting ethical research with human subjects. Its principles formed the
core of the research regulations in the United States and included: Respect for Persons,
Beneficence, and Justice.

Respect for persons


The first principle of the Belmont Report is respect for persons. Key to this principle is
that human beings are capable of and entitled to autonomy. That is, they are free to make
their own decisions about what to do, what happens to them, and how information about
them is used. As part of that recognition of autonomy, research must involve informed
consent. We will come back to the components of informed consent later in this chapter,
but broadly speaking it means that people should freely decide whether or not to partic-
ipate in research and that in order for that to happen people need adequate information
on which to decide their participation. The connection to Tuskegee is clear—those par-
ticipants consented but did so without adequate information. In fact, vital information
was withheld from Tuskegee participants. Respect for persons also requires that partici-
pants be free to withdraw from a study at any time and that their rights (including legal
rights) are respected at all times.

Beneficence
The principle of beneficence requires the minimization of potential harms and the max-
imization of potential benefits to participants. Put simply—this principle suggests that
participants ought to exit a study at least as well off as they entered it. Researchers should
not engage in activities, conduct, or methods that harm participants. Note that benef-
icence is about harm and benefits to participants, not broader society. This principle
connects to the Tuskegee study in that those researchers judged the harm to participants
as justified by the potential benefit to society. The Belmont Report makes clear that rea-
soning is not appropriate, and the welfare of individual participants must be the key
consideration. This principle requires researchers to think about how to reduce risks
to participants and maximize benefits. We will discuss both risks and benefits in more
detail later in this chapter.

Justice
Finally, the principle of justice has to do with who should bear the burdens of participat-
ing in research in relation to who stands to benefit from research. In the Tuskegee study,
Tuskegee was not selected because the population there stood to benefit more than other
areas from any potential findings. Instead, researchers selected Tuskegee as a site for
the study because it was remote and largely isolated, and because its residents were low
income and Black. That meant that it was unlikely that participants would seek or receive
any outside medical treatment, and it also meant the researchers could operate with little
or no scrutiny. This is an unjust rationale for selecting participants. Doing research with
marginalized or vulnerable communities should be limited to those cases where those
communities will benefit from the results of research. In a related issue, it also means
that research with captive groups like prisoners should only be done if the research is
about their captivity. There is another side to this question, too, because there is a history
of some fields of research having almost exclusively White, wealthy, or male partici-
pants. This is particularly problematic in fields like medical research, where treatments
might affect different groups of people in different and sometimes contradictory ways.
However, federal guidance, relying on the principle of justice, requires the adequate rep-
resentation of women and people of Color in research.

The common federal rule


These three broad principles are explained in the Belmont Report but did not have the
force of law. Following the completion of the Belmont Report, federal agencies wrote
regulations to enforce the three major principles. These became encoded in the Com-
mon Federal Rule. Called the common rule because it is a set of regulations common
to all federal agencies, 45 CFR part 46 is the federal regulation governing all federally
funded human subjects research except for medical trials. Because of the differences
between medical and social/behavioral research, there is a different common rule for
medical research (21 CFR part 56). Research other than medical research is overseen by
the Department of Health and Human Services’ (DHHS) Office of Human Research Pro-
tections (OHRP). Medical research is overseen by the Food and Drug Administration
(FDA). The two sets of regulations share much in common, but the FDA rule has more
specific guidance for clinical trials, medical devices, and drugs.
Although the common rule technically applies to federally funded research, in prac-
tice the rule applies to virtually all research conducted in the United States or by U.S.
researchers. Most institutions, research centers, and universities in the United States
have agreements known as federal-wide assurances in which they agree to subject all
research to the same scrutiny, regardless of funding source. So, in practice, it is usually
safe to assume that all research done in the United States or by researchers based in the
United States will be subject to the Common Federal Rule. Below, we briefly outline the
basic components of these regulations as they apply to social and behavioral research.

Informed consent
Human research participants must provide informed consent to participate in research.
This means both that participants must consent to their participation in research, and
they must do so with adequate information to decide on their participation. This relates
to the principle of respect for persons. In general, participants should be informed
about the purposes of research, the procedures used in the study, any risks they might
encounter, benefits they will receive, the compensation they will receive, information on
who is conducting the study, and contact information in case of questions or problems.
Informed consent documents cannot contain any language that suggests participants are
waiving any rights or liability claims—participants always retain all of their human and
legal rights regardless of the study design. In most cases, consent is documented through
the use of an informed consent form, typically signed by both the researcher and the
participant. However, signing the form is not sufficient for informed consent. Informed
consent involves, ideally, dialogue in which the researcher explains the information and
the participant is free to ask questions or seek clarification, after which they may give
their consent.
In some cases, documenting consent through the use of a signed form is not appro-
priate, in which case a waiver of documentation might be issued. In situations where the
only or primary risk to participants is that their participation might become known (a
loss of confidentiality) and the only record linking them to the study is the signed con-
sent form, a waiver of documentation might be appropriate. In that case, participants
receive an information letter about the study, which contains all the elements of a consent
form but does not require the participant's signature.
One important note for people who do research involving children is that children
cannot consent to participate in research. Instead, their parent or legal guardian con-
sents to their participation in the research, and the child assents to participation. This
additional layer of protection (in requiring parental consent) is because children are con-
sidered as having a diminished capacity for consent. There are some scenarios in which
parental consent might also be waived, such as research on typical classroom practices or
educational tests that do not present more than minimal risk. Children are not the only
group regulations define as having diminished capacity for consent. Prisoners also have
special protections in the regulations because of the strong coercive power to which they
may be subjected. Research involving prisoners must meet many additional criteria, but
the research must be related to the conditions of imprisonment and must be impractical
to do without the participation of current prisoners.

Explanation of risks
An important component of informed consent is the explanation of risks. Participants
must be aware of the risks they could reasonably encounter during the study. A lot of
educational and behavioral research carries very little risk. The standard for what defines
a risk is whether the risks involved in the study exceed those encountered in daily life.
Studies whose risks do not exceed those encountered in daily life are described as presenting no
more than minimal risk. In other cases, there are real risks. Common in educational and social research
are risks like the risk of a loss of confidentiality (the risk that people will find out what
a participant wrote in their survey, for example), discomfort or psychological distress
(for example, the experience of anxiety on answering questions about past trauma), and
occasionally physical discomfort or pain (for example, the risk of experiencing pain in a
study that involves exercise). There is a range of other risks that might occur depending
on the type of research, like risks associated with blood collection, or electrographic
measurement.
Important in consideration of those risks, and whether they are acceptable, is the
principle of beneficence. Risks must be balanced by benefits to participants. In studies
involving no more than minimal risk, there is no need for a benefit to offset risks. How-
ever, in research where the risks are higher, the benefits need to match the risk level. In
an extreme example, there are medical trials where a possible risk is death from side
effects of treatment for a fatal disease. However, the benefit might be the possibility of
curing the disease or extending life substantially. In such cases, the benefit might be
deemed to exceed the risk. In most educational research, the risks are not nearly so high,
but when they are more than minimal, benefits must outweigh the risks.

Deception
Before we move on to discuss benefits and compensation, we briefly pause to discuss the
issue of deception. Informed consent requires participants to be aware of the purposes
and procedures of research before participation and that they freely consent to the study.
However, there are cases where a study cannot be carried out if participants fully know
the purposes of the research. For example, if a study aims to examine the conditions
under which people obey authority figures, if they explain that purpose to participants,
the study might be spoiled. Participants who know they are being evaluated for obedi-
ence might be more likely to defy instructions, for example. There are, then, occasions
where deception is allowed. The first criteria are that the study cannot be carried out
without deception. The scope of the deception must be limited to the smallest possible
extent that will allow the study to proceed. Finally, the risks associated with the decep-
tion must be outweighed by benefits to participants. Deception always increases the risks
to participants, if only for the distress that being deceived can cause. In most cases, a
study involving deception must also provide a debriefing—a session or letter in which
participants are fully informed of the actual purposes and procedures after the study.
Deception in educational research is rare, though the regulations do allow it under very
limited circumstances.

Benefits and compensation


Benefits and compensation are two very different things. Benefits are things participants
gain by being in the study. In the above example about medical treatment, the benefit
might be a reduction of symptoms or being cured of a condition. In educational research,
benefits might be things like improved curricula or an increased sense of community.
Benefits must be real and likely to occur for individual participants. Some studies have
no known direct benefits for participants. The outcome of the study might be unlikely
to benefit participants directly, but will advance the state of knowledge on some topic.
Such studies are still acceptable under the principle of beneficence so long as the study
presents no more than minimal risk.
Another element of many studies is compensation. There is no requirement in any
regulations that a study offer compensation, but it is often included to improve recruit-
ment efforts or to engage in reciprocity with participants. Compensation often takes the
form of monetary payments (e.g., $5 for taking a survey, or entry into a drawing for a
$100 gift card for participating in a study). Compensation can also involve an exchange
of goods or services (e.g., entry into a drawing for a video game system, or a pass for free
gym access). In some cases, compensation might also take the form of academic credit,
such as gaining extra credit in a course for research participation. Course credit is often
trickier because compensation must be equal for all participants, and courses often have
very different grading systems. Moreover, typically, any offer of course credit must be
matched with an alternative way to earn that course credit to avoid coercion. We will
discuss compensation more in the coming chapters as it relates to sampling strategies,
but compensation (or incentives) is allowed, so long as the amount is in line with the
requirements of the study.

Confidentiality and anonymity


In the vast majority of cases, data gathered from human participants must be treated
with strict confidentiality. That means that researchers take reasonable steps to secure
the data, like storing the data in a secure location, storing them on an encrypted drive,
and transmitting them via secure means. It also means that researchers take special care
to protect identifiable information like names, ID numbers, ZIP or postal codes, IP
addresses, and other potentially identifiable information. In some uncommon cases,
researchers might not be able to guarantee confidentiality, perhaps because of the nature
of the methods (for example, group interviews, where researchers cannot guarantee that
all participants in the room will maintain confidentiality), the nature of the participants
(like interviews with school superintendents where it might be difficult to mask their
identities adequately), or other factors. In those cases, participants should be informed
of the risk of a loss of confidentiality, and benefits should outweigh that risk. However,
in any case where it is possible to do so, researchers must maintain the confidentiality of
their data.
An additional layer of protection for participants’ identities is anonymity. Anonymity
means that even the researcher does not know the identity of the participants. It would
be impossible for the researcher or anyone else to determine who had participated in the
study. This means the researcher has collected no potentially identifying information.
Anonymity is often possible in survey-based research, where participants’ entire partici-
pation might occur online via an anonymous link. In other kinds of research, anonymity
might not be possible. In online research, one important setting to check is whether your
survey software collects IP addresses by default, as those data are personally identifiable.
Most survey systems allow researchers to disable IP address tracking so that data can
be treated as anonymous. Anonymity lowers the risk to participants because even if the
data were to be breached or accidentally exposed, the identity of participants would still
not be known.

Institutional Review Board processes


In the United States, research is reviewed by a group known as the Institutional Review
Board (IRB). IRBs are typically located within an institution, like a university, though
sometimes an institution might rely on an external IRB. IRBs are usually comprised
mostly of researchers, though regulations do require a community member representa-
tive and other non-researcher representatives for certain kinds of study proposals. IRBs
are diverse and differ somewhat from institution to institution. Their specific procedures
will also vary, though all will comply with the Common Federal Rule. Because of this
variation, researchers should always consult their local IRB information before propos-
ing and conducting a study. In general, though, most IRBs follow a similar process.
First, researchers must design a study and describe that study design in detail. Most
IRBs provide a form or questionnaire to guide researchers in describing their study.
Typically, those forms ask for details about the study purpose, design, and who will be
conducting the research. They will also ask about risks and benefits, as well as compensa-
tion. IRBs typically require researchers to attach copies of recruitment materials, consent
documents, and study materials to the application so that reviewers can evaluate the appro-
priateness of those documents. IRB review falls into one of three categories: exempt,
expedited, and full board. There is much variation in how different IRBs handle those
categories, but typically exempt proposals are reviewed most quickly. Exemptions can
fall in one of several categories, but are usually no more than minimal risk and involve
anonymous data collection. Expedited applications are often reviewed more slowly than
exempt because they require a higher-level review than exempt applications. There are
multiple categories of expedited review in the Common Federal Rule as well, but often
school-based research can qualify as expedited, depending on the specifics of the study.
Finally, full-board reviews will be reviewed by an entire IRB membership at their regu-
lar meetings. Most IRBs meet once per month and will usually require several weeks of
notice to review a proposal. As a result, the full-board review can take several months.
Regardless of the level of review, it is very common for the IRB to request revisions to the
initial proposal to ensure full compliance with all regulations. When planning a study,
it is a good idea to plan in time for the initial review and one or two rounds of revision,
at a minimum.
We have avoided being overly specific about the IRB process because of how much it
varies across institutions. However, when planning a study, talk with people at your insti-
tution about the IRB process. Read your local IRB website or other documentation, and
always use their forms and guidance in designing a study. Once your IRB approves the
study, recruiting can begin. Researchers must follow the procedures they outlined in their
IRB application exactly. Any deviations from the approved procedures can result in sanc-
tions from the IRB, which can be quite serious. However, in the event a change to the
procedures is necessary, IRBs also have a process for requesting a modification to the orig-
inally approved procedures. In most institutions, modifications are reviewed quite quickly.

CONCLUSION
In this chapter, we discussed a range of basic issues in thinking about educational
research. We do not intend this chapter to be an exhaustive treatment of any of these
issues but to serve as an overview of the range of considerations in educational research.
We encourage students who feel less familiar or comfortable with these topics to seek
out more information on them. Questions of methodology, epistemology, and ethics
can be big and involve many considerations. We have recommended source materials
for several of these topics to allow further exploration. For more on research ethics in
your setting, consult local regulations and guidance. The purpose of this textbook is to
provide instruction on quantitative educational research, and in the next chapter, we will
begin exploring basics in educational statistics.
2
Sampling and basic issues in research design

Sampling issues: populations and samples
Sampling strategies
Random sampling
Representative (quota) sampling
Snowball sampling
Purposive sampling
Convenience sampling
Sampling bias
Self-selection bias
Exclusion bias
Attrition bias
Generalizability and sampling adequacy
Levels of measurement
Nominal
Ordinal
Interval
Ratio
A special case: Likert-type scales
Basic issues in research design
Operational definitions
Random assignment
Experimental vs. correlational research
Basic measurement concepts
Conclusion

In the previous chapter, we discussed a number of basic issues in quantitative research,


the history and current state of research ethics, and the process of selecting and narrow-
ing research questions. In this chapter, we will review some concepts related to designing
quantitative research. We will begin with an overview of sampling issues and sampling
strategies, and then we will discuss basic concepts in research design, including defini-
tions and terminology.

SAMPLING ISSUES: POPULATIONS AND SAMPLES


Often, quantitative researchers have a goal of generalizing their results. That means
that many researchers hope what they find within their study will apply to people beyond their sample.
While this is not always a goal of quantitative research, it is common for researchers to
attempt to generalize the findings from their sample to the population, or at least to some
larger group than their sample. Because of that goal, sampling strategy and adequacy are
important in designing quantitative research.

Sampling strategies
When researchers design a study, they have to recruit a sample of participants. Those
participants are sampled from the population. First, researchers must define the pop-
ulation from which they wish to sample. Populations can be quite large, sometimes in
the millions of people. For example, imagine a researcher is interested in students’ tran-
sitions during the first year of college. There are almost 20 million college students in
the United States alone (National Center for Education Statistics, 2018). As many as
4 million of those students might be in their first year. If a researcher were to sample
250 first-year students, how likely is it the results from those 250 might generalize to the
4 million? That depends on the sampling strategy.

Random sampling
In an ideal situation, researchers would use random sampling. In random sampling,
every member of the population has an equal probability of being in the study sample.
Imagine we can get a list of all first-year college students in the United States, complete
with contact information. We could randomly draw 250 names from that list and ask
them to participate in our study. We are likely to find that some of those students will not
respond to our invitation. Others might decline to participate. Still others might start the
survey but drop out partway through it. Therefore, even with a perfect sampling strategy
designed to produce a random sample, we still face a number of barriers that make a true
random sample almost impossible. Of course, the other problem with this scenario is
that it is nearly impossible to get a list of everyone in the population. Random sampling
is impractical in research with human participants, but there are several other strategies
that researchers commonly use.
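To make the mechanics concrete, here is a minimal Python sketch of a simple random draw, assuming (unrealistically, as noted above) that a complete sampling frame exists; the index numbers are hypothetical stand-ins for student records.

import random

# A sketch only: index numbers 0 through 3,999,999 stand in for a complete
# sampling frame of first-year students, which rarely exists in practice.
random.seed(42)                               # make the draw reproducible
sampled_ids = random.sample(range(4_000_000), k=250)

# Every index had the same probability (250 / 4,000,000) of being drawn.
print(len(sampled_ids), sampled_ids[:5])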

Representative (quota) sampling


The quota sampling method selects samples based on exact numbers, or quotas, of individ-
uals or groups with varying characteristics (Gay, Mills, & Airasian, 2016). We also call
this sampling method representative sampling. Many U.S. federal datasets are collected
using quota or representative sampling. In this sampling strategy, researchers set targets,
or quotas, for people with specific characteristics. Often, those characteristics are demo-
graphic variables. For example, some federal education datasets in the United States use
the census to determine sampling quotas for the combination of race, sex as assigned at
birth, and location. These quotas are usually set to be representative in each demographic
category. For example, the U.S. Census showed that Black women comprised 12.85% of
the population of Alabama (U.S. Census Bureau, 2017). A researcher with a goal of 1,000
participants from Alabama might set a quota of 129 Black women for that sample (taking
the population percentage and multiplying by the target sample size). The researcher
would then intentionally seek out Black women until 129 were enrolled in the study. The
researcher would set quotas for every demographic category and then engage in targeted
recruiting of that group until the quota was met. The end result is a sample that matches
the population very closely in demographic categories. However, the process of produc-
ing that representative sample involved many targeted recruiting efforts, which might
introduce sampling bias. Even so, this method is widely used to produce samples that
approach representativeness, especially in large-scale and large-budget survey research.
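The quota arithmetic is simple enough to sketch in a few lines of Python. Only the Black women category from the example above is filled in, and rounding 128.5 up to 129 is our assumption about how the example arrives at its quota; a real design would list every demographic category.

import math

# Illustrative only: the 12.85% share is the Alabama figure cited above.
target_n = 1_000
population_shares = {"Black women": 0.1285}

quotas = {group: math.ceil(share * target_n)      # 128.5 rounds up to 129
          for group, share in population_shares.items()}
print(quotas)                                     # {'Black women': 129}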

Snowball sampling
Another method for reaching a population that is not easily accessible is
snowball sampling. Examples might include members of a secretive group, people with a
stigmatized health issue, or members of a group subject to legal restrictions or targeted by
law enforcement. A snowball sample begins with the researcher identifying a small num-
ber of participants to directly recruit. That initial recruiting might involve relationship and
trust building work as well. For example, if a researcher was interested in surveying undoc-
umented immigrants, they might find this population difficult to directly reach because of
legal and social factors. So, the researcher might need to invest in building relationships
with a small number or local group of undocumented immigrants. In that example, it would
be important for the researcher to build some genuine, authentic relationships and to prove
that they are trustworthy. Participants might be skeptical of a researcher in this circum-
stance, wondering about how disclosing their documentation status to a researcher might
impact their legal or social situation. It would be important for the researcher to prove they
are a safe person to talk to. After initial recruiting in a snowball sample, participants are
asked to recruit other individuals that qualify for the study. This is useful because, in some
circumstances, individuals who are in a particular social or demographic group might be
more likely to know of other people in that same group. It can also be useful because, if
the researcher has done the work of building relationships and trust, participants may be
comfortable vouching for the researcher with other potential participants. This approach
is used in quantitative and qualitative research. One drawback to snowball sampling is that it
tends to produce very homogeneous samples. Because the recruiting or sampling effort is
happening entirely through social contacts, the participants who enroll in the study tend
to be very similar in sociodemographic factors. In some cases, that similarity is acceptable,
but this only works when the criteria for inclusion in the study are relatively narrow.

Purposive sampling
In this sampling method, a researcher selects the sample using their experience and
knowledge of the target population or group. This sampling method is also termed judg-
ment sampling. For example, if we are interested in a study of gifted students in middle
schools, we can select schools to study based on our knowledge of schools that serve gifted students. Thus,
we rely on prior knowledge to select the schools that meet specific criteria, such as pro-
portion of students who go into high school and take advanced placement (AP) courses
and proportion of teachers with advanced degrees. This is also sometimes referred to as
targeted sampling, because the researcher is targeting very particular groups of people,
rather than engaging in a broader sampling approach or call for participants.

Convenience sampling
Probably the most common sampling method is convenience sampling. However,
though it is common, it is also one of the more problematic approaches. Convenience
sampling is, as the name implies, a sampling method where the researcher selects par-
ticipants who are convenient to them. For example, a faculty member might survey stu-
dents in a colleague’s classes. In fact, a problem in some of the published research is that
many samples are composed entirely of first-year students at research universities in
large lecture classes. Those samples are convenient for many faculty members. They may
be able to gain several hundred responses from a single class section without leaving
their building. So, the appeal is clear—convenience samples are quicker, easier, and less
costly to obtain. However, these samples are usually heavily biased, meaning the findings
from such a sample are unlikely to generalize to other groups. These samples are not
representative samples. There is a place for convenience samples, but researchers should
carefully consider whether the convenience and ease of access are worth the cost to the
trustworthiness of the data and carefully evaluate sampling bias.

SAMPLING BIAS
When samples are not random (which they pretty much never are), researchers must
consider the extent to which their sample might be biased. Sampling bias describes the
ways in which a sample might be divergent from the population. As we have alluded
to already in this chapter, researchers often aim for representativeness in their samples
so that they can generalize their results. Sampling bias is, in a sense, the opposite of
representativeness. The more biased a sample, the less representative it is. The less repre-
sentative the sample, the less researchers are able to generalize the results outside of the
sample. Here, we briefly review some of the types of sampling bias to give a sense of the
kinds of concerns to think through in designing a sampling strategy.

Self-selection bias
Humans participate in research voluntarily. But most people invited to participate in a
study will decline. Researchers often think about a response rate of around 15% as being
relatively good. But that would mean 85% of people invited to respond did not. In other
words, compared to the population, people who volunteer for research might be con-
sidered unusual. Perhaps there are some characteristics that volunteers have in common
that differ from non-volunteers. In other words, the fact that participants self-select to
participate in studies means that their results might not generalize to non-volunteers.
This bias is especially pronounced when the topic of the study is relevant to volunteer
characteristics. For example, customer satisfaction surveys tend to accumulate responses
from people who either had a horrible experience or an amazing experience—people
without strong feelings about the experience as a customer are less likely to respond. The
fact that people whose experience was neither horrible nor wonderful are less likely to
respond biases the results. In another example, if a researcher is studying procrastina-
tion, they might miss out on participants who procrastinate at high levels because they
might never get around to filling out the survey. Self-selection is always a concern, but
particularly when the likelihood to participate is related to factors being measured in
the study.

Exclusion bias
Researchers always set inclusion and exclusion criteria for a given sample. For example,
a researcher might limit their study to current students or to certified teachers. Setting
those criteria is important and necessary. But sometimes the nature of the exclusion cri-
teria can exclude participants in ways that bias the results. For example, many researchers
studying college students will exclude children from their samples. They do so for reasons
related to ethical regulations (specifically, to avoid seeking parental consent) that would
make the study more difficult to complete. However, it may be that college students who
are not yet adults (say, a 17-year-old first-year student) might have perspectives and expe-
riences that are quite different from other students. Those perspectives get lost through
excluding children and can bias the results. It might make sense to accept that limitation,
that the results wouldn’t generalize to students who enroll in college prior to 18 years of
age, but researchers should consider the ways that exclusion criteria might bias results.

Attrition bias
Attrition bias is a result of participants leaving a study partway through. Most frequently,
this happens in longitudinal research, where participants might drop out of a study after
the first part of the study, before later follow-up measures are completed. In some cases,
this happens because participants cannot continue to commit their time to being part
of the study. In other cases, it might happen because participants move away, no longer
meet inclusion criteria, or become unavailable due to illness or death. For example, in
longitudinal school-based research, researchers might follow students across multiple
years. Students might move out of the school district over time, and this might be more
likely for some groups of students than others. Those students who move away cannot
be included in the analysis of change across years, but likely share some characteristics
that are also related to their leaving the study. In other words, the loss of those data via
attrition biases the results.
Another way that attrition can happen is via participants dropping out of a survey
partway through completing it. Perhaps the survey was longer than the participant
expected, or something suddenly came up, but the participant has chosen not to finish
participating in a single-time measurement. This is most common in survey research,
where participants might give up on the survey because they found it too long. It may
be that the participants who stopped halfway through share characteristics that both led
them to leave the study and were relevant to the study outcomes. Again, in this case, the
loss of those participants may bias the results.

GENERALIZABILITY AND SAMPLING ADEQUACY


As we have alluded to so far in this chapter, one of the reasons that sampling and sampling
bias are important is generalizability. Usually, when a researcher conducts a
quantitative study, they hope to have results that mean something for the population. In
other words, researchers usually study samples to find things out about the population.
When samples are too biased or too unrepresentative, the results may not generalize
at all. That is, in a very biased sample, the results might only apply to that sample and
be unlikely to ever occur in any other group. Generalizability, then, is often a goal of
quantitative work. Very few samples’ results would generalize to the entire population,
but researchers should think about how far their results might generalize. One way to
assess the generalizability of results is to evaluate sampling biases.
Another issue in generalizability is related to sample size. The number of people in a
sample affects multiple layers of quantitative analysis, including factors we will come
to in future chapters like normality and homogeneity of variance. But the sample size
also impacts generalizability. Very small samples are much less likely to be representative
of the population. Even by pure chance in a random sample, smaller samples are more
likely to be biased. As the sample size increases, it will likely become more representative.
In fact, as the sample size increases, it gets closer and closer to the size of the population.
As a general rule, there are some minimum sample sizes in quantitative research. We’ll
return to these norms in future chapters. Most of our examples in this text will involve
very small, imaginary samples to make it easier to track how the analyses work. But
in general samples should have at least 30 people for a correlational or within-subjects
design. When comparing two or more groups, the minimum should be at least 30 peo-
ple per group (Gay et al., 2016). These are considered to be minimum sample sizes, and
much larger samples might be appropriate in many cases, especially where there are mul-
tiple variables under analysis or the differences are likely to be small (Borg & Gall, 1979).

LEVELS OF MEASUREMENT
The data we gather can be measured at several different levels. In the most basic sense, we
think of variables as being either categorical or continuous. Categorical variables place
people into groups, which might be groups with no meaningful order or groups that have
a rank order to them. Continuous variables measure a quantity or amount, rather than a
category. There are two types of categorical variables: nominal and ordinal. Likewise, there
are two types of continuous variables: interval and ratio. For the purposes of the analyses
discussed in this book, differentiating between interval and ratio data will not be impor-
tant. However, below we introduce each level of measurement and provide some examples.

Nominal
Nominal data involve named categories. Nominal data cannot be meaningfully ordered.
That is, they are categorical data with no meaningful numeric or rank-ordered values.
For example, we might categorize participants based on things like gender, city of res-
idence, race, or academic program. These categories do not have meaningful ordering
or numbering within them—they are simply ways of categorizing participants. It is also
important to note that all of these categories are also relatively arbitrary and rely on social
constructions. Nominal data will often be coded numerically, even though the numbers
assigned to each group are also arbitrary. For example, in collecting student gender, we
might set 1 = woman, 2 = man, 3 = nonbinary/genderqueer, 4 = an option not included
in this list. There is no real logic to which group we assign the label of 1, 2, 3, or 4. In fact,
it would make no difference if instead we labelled these groups 24, 85, 129, and 72. The
numeric label simply marks which group someone is in; it has no actual mathematical
or ranking value. However, we will usually code groups numerically because software
programs, such as jamovi, cannot analyze text data easily. So, we code group membership
with numeric codes to make it easier to analyze later on. In another example, researchers
in the United States often use racial categories that align to the federal Census categories.
They do so in order to be able to compare their samples to the population for some region
or even the entire country. So, they might code race as 1 = Black/African American,
2 = Asian American/Pacific Islander, 3 = Native American/Alaskan Native, 4 = Hispanic/
Latinx, 5 = White. Again, the numbering of these categories is completely arbitrary and
carries no real meaning. They could be numbered in any order and accomplish the same
goal. Also notice that, although these racial categories are widely used, they are also prob-
lematic and leave many racial and ethnic groups out altogether. For most of the analyses
covered in this text, nominal variables will be used to group participants in order to
compare group means. Another example of a nominal variable would be experimental
groups, where we might have 1 = experimental condition and 0 = control condition.
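As a small illustration of how arbitrary these codes are, the Python sketch below applies the gender coding from the example above to a few hypothetical responses; any other set of distinct numbers would label the groups just as well.

# Illustrative only: labels and code numbers mirror the gender example above.
gender_codes = {
    "woman": 1,
    "man": 2,
    "nonbinary/genderqueer": 3,
    "option not listed": 4,
}

responses = ["man", "woman", "woman", "nonbinary/genderqueer"]
coded = [gender_codes[r] for r in responses]
print(coded)        # [2, 1, 1, 3] -- group labels only, with no order or magnitude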

Ordinal
Ordinal variables also involve categories, but categories that can be meaningfully ranked.
For example, we might label 1 = first year, 2 = sophomore year, 3 = junior year, 4 = senior
year to categorize students by academic classification. The numbers here are meaningful
and represent a rank order based on seniority. Letter grades might be labelled as 1 = A, 2 = B,
3 = C, 4 = D, 5 = F. The numbering again represents a rank order. Grades of A are considered
best, B are considered next best, and so on. Other examples of ordinal data might include
things like class rank, order finishing a race, or academic ranks (i.e., assistant, associate, full
professor). The analyses in this book will not typically include ordinal data, but there are
constructs that exist as ordinal and other sets of analyses that are specific to ordinal data.

Interval
Interval data are continuous and measure a quantity. Interval data should have the same
interval or distance between levels. For example, if we measure temperature in Fahrenheit,
the difference between 50 and 60 degrees is the same as the difference between 60 and 70
degrees. Temperature is a measure of heat, and it’s worth noting that zero degrees does
not represent a complete absence of heat. In fact, many locations regularly experience
outdoor temperatures well below zero degrees. Interval data do not have a true, meaning-
ful absolute zero—zero represents an arbitrary value. Another characteristic of interval
data is that ratios between values may not be meaningful. For example, in comparing 45
degrees and 90 degrees Fahrenheit, 45 degrees would not be exactly half the amount of
heat of 90 degrees, even though 45 is half of 90. The distance between increments (in this
case, degrees) is the same, but because the scale does not start from a true, absolute zero,
the ratios are not meaningful. Other examples of interval-level data might include things
like scores on a psychological test, grade point averages, and many kinds of scaled scores.
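One way to see this concretely is to convert both Fahrenheit temperatures to Kelvin, an absolute scale with a true zero (a worked aside we add here for illustration):

\[
\frac{45^{\circ}\mathrm{F}}{90^{\circ}\mathrm{F}} \;=\; \frac{280.4\ \mathrm{K}}{305.4\ \mathrm{K}} \;\approx\; 0.92, \quad \text{not } 0.5.
\]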

Ratio
The difference between ratio and interval data can feel confusing and a bit slippery. Luck-
ily, for the purposes of analyses covered in this book, the difference won’t usually matter.
Most analyses—and all of the analyses in this text—will use either ratio or interval data,
making them largely interchangeable. But ratio data do have some characteristics that
set them apart from interval. The easiest to see is probably that ratio data have a true,
meaningful, absolute zero. That is, ratio data have a value of zero that represents the com-
plete lack of whatever is being measured. For example, if we measure distance a person
runs in a week, the answer might be zero for some participants. That would mean they
had a complete lack of running. Similarly, if we measure the percentage of students who
failed an exam, it might be that 0% failed the exam, representing a complete lack of stu-
dents who failed the exam. Those values of zero are meaningful and represent an absolute
absence of the variable being measured. Because ratio data have a true, meaningful, abso-
lute zero, the ratios between numbers become meaningful. For example, 25% is exactly
half as many as 50%. Other examples of ratio data include time on task, calories eaten,
distance driven, heart rate, some test scores, and anything reported as a percentage.

A special case: Likert-type scales


One of the most common measurement strategies in behavioral and educational research
is the Likert-type scale. This is familiar to most anyone who has ever seen a survey and
might look something like the following:

For each statement, indicate your level of agreement or disagreement using the pro-
vided scale:

I enjoy learning about quantitative analysis.

1 = Strongly Disagree
2 = Disagree
3 = Somewhat Disagree
4 = Neither Agree nor Disagree
5 = Somewhat Agree
6 = Agree
7 = Strongly Agree

Likert-type scales involve statements (or “stems”) to which participants are asked to
react using a scale. The most common Likert-type scales have seven response options
(as in the example above), but scales can have anywhere from three to ten response options.
The response options represent a gradient, such as from strongly agree to strongly dis-
agree. As such, those individual responses might be considered ordinal data. We could
certainly order these, and in some sense have done so with the numeric values assigned
to each label. When a participant responds to this item, we could think of that as them
self-assigning to a ranked category (for example, if they select 3, they are reporting they
belong to the category Somewhat Disagree that is ranked third). Some methodologists
feel quite strongly about calling all Likert-type data ordinal.
However, there is a complication for Likert-type data. Namely, we almost never use
a single Likert-type item in isolation. More typically, researchers average or sum a set of
multiple Likert-type items. For example, perhaps we asked participants a set of six items
about their enjoyment of quantitative coursework. We might report an average score of
those six items and call it something like “quantitative enjoyment.” That scaled score is
more difficult to understand as ordinal data. Perhaps a participant might average 3.67
across those six items. A score of 3.67 doesn’t correspond to any category on the Lik-
ert-type scale. The vast majority of researchers will choose to treat those average or total
scores as interval data (not ratio, in part because there is no possible score of zero on a
Likert-type scale). There is some disagreement about this among methodologists, but
most will treat average or total scores as interval, especially for the purposes of the analy-
ses covered in this book. In more advanced psychometric analyses, perhaps especially in
structural equation modelling and confirmatory factor analysis, the distinction becomes
more important and requires more thought and justification. But for the purposes of the
analyses in this text, it will be safe to treat total or average Likert-type data as interval.
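As a quick illustration of that scoring step, the Python sketch below averages six hypothetical responses on the 1-7 scale into a single scale score; the result, 3.67, matches no single response category.

# Illustrative only: six hypothetical responses on the 1-7 scale above.
items = [4, 3, 5, 3, 4, 3]
scale_score = sum(items) / len(items)
print(round(scale_score, 2))        # 3.67, which maps onto no single category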

BASIC ISSUES IN RESEARCH DESIGN


While many people using a book like this one will have some level of previous exposure
to research terminology and ideas, others will have no prior experience at all. In the next
sections, we briefly overview several key terms and ideas in research design that we will
use throughout this text. These definitions and descriptions are not exhaustive, but we
hope that they provide enough information so that readers will have some shared under-
standing of how we are using these ideas throughout the text.

Operational definitions
In designing a study, researchers will determine what variables they are interested in
measuring. For example, they might want to measure self-efficacy, student motivation,
academic achievement, psychological well-being, racial bias, heterosexism, or any num-
ber of other ideas. An important first step in designing good research is to carefully define
what those variables mean for the purpose of a given study. When researchers say, for
example, they want to measure motivation, they might mean any of several dozen things
by that. There are at least four major theories of human motivation, each of which might
have a dozen or more constructs within them. A researcher would need to carefully define
which theory of motivation they are mobilizing and which variables/constructs within
that theory they intend to measure. If a researcher wants to measure racial bias, they will
need to define exactly what they mean by racial bias and how they will differentiate vari-
ous aspects of what might be called bias (implicit bias, discrimination, racialized beliefs,
etc.). If a researcher wants to study academic achievement, they might select grade point
averages (which are very problematic measures due to variance from school to school
and teacher to teacher, along with grade inflation), standardized test scores like SAT or
ACT (which are problematic in that they show evidence of racial bias and bias based on
income), or a psychological instrument like the Wide-Range Achievement Test (WRAT,
which also shows some evidence of cultural bias). However, how the researcher defines the
variable and measures it will affect the nature of the results and what they mean. The way that
researchers define the variable or construct of interest is referred to as the operational
definition. It’s an operational definition because it may not be perfect or permanent, but
it is the definition from which the researcher is operating for a given project.
Part of operationally defining a variable involves deciding how it will be measured.
Many variables could be measured in multiple ways. In fact, for any given variable, there
might be dozens of different measures in common use in the research literature. Each
will differ in how the variable is defined, what kinds of questions are asked, and how
the ideas are conceptualized. Researchers sometimes write about variables and measures
as if they were interchangeable. They might include statements like,
“Self-efficacy was higher in the experimental group,” when what they actually mean is
that a particular measure for self-efficacy in a particular moment was higher for the
experimental group. As we advocate later in this chapter, most researchers will be well
served to select existing measures for their variables. But the selection of a way to meas-
ure a variable is a part of, and should align with, the operational definition.

Random assignment
Another key term in research design is random assignment. In random assignment,
everyone in the study sample has an equal probability of ending up in the various
experimental groups. For example, in a design where one group gets an experimental
treatment and the other group gets a placebo treatment, each participant would have a
50/50 chance of ending up in the experimental vs. control group. This is accomplished
by randomly assigning participants to groups. In many modern studies, the random
assignment is done by software programs, some of which are built into online survey
platforms. Random assignment might also be done by drawing lots or by placing participants
in groups by the order they sign up for the study (e.g., putting even-numbered sign-ups in
group 1 and odd-numbered sign-ups in group 2).
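As an illustration rather than a prescription, the Python sketch below randomly assigns a hypothetical participant list to two equally sized groups by shuffling and splitting it; survey platforms and dedicated randomization tools automate an equivalent step.

import random

# Illustrative only: eight hypothetical participant IDs are shuffled and
# split in half, giving each person an equal chance of either condition.
participants = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]

random.seed(7)                      # reproducible assignment
random.shuffle(participants)
half = len(participants) // 2
groups = {"treatment": participants[:half], "control": participants[half:]}
print(groups)
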
Random assignment matters for the kinds of inferences a researcher can draw from
a given set of results. By randomly assigning participants to groups, theoretically their
background characteristics and other factors are also randomized to groups. So, the only
systematic difference between groups will be the treatment or conditions supplied by
group membership. As a result, the inferences can be stronger. We would feel more con-
fident that differences between groups are due to group membership (or experimental
treatment) when the groups were randomly assigned, because there are theoretically no
other systematic differences between the groups. When researchers use intact groups
(groups that are not or cannot be randomly assigned), the inferences will be somewhat
weaker. For example, if we compare academic achievement at School A, which uses com-
puterized mathematics instruction, vs. School B, which uses traditional mathematics
instruction, there might be lots of other differences between the two schools other than
whether they use computerized instruction. Perhaps School A also has a higher budget,
or students with greater access to resources, or more experienced teachers. It would be
harder, given these intact groups, to attribute the difference to instruction type than if
students were randomly assigned to instruction type.
Random assignment, though, is not sufficient to establish a causal claim (that a cer-
tain variable caused the outcome). Causal claims require robust evidence. For a causal
claim to be supported, there must be: (1) A theoretical rationale for why the potential
causal variable would cause the outcome; (2) The causal variable must precede the out-
come in time (which usually means a longitudinal design); (3) There must be a reliable
change in the outcome based on the potential causal variable; (4) All other potential
causal variables must be eliminated or controlled (Pedhazur, 1997). Random assignment
helps with criterion #4, but the others would also need to be met for a causal claim.
One distinction to be clear about, as it can be confusing for some students, is that
random assignment and random sampling (described earlier in this chapter) are two
separate processes that are not dependent on one another. Random sampling means
everyone in the population has an equal chance of being in the sample. Random assign-
ment means everyone in the sample has an equal chance of being in each group. They
both involve randomness but for separate parts of the process.

Experimental vs. correlational research


The key difference between experimental and correlational (or observational) research is
random assignment. Experimental research involves random assignment, whereas cor-
relational research does not. We have described some of the advantages of experimental
research in the kinds of inferences that can be made. Why, then, do researchers do cor-
relational work? The simple answer is that lots of variables researchers might be inter-
ested in either cannot or should not be randomly assigned. Some variables should not be
randomly assigned for ethical or legal reasons. If researchers already know or have strong evi-
dence to believe that a treatment would harm participants, they cannot randomly assign
them to that treatment. So, if a researcher wants to examine the effects of smoking tobacco
while pregnant on infant brain development, they cannot randomly assign some preg-
nant women to smoke tobacco, because it causes known harms. Instead, they would likely
study infants of women who smoked while pregnant before the study even began. Other
variables simply cannot be randomly assigned. If a researcher wants to study gender dif-
ferences in science, technology, engineering, and mathematics (STEM) degree attainment,
the researcher cannot randomly assign participants to gender identities. Although gender
identities may be fluid, they cannot be manipulated by the researcher. So, the researcher
will study existing gender identity groups. That is the only practical approach.
But people in different gender identities also have a whole range of other divergent experi-
ences. People are socialized differently based on perceived or self-identified gender identi-
ties, they receive different kinds of feedback from parents, peers, and educators, and might
be subjected to different kinds of STEM-related experiences. So, it would be difficult to
attribute differences in STEM degree attainment to gender, but researchers might try to
understand mechanisms that drive differences that occur along gendered lines.
Because many variables cannot or should not be randomly assigned, much of the work
in educational and behavioral research is correlational or observational. Causal infer-
ences are still possible, though somewhat harder than with experimental methods. Some
of the most important and influential work has been correlational. Our point here is that
experimental vs. correlational research is not a hierarchy—neither approach is “better,”
but they offer different strengths and opportunities and have different limitations.

Basic measurement concepts


Examining concepts of measurement and psychometric theory is beyond the scope of
this text. However, below, we briefly introduce several key concepts of which it is impor-
tant to have at least a superficial understanding. In selecting measures for a study design,
researchers should ensure they have thought about score reliability and the validity of the
use and interpretation of those test scores. For a more thorough but beginner-friendly
treatment of measurement and psychometrics, we recommend books such as Shultz,
Whitney, and Zickar (2013) and DeVellis (2016).

Score reliability
Reliability is essentially about score consistency (Thorndike & Thorndike-Christ, 2010).
There are different ways of thinking about the consistency of a test score, though. It
might be consistent across time, consistent within a time point, or consistent across
people. When researchers write about reliability, they are most commonly referring to
internal consistency reliability. Here, though, we briefly review several forms of score
reliability. First, it is important to know that reliability is not a property of tests, but of
scores. A test cannot be reliable, and it is always inappropriate to refer to a test as being
reliable (Thompson, 2002). Rather, test scores can be reliable and may be tested for reli-
ability in one of several ways.

Test–retest reliability
Test–retest reliability is a measure of consistency of test scores we obtain from participants
between two time points. We also refer to this form of reliability as the stability of our measure. The
correlation between the two scores is the estimate of the test–retest reliability coefficient.
To calculate this reliability estimate, we would give the same scale or test to the same
people on two occasions. The assumption here is that there is no change in the construct
we are measuring between the two occasions, so any differences in scores are attributed
to unreliability of the test or scale. However, this is not always an accurate assumption.
Some constructs are not stable across time. For example, if we measure students’ anxiety
the day before the midterm and the first day of their winter holiday break, we would not
expect to find similar scores. Anxiety is a construct that changes rapidly within people.
On the other hand, if we measured personality traits using a Big Five inventory, we would
expect to find very similar scores across time. Another issue to consider is practice effects
on repeated administrations. If we give a memory test to a participant and then again, a
week later, using the same set of items to memorize, the participant will likely do much
better the second time. This again is not really an issue of unreliability but of practice and
the fact that the participant continued to learn. As a result, not all scales or variables are
suitable for a test–retest reliability estimate. It only applies to constructs that are stable (that
don’t change much within a person over time) and that do not have strong practice
effects. This coefficient is reported on a zero to one scale, with numbers closer to one
being better. An acceptable test–retest reliability coefficient will vary based on the kind of
test and its application but might be .6 or higher in many cases.
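A minimal Python sketch of the calculation, using hypothetical scores for six participants measured on two occasions, might look like the following.

import numpy as np

# Illustrative only: scores for six hypothetical participants at two
# administrations of the same scale.
time1 = np.array([12, 15, 9, 20, 17, 14])
time2 = np.array([13, 14, 10, 19, 18, 13])

r_tt = np.corrcoef(time1, time2)[0, 1]     # Pearson correlation between occasions
print(round(r_tt, 2))                      # values near 1 indicate stable scores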

Alternate forms reliability


Sometimes researchers develop several forms of the same test to allow them to admin-
ister it several times without worrying about practice effects. Many neuropsychological
tests for things like memory or academic achievement tests have alternate forms. When
there are multiple forms of a test or scale, the natural question is how consistent (reliable)
the scores are across the forms of the test. This is assessed as alternate forms reliability.
To calculate this reliability coefficient, researchers administer multiple forms of the test
to the same participants and calculate the correlations among the scores. This, like test–
retest, only applies in specific situations and would be reported as a number between
zero and one, where higher numbers are better.

Internal consistency reliability


By far the most commonly reported form of reliability is internal consistency reliability.
This estimate is based on the consistency across the items that comprise a test or scale.
That is, if a test is supposed to measure, for example, mathematics achievement and
someone is low in mathematics achievement, they should do relatively poorly on most of
the items (show consistency across the items). So internal consistency reliability is based
on the correlation among the test items. In the past, researchers would calculate this by
randomly dividing the items on a test into two equal sets (split halves) and then calcu-
lating the correlation between those halves. This was called split halves reliability, and it
still shows up, though rarely, in published work. In modern research, most researchers
report a coefficient like coefficient alpha, more commonly known as Cronbach’s alpha.
This measure is better than split halves reliability because it is equivalent to the average
of all possible split halves. But it is based on the correlation among the test items. Often
reported simply as α, this coefficient ranges from zero to one, with higher numbers rep-
resenting higher internal consistency. Most researchers will consider an α of .7 or higher
acceptable, and .8 or higher good (DeVellis, 2016).
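Software such as jamovi can report coefficient alpha directly, but the calculation itself is straightforward. The Python sketch below implements the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores), for a small matrix of hypothetical participants and items.

import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha for a participants-by-items array of scores."""
    k = item_scores.shape[1]                          # number of items
    item_variances = item_scores.var(axis=0, ddof=1)  # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative only: 5 hypothetical participants responding to 4 items.
scores = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
])
print(round(cronbach_alpha(scores), 2))   # high here, since these items are very consistent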

Validity of the interpretation and use of test scores


Once a researcher has determined that test scores demonstrate reliability, they would
then assess whether the interpretation and use of those scores was valid. A simple way of
thinking about validity is that it is about the accuracy of the scores for certain purposes.
We explained earlier in this chapter that tests cannot be reliable because reliability is a
property of scores, not tests. Similarly, tests cannot be valid, because validity is a property
of the interpretation and use of scores. In other words: is the way researchers are think-
ing about and using test scores accurate? There are several ways that researchers might
evaluate the validity of their interpretation and use of scores.

Content validity
The first question for validity is about the content of the test or scale. Is that content
representative of the content the scale is meant to assess? For example, if a scale is meant
to measure depression, is the information on that scale representative of the types of
symptoms or experiences that depression includes? If it is meant to be a geometry test,
does the test include content from all relevant aspects of geometry? Researchers would
also be interested in demonstrating the test has no irrelevant items (such as ensuring
the geometry test does not include linear algebra questions). Usually, content validity is
assessed by ratings from subject matter experts who determine whether each item in a
test or scale is content-relevant, and whether there are any aspects or facets of content
that have not been included.

Criterion validity
Criterion validity basically asks whether the test or scale score acts like the variable it is
meant to represent. Does it correlate with scores that variable is supposed to correlate
with? Does it predict outcomes that variable should predict? Can it discriminate between
groups that variable would differentiate between? If a researcher is evaluating a depres-
sion scale, they might test whether it correlates with related mental health outcomes, like
anxiety, at rates they would expect. They might also test whether there are differences in
the test score between people that meet clinical criteria for a major depressive diagno-
sis and people that do not. They might test whether the scale scores predict outcomes
associated with depression like sense of self-worth or suicidal ideation. Because “depres-
sion” should act in those ways, the researcher would test if this scale, meant to measure
depression, acts in those ways. This is a way of determining whether the interpretation of this
score as an indicator of depression is valid. Because this is not a text on psychometric
theory, we will not go into detail further on establishing criterion validity. But it might
involve predictive validity, discriminant validity, convergent validity, and divergent
validity. These are various ways of assessing criterion validity.

Structural validity
Another way researchers evaluate validity issues is via structural validity. This form of
validity evidence asks whether the structure of a scale or test matches what is expected
for the construct or variable. In our example of a depression scale, many psychologists
theorize three main components to depression: affective, cognitive, and somatic. So, a
researcher might analyze a scale meant to measure depression to determine if these three
components (often called factors in such analyses) emerge. They might do this with anal-
ysis such as principal component analysis (PCA), exploratory factor analysis (EFA), or
confirmatory factor analysis (CFA). In each of those approaches, the basic question will
be whether the structure that emerges in the analysis matches the theoretical structure
of the variable being measured.

Construct validity
There are several other ways that researchers might evaluate validity. However, their
goal will be to make a claim of construct validity. Construct validity would mean that
the scale can validly be interpreted and used in the ways researchers claim—that the
claimed interpretation and use of the scores is valid. However, construct validity cannot
be directly assessed. Instead, researchers make arguments about construct validity based
on various other kinds of validity evidence. Often, when they have multiple forms of
strong validity evidence, such as those reviewed above, they will claim construct validity
based on that assembly of evidence.

Finding and using strong scales and tests


In designing a research study, it is important to use measures and tests that have strong
evidence of score reliability and validity of the interpretation and use of scores. In many
cases, developing a new scale is not necessary or wise. Often, there are existing scales
with strong evidence that can be used directly (with permission) or adapted to a new
setting (again, with permission from the original authors). One way to become aware of
the scales and measures in a given field is to read the published research. Of course, there
are lots of reasons to read the published research, as we’ve suggested elsewhere in this
text. But among those reasons is to identify measures that could be useful in designing
research and collecting data. When reading published research, it can be useful to note
which scales or measures researchers use for different constructs/variables. After a while
of keeping track of this, patterns will likely emerge. There may be two or three competing
measures that are used by the vast majority of researchers in a certain field. Of course,
just because it’s being used often doesn’t mean it’s a good scale, but it is a scale worth
looking into.
When considering a scale for use in a research project, look for evidence around reli-
ability and validity. What have prior researchers written about these factors? Are there
validity studies available? What is the range of reliability estimates in those published
papers? While a strong track record doesn’t guarantee future success, and score reliabil-
ity in particular is very sample dependent, if a score has a fairly consistent record of good
reliability and validity evidence, it will likely produce reliable scores that can be validly
interpreted in similar future studies.

CONCLUSION
In this chapter, we have introduced sampling, sampling bias, levels of measurement,
basic issues in research design, and very briefly introduced some measurement concepts.
These basic concepts are important to understand as we move forward into statistical
tests. We will return to many of these concepts over and over in future chapters to under-
stand how to apply various designs and statistical tests. In the next chapter, though, we
will introduce basic issues in educational statistics.
3
Basic educational statistics

Central tendency
Mean
Median
Mode
Comparing mean, median, and mode
Variability
Range
Variance
Standard deviation
Interpreting standard deviation
Visual displays of data
The normal distribution
Skew
Kurtosis
Other tests of normality
Standard scores
Calculating z-scores
Calculating percentiles from z
Calculating central tendency, variability, and normality estimates in jamovi
Conclusion
Note

In this chapter, we will discuss the concepts of basic statistics in education. We discuss
two types of statistics: central tendency and variability. We also describe ways to display
data visually. For some students, these concepts are very familiar. Perhaps a previous
research class or even a general education class included some of these concepts. How-
ever, for many graduate students, it may have been quite some time since their last course
with any kind of mathematical concepts. We will present each of these ideas assuming no
prior knowledge, and use these basic concepts as an opportunity to learn some statistical
notation as well. These concepts are foundational to our understanding, use of statistical
analyses, and making inferences from the results of our analyses. Many of these concepts
will be used in later chapters in this text and are foundational to all of the analyses we will
learn. We strongly recommend students ensure they are very familiar and comfortable
with the concepts in this chapter to set themselves up for success in the rest of this text.


CENTRAL TENDENCY
Measures of central tendency attempt to describe an entire sample (entire distribution)
with a single number. These are sometimes referred to as point estimates because they
use a single point in a distribution to represent the entire distribution. All of these esti-
mates attempt to find the center of the distribution, which is why they are called central
tendency estimates. We have multiple central tendency estimates because each of them
finds the center differently. Many of the test statistics we learn later in this text will test
for differences in central tendency estimates (for example, differences in the center of
two different groups of participants). The three central tendency estimates we will review
are the mean, median, and mode.

Mean
The most frequently used measure of central tendency is the mean. You likely learned
this concept at some point as an average. The mean is a central tendency estimate that
determines the middle of a distribution by balancing deviation from the mean on both
sides. Another way of thinking of the mean is that it is a sort of balance point. It might
not be in the literal middle of a distribution, but it is a balance point. One way to visualize
how the mean works is to think of a plank balanced on a point. In the below example,
there are two objects (or cases) on the plank, and they’re equally far from the point, so
the plank is balanced.

Finding a balance point gets trickier when we add more cases, though. In the below
example, if we add one case on the far right side of the plank, we have to move the bal-
ance point further right to keep the plank from tipping over. So, although the balance
point is no longer in the middle of the plank, it’s still the balance point.

The mean works in this way—by balancing the distance from the cases on each side of
the mean, it shows the “center” of the distribution as the point at which both sides are
balanced.
To calculate the mean, we add all the scores (∑X) and divide that sum by the total
number of the scores (N), as shown in the formula below:

$$\bar{X} = \frac{\sum X}{N}$$
In this formula, we have some new and potentially unfamiliar notation. The mean score
is shown as $\bar{X}$ (read as "X-bar"). This is a common way to see the mean written. The mean will also some-
times be written as M. In this case, the letter X stands in for a variable. If we had multiple
variables, they might be X, Y, and Z, for example. The Greek character in the numerator
is sigma (∑), which means “sum of.” So the numerator is read as “the sum of X,” meaning
we will add up all the scores for this variable. Finally, N stands in for the number of cases
or the sample size. The entire formula, then, is that the mean of X is equal to the sum of
all scores on X divided by the number of scores (or sample size).
Let us calculate the mean for some hypothetical example data. Suppose you have
eighth-grade mathematics scores from eight students, and their scores are: 3, 6, 10, 5, 8,
6, 9, and 4. To calculate the mean, we can use the formula above:

$$\bar{X} = \frac{\sum X}{N} = \frac{3 + 6 + 10 + 5 + 8 + 6 + 9 + 4}{8} = \frac{51}{8} = 6.375$$
One particular feature of the mean is that it is sensitive to extreme cases. Those cases are
sometimes called outliers because they fall well outside the range of the other scores. For
example, imagine that in the above example, we had a ninth student whose score was 25.
What happens to the mean?

$$\bar{X} = \frac{\sum X}{N} = \frac{3 + 6 + 10 + 5 + 8 + 6 + 9 + 4 + 25}{9} = \frac{76}{9} = 8.444$$
This one extreme case, or outlier, shifts the mean by quite a bit. If we had several of these
extreme values, we would see even more shift in the mean. For this reason, the mean is
not a good estimate when there are extreme cases or outliers.
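
For readers who like to verify the arithmetic in a scripting language, a few lines of Python (offered only as a cross-check, not as part of the jamovi workflow used in this book) reproduce both means:

scores = [3, 6, 10, 5, 8, 6, 9, 4]
print(sum(scores) / len(scores))          # 6.375

# Adding the outlier of 25 pulls the mean upward
scores_with_outlier = scores + [25]
print(sum(scores_with_outlier) / len(scores_with_outlier))   # about 8.444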

Median
In cases where the mean might not be the best estimate, researchers will sometimes refer
to the median instead. The median is the physical, literal middle of the distribution. It is
the middle score from a set of scores. There is no real formula for the median. If we rank
order the scores and find the middle score, that score is the median. For example, in our
hypothetical eight students above, we could rank order their scores, and find the middle
(in this case, the two 6s at the center of the ordered list):

3, 4, 5, 6, 6, 8, 9, 10

In this case, we don’t have a true middle score, because there is an even number of scores.
When there is an even number of scores, we take the average of the two middle scores as
the median. So, in this case, the median will be:

$$\text{Median} = \frac{6 + 6}{2} = \frac{12}{2} = 6$$
In the second example from above, where we had one outlier, we could rank order the
scores and find the middle score (the fifth of the nine ordered scores):

3, 4, 5, 6, 6, 8, 9, 10, 25

In this case, we have nine scores, so there is a single middle score, making the median
6. Comparing our medians to the means, we see that, with no outliers, the median and
mean are closely aligned. We also see that adding the outlier does not result in any move-
ment at all in the median. While the mean moved by 2.444 with the addition of the
outlier, the median did not move. So, we can see in these examples that the median is less
sensitive to extreme cases and outliers.
The median is also equal to the 50th percentile. That means that 50% of all scores
fall below the median. That’s true because it’s the middle score, so half of the scores are
above, and half are below the median. We return to percentiles in future chapters, but it
is helpful to know that the median is always equal to the 50th percentile.

Mode
The mode is another way of finding the center of the distribution. However, the mode
defines the center as being the most frequently occurring score. In other words, the score
that is the most common is the mode. There is no formula for the mode either. We sim-
ply find the most frequently occurring score or scores. Because the mode is the most
frequently occurring score, there can be multiple modes. There might be more than one
score that occurs the same number of times, and no score occurs more frequently. We
call those distributions bimodal when there are two modes or multimodal when there
are more than two modes. Note that most software, including jamovi, will return the
lowest mode when there is more than one. In our above example:

3, 4, 5, 6, 6, 8, 9, 10

Only one value occurred more than once, which was 6. So the mode is 6. If we add the
outlier score of 25, the mode remains 6 (it is still the value that occurs most often). The
mode, then, is also more resilient to outliers and extreme values.
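
The same kind of check can be scripted for the median and mode; the short Python sketch below (again, simply a cross-check on the hand calculations) shows how little either estimate moves when the outlier is added.

import statistics

scores = [3, 6, 10, 5, 8, 6, 9, 4]
print(statistics.median(scores))          # 6.0 (average of the two middle scores)
print(statistics.mode(scores))            # 6 (the most frequent score)

# With the outlier of 25 added, both estimates stay at 6
print(statistics.median(scores + [25]))   # 6
print(statistics.mode(scores + [25]))     # 6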

Comparing mean, median, and mode


Which of the three measures of central tendency should you use for a given distribution of
data? In the vast majority of cases, you should use the mean. The mean is more consistent
across time and across samples, and most of the statistical analyses we use require that
we use the mean. In most cases, it’s the right choice. However, sometimes the mean is not
the best measure of central tendency. As we noted above, in distributions where there are
extreme cases or outliers, the median might be a better estimate. It is very unusual to see
researchers in educational and behavioral research use the mode. Sometimes, though, it
will be used as an indicator of the “typical” case. It can also be used to evaluate the normal-
ity of a distribution or to indicate problems in the data (such as a multimodal distribution).
Later in this chapter, we will deal with the concept of the normal distribution. How-
ever, we usually expect distributions of scores taken with good measures and in a large
enough sample size to have a normal distribution. In a perfectly normal distribution, the
mean, median, and mode will all be the same. So, in most cases we will not see much dif-
ference between the mean, median, and mode. However, as we mentioned above, most
statistical tests require that we work with the mean.

VARIABILITY
So far, we’ve described central tendency estimates and emphasized that the mean is most
often used. We also described central tendency estimates as point estimates. Point esti-
mates attempt to describe an entire distribution by locating a single point in the center
of that distribution. However, there is another kind of estimate we can use to understand
more about the shape and size of a distribution: variability estimates. These are range,
rather than point, estimates as they give a sense of how wide the distribution is, and
where most scores are located within a distribution. We will explore three estimates of
variability: range, variance, and standard deviation.

Range
Range is the simplest measure of variability to compute and tells us the total size of a
distribution. It is the difference between the highest score and lowest score in the distri-
bution. That means the range is an expression of how far apart lowest and highest values
fall. It can be calculated as:

$$\text{Range} = X_{\text{highest}} - X_{\text{lowest}}$$

From our example above, without any outliers, the highest score was 10, and the lowest
score was 3. Thus, the range is 10 − 3 = 7. If we add the outlier of 25, then the range is
25 − 3 = 22. Because the range is based on the most extreme scores, it tends to be unsta-
ble and is highly influenced by outliers. It also offers us very little information about the
distribution. We have no sense, from range alone, about where most scores tend to fall
or what the shape of the distribution might be. Because of that, we usually rely on other
variability estimates to better describe the distribution.

Variance
Variance is based on the deviation of scores in the distribution from the mean. It meas-
ures the amount of distance between the mean and the scores in the distribution. Vari-
ance reflects the dispersion, spread, or scatter of scores around the mean. It is defined as
the average squared deviation of scores around their mean, often notated as s². Variance
is calculated using the following formula:

$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$

We will walk through this formula in steps. In the numerator, starting inside the paren-
theses, we have deviation scores. We will take each score and subtract the mean from it.
Those deviation scores are the deviation of each score from the mean. Next, we square
the deviation scores, resulting in squared deviation scores. The final step for the numer-
ator is to add up all the squared deviation scores, which gives the sum of the squared
deviation scores. That numerator calculation is also sometimes called the sum of squares
for short. The concept of the sum of squares carries across many of the statistical tests
covered later in this book, and it’s a good idea to get familiar and comfortable with it
now. Finally, we divide the sum of squares by the sample size minus one.1
In the table below, we show how to calculate the variance for our example data from
above. We have broken the process of calculating variance down into steps, which are
presented across the columns. For the original sample (without any outliers), where the
mean was 6.375:

X         X − X̄       (X − X̄)²

3         −3.375       11.391
6         −0.375        0.141
10         3.625       13.141
5         −1.375        1.891
8          1.625        2.641
6         −0.375        0.141
9          2.625        6.891
4         −2.375        5.641
                    ∑ = 41.878

$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{41.878}{8 - 1} = \frac{41.878}{7} = 5.983$$

If we add in the outlier score of 25, where the mean was 8.444:

X         X − X̄       (X − X̄)²

3         −5.444       29.637
6         −2.444        5.973
10         1.556        2.421
5         −3.444       11.861
8         −0.444        0.197
6         −2.444        5.973
9          0.556        0.309
4         −4.444       19.749
25        16.556      274.101
                   ∑ = 350.221

$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1} = \frac{350.221}{9 - 1} = \frac{350.221}{8} = 43.778$$

In these examples, we can see that as the scores become more spread out, variance gets
bigger. One challenge with variance is that it is difficult to interpret. We know that a
variance of 43.778 indicates a wider dispersion around the mean than a variance of 5.983,
but we have no real sense of where the scores are around the mean. While most of our
statistical tests will use variance (or the sum of squares) as a key component, it is difficult
to interpret directly, so most often researchers will report standard deviation, which is
more directly interpretable.
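
The formula is also easy to script step by step (deviation scores, squared deviations, sum of squares, then division by N − 1). The Python function below is offered only as a cross-check on the hand calculations; the tiny difference from 5.983 comes from rounding the deviation scores in the table above.

def sample_variance(scores):
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum((x - mean) ** 2 for x in scores)   # the numerator (sum of squares)
    return sum_of_squares / (n - 1)                         # divide by N - 1

print(round(sample_variance([3, 6, 10, 5, 8, 6, 9, 4]), 3))        # about 5.982
print(round(sample_variance([3, 6, 10, 5, 8, 6, 9, 4, 25]), 3))    # about 43.778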

Standard deviation
Standard deviation is not exactly a different statistic than variance—it is actually a way of
converting variance to make it more easily interpretable and to standardize it. Standard
deviation is often notated as s, though it is sometimes also written as SD, which is simply
an abbreviation. The formula for standard deviation is:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

Standard deviation is s, and variance is s², so we can convert variance to standard devi-
ation by taking the square root. In other words, the square root of variance is standard
deviation. From our examples above, the standard deviation of the scores without any
outliers is:

$$s = \sqrt{s^2} = \sqrt{5.983} = 2.446$$

One major advantage of standard deviation is that it is directly interpretable using some
simple rules. We’ll explain two sets of rules for interpreting standard deviation next.

Interpreting standard deviation


There are basic guidelines for interpreting the standard deviation. Which guidelines
apply depends on whether the data are normally distributed. We will return to this idea
later in this chapter and explain how to determine whether a distribution is normal or
non-normal. Most data in educational and behavioral research will be normally distrib-
uted in a large enough sample. In cases where the data are normally distributed, standard
deviation can be interpreted in this way:

• About 68% of all scores will fall within ±1 standard deviation of the mean.
• About 95% of all scores will fall within ±2 standard deviations of the mean.
• More than 99% of all scores will fall within ±3 standard deviations of the mean.

For example, in our data without outliers, the mean was 6.375 with a standard devi-
ation of 2.446 (M = 6.375, SD = 2.446). Based on that, we could expect to find about
68% of the scores between 3.929 and 8.821. To get those numbers, we take the mean and
subtract 2.446 for the lower number and add 2.446 for the higher number. We could add
and subtract the standard deviation a second time to get the 95% range. Based on that,
we’d find that about 95% of the scores should fall between 1.483 and 11.267. It is worth
noting that the interpretation of standard deviation gets a bit cleaner and more realistic
in larger samples.
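
As a quick check on those ranges, a few lines of Python (again, simply a cross-check rather than part of the jamovi workflow) give the same numbers:

import statistics

scores = [3, 6, 10, 5, 8, 6, 9, 4]
m = statistics.mean(scores)       # 6.375
sd = statistics.stdev(scores)     # sample standard deviation, about 2.446

print(m - sd, m + sd)             # about 3.93 to 8.82, where roughly 68% of scores fall
print(m - 2 * sd, m + 2 * sd)     # about 1.48 to 11.27, where roughly 95% of scores fall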
What if the data were non-normal? In that case, we can use Chebyshev’s rule to inter-
pret the standard deviation. In this rule:

• At least 3/4 of the data are within ± 2 standard deviations of the mean.
• At least 8/9 of the data are within ± 3 standard deviations of the mean.

It is worth noting that estimates of variance and standard deviation become more stable in
larger samples, all else being equal. In other words, as sample sizes get bigger, any single
extreme score has less influence on the variance and standard deviation we calculate. This
becomes an important point in some later analyses that compare groups, as it is one reason
to prefer roughly equal group sizes. It also means that our ranges (like the 95% range) based
on standard deviation become more precise and meaningful as the sample size increases.
The interpretation does not change in a larger sample, but the estimation is going to be
more precise.

VISUAL DISPLAYS OF DATA


In addition to describing a distribution of data with central tendency and variability esti-
mates, researchers often want to visualize a distribution as well. There are several ways of
organizing, displaying, and examining data visually. For many of these visual displays of
data, organizing the data into some kind of groups or categories will be helpful. Below,
we’ll review the two most common types of visual displays of data: frequency tables and
histograms.
A frequency table is simply a table that has two columns: one for the score or set of
scores, and one for the frequency with which that score or set of scores occur. Using our
example from above, we might create a frequency table like this:

Score Frequency

1 0
2 0
3 1
4 1
5 1
6 2
7 0
8 1
9 1
10 1

In a small sample, like the one with which we are working, it can be useful to categorize
the scores in some way. For example, perhaps we might split our scores into 1–2, 3–4,
5–6, 7–8, and 9–10:

Score Frequency

1–2 0
3–4 2
5–6 3
7–8 1
9–10 2

This kind of collapsing of values into categories will probably be unnecessary in larger
samples, where we are likely to have multiple participants at every score; in the case of a
small sample, however, it can help us visualize the distribution more easily. A frequency
table is the simplest way to display the data. Sometimes, frequency tables have additional
columns. For example, in jamovi, the software produces frequency tables that have a
column for the percentage of cases for each category/score as well.
The second kind of visual display we’ll introduce here is a histogram. Histograms take
the information from a frequency table and turn it into a graph. A histogram is essen-
tially a bar graph with no spaces between the bars. Across the horizontal, or X, axis will
be the scores or categories, and the vertical, or Y, axis will have the frequencies. The
histogram for our example data appears below.

The histogram allows us to visualize better the shape of the distribution and how scores
are distributed. Histograms are very commonly used in all kinds of research and are
easily produced with jamovi and other software. One way we often use histograms is to
visually inspect a distribution to determine if it is approximately normal.
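
Outside of jamovi, the same frequency table and histogram can be produced with a short Python script. The sketch below uses matplotlib and follows the binning used above; it is only an illustration of the idea, not part of the book's workflow.

import matplotlib.pyplot as plt

scores = [3, 6, 10, 5, 8, 6, 9, 4]

# Frequency counts for the binned categories used in the text
bins = {"1-2": (1, 2), "3-4": (3, 4), "5-6": (5, 6), "7-8": (7, 8), "9-10": (9, 10)}
for label, (low, high) in bins.items():
    count = sum(low <= x <= high for x in scores)
    print(label, count)

# Histogram of the raw scores
plt.hist(scores, bins=5)
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()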

THE NORMAL DISTRIBUTION


We have mentioned the normal distribution several times in this chapter with the prom-
ise we would provide more detail later. In this section, we will explore the normal dis-
tribution, why it matters, and how we can tell if our distribution is normal or not. The
normal distribution is a theoretical distribution with an infinite number of scores and a
known shape. The shape of the normal distribution is sometimes called the bell-shaped
curve because its appearance looks kind of like a bell. The normal distribution is sym-
metrical (exactly the same on both halves), asymptotic (never reaches zero), and has an
exact proportion of scores at every point. The shape of the normal curve is shown below,
with markers for each standard deviation as vertical lines.

This distribution is theoretical, but we expect most score distributions to approximate


the normal curve. That is because most phenomena have a large middle and skinny end
to the distribution. In other words, most scores cluster around the average. Also, in a
normal distribution, the mean, median, and mode will all be equal because the distri-
bution is symmetrical. All of the statistical tests we will encounter later in this text will
assume that our distribution of scores is normal, too, so it is important that we evaluate
the normality of our data.
One of the ways that the normal distribution is extremely useful in research is that
we know exactly what proportion of scores are at each point in the distribution and
where scores fall in the distribution. Earlier, we mentioned interpretive guidelines for
standard deviation in a normal distribution. In those guidelines, we said “about” how
many scores fall in each range. That’s because the exact percentages within ±1 or ±2
standard deviations are slightly different from those guidelines, which are rounded to
the nearest whole percent. You can see the exact percentages in the figure above. Later
in this chapter, we will also learn how to use the normal distribution to calculate things
like percentiles or the proportion of a sample that falls between two values. However,
not all distributions are normal, and it is important to test whether a given distribution
is actually normally distributed. There are two ways in which samples can deviate from
normality: skew and kurtosis.

Skew
Skewed distributions are asymmetrical, so they will have one long tail and one short
tail. Because they’re asymmetrical, skewed distributions will have a mean and median
that are spread apart. The mean will be pulled a little way out toward the long tail, while the
median will stay closer to the high point in the histogram. One way to think about skew
is to think of the peak in the histogram as being pushed toward one side. As is clear in the
figure below, in a negatively skewed distribution, the long tail will be on the left, and in a
positively skewed distribution, the long tail will be on the right.

There is a statistic for evaluating skew, which jamovi labels “skewness.” Later in this chap-
ter, we will walk through how to produce all the statistics in the chapter using jamovi.
We will not review how the skewness statistic is calculated for the purposes of this book.
However, skewness is only interpretable in the context of the standard error of skewness.
When the absolute value of skewness (that is, ignoring whether the skewness statistic
is positive or negative) is less than two times the standard error of skewness, the distri-
bution is normal. If the absolute value of skewness is more than two times the standard
error of skewness, the distribution is skewed. If the distribution is skewed and the skew-
ness statistic is positive, then the distribution is positively skewed. If the distribution is
skewed and the skewness statistic is negative, then the distribution is negatively skewed.
For example, if we find that skewness = 1.000 and SEskewness = 1.500, then we can con-
clude the distribution is normal. Two times 1.500 is 3.000, and 1.000 is less than 3.000,
so the distribution is normal. However, if we find that skewness = −2.500 and SEskewness
= 1.000, then we can conclude the distribution is negatively skewed. Two times 1.000
is 2.000, and 2.500 (the absolute value of −2.500) is more than 2.000, so we know the
distribution is skewed. The skewness statistic is negative, so we know the distribution is
negatively skewed.
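
This rule of thumb is easy to script. The Python sketch below uses scipy's bias-corrected skewness and the standard-error formula commonly implemented in statistical packages; for the chapter's example data it reproduces the skewness (.201) and standard error (.752) that jamovi reports later in this chapter. It is offered only as a supplementary check.

from math import sqrt
from scipy.stats import skew

scores = [3, 6, 10, 5, 8, 6, 9, 4]
n = len(scores)

g1 = skew(scores, bias=False)                                    # sample skewness, about 0.201
se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))  # about 0.752

# Rule of thumb from the text: skewed if |skewness| > 2 * SE of skewness
print(round(g1, 3), round(se_skew, 3), abs(g1) > 2 * se_skew)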

Kurtosis
The other way that distributions can deviate from normality is kurtosis. While skew
measures if the distribution is shifted to the left or right, kurtosis measures if the peak
of the distribution is too high or too low. There are two kinds of kurtosis we might
find. Leptokurtosis occurs when the peak of the histogram is too high, indicating there
are a disproportionate number of cases clustered around the median. Platykurtosis
occurs when the peak is not high enough (the histogram is too flat), indicating too few
cases are clustered around the median. The figure below shows how these distributions
might look.

(Figure: examples of leptokurtic, mesokurtic, and platykurtic distributions.)

There is also a statistic we can use to evaluate kurtosis, which in jamovi is labeled, simply,
kurtosis. Like the skewness statistic, it is interpreted in the context of its standard error.
In fact, the interpretive rules are essentially the same. If the absolute value of kurtosis
is less than two times the standard error of kurtosis, the distribution is normal. If the
absolute value of kurtosis is more than two times the standard error of kurtosis and the
kurtosis statistic is positive, the distribution is leptokurtic. If it is negative, the distribu-
tion is platykurtic. In kurtosis, when the distribution is normal it may be referred to as
mesokurtic. In other words, a normal distribution demonstrates mesokurtosis. There is
no similar term for normal skewness.
A distribution can have skew, kurtosis, both, or neither. A normal distribution will
not have skew or kurtosis. Non-normal distributions might be non-normal due to skew,
kurtosis, or both. It is fairly common, though, for skew and kurtosis to occur together.
There is a tendency for data showing a strong skew also to be leptokurtic. That pattern
makes sense because if we push scores toward one end of the scale, the likelihood the
scores will pile up too high is strong.
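
The same kind of check can be scripted for kurtosis. The sketch below uses scipy's bias-corrected kurtosis and the standard-error formula commonly implemented in statistical packages; for the chapter's example data it reproduces the values (−1.141 and 1.480) that jamovi reports later in this chapter.

from math import sqrt
from scipy.stats import kurtosis

scores = [3, 6, 10, 5, 8, 6, 9, 4]
n = len(scores)

g2 = kurtosis(scores, fisher=True, bias=False)                   # excess kurtosis, about -1.141
se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * sqrt((n**2 - 1) / ((n - 3) * (n + 5)))   # about 1.480

# Rule of thumb from the text: problematic kurtosis if |kurtosis| > 2 * SE of kurtosis
print(round(g2, 3), round(se_kurt, 3), abs(g2) > 2 * se_kurt)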

Other tests of normality


There are other, more sensitive, and advanced tests for normality. Most notable is the
Kolmogorov-Smirnov test or K-S test. This test can evaluate how closely observed dis-
tributions match any expected or theoretical distribution, but its most common use is to
test whether a sample distribution matches the normal distribution. This test can be pro-
duced in most software, including jamovi. However, the K-S test is much more sensitive
than other measures and becomes more sensitive as the sample size increases. In other
words, it is more likely that the K-S test will indicate non-normality than most other
measures of normality, and that likelihood increases when the sample size gets larger.
For this reason, we suggest in this text to default to visual inspection of the histogram
plus an evaluation of the skewness and kurtosis statistics as suggested above.
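
For completeness, the sketch below shows one way to run a K-S test against a normal distribution in Python using scipy. Note that plugging in the sample's own mean and standard deviation makes the test approximate (this is the issue the Lilliefors correction addresses), which is one more reason to treat it as supplementary rather than primary evidence.

from scipy import stats
import numpy as np

scores = np.array([3, 6, 10, 5, 8, 6, 9, 4])

# K-S test of the sample against a normal distribution with the sample's own mean and SD
result = stats.kstest(scores, 'norm', args=(scores.mean(), scores.std(ddof=1)))
print(result.statistic, result.pvalue)   # a small p-value would suggest non-normality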

STANDARD SCORES
One application of the normal distribution is in the calculation of standard scores.
Standard scores are also commonly referred to as z-scores. We can convert any score to
a z-score based on the mean and standard deviation for the sample of the distribution
from which that score came. These standard scores can solve a number of problems such
as scores with different units of measure and can also be used to calculate percentiles and
proportions of scores within a range. In addition, standard scores always have a mean of
zero and a standard deviation of one.

Calculating z-scores
To calculate standard scores, we use the following formula:
$$z = \frac{X - \bar{X}}{s}$$
In other words, the standard score (or z-score) is equal to the difference between the
score and the mean, divided by the standard deviation.
One way we can use these standard scores is to compare scores from different scales
or tests that have different units of measure. Imagine that we have scores for students on
a mathematics achievement test, where the mean score is 20, and the standard deviation
is 1.5. We also have scores on a writing achievement test, where the mean score is 35, and
the standard deviation is 4. A student, John, scores an 18 on the mathematics test and a
34 on the writing test. In which subject does John show higher achievement? We cannot
directly compare the two test scores because they are on different scales of measurement.
However, by converting to standard scores, we can directly compare the two test scores:

$$\text{Mathematics: } z = \frac{X - \bar{X}}{s} = \frac{18 - 20}{1.5} = \frac{-2}{1.5} = -1.333$$

$$\text{Writing: } z = \frac{X - \bar{X}}{s} = \frac{34 - 35}{4} = \frac{-1}{4} = -0.250$$
Based on these calculations, we can conclude that John had higher achievement in writ-
ing. His z-score for writing was higher than for mathematics, so we know his perfor-
mance was better on that test. Because we know the z-scores have a mean of zero and a
standard deviation of one, we can do this kind of direct comparison.
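
The calculation is simple enough to script. The Python sketch below defines a small helper function (our own, for illustration only) and reproduces John's two standard scores:

def z_score(x, mean, sd):
    return (x - mean) / sd

math_z = z_score(18, 20, 1.5)    # about -1.333
writing_z = z_score(34, 35, 4)   # -0.250

# The higher (less negative) z-score marks the relatively stronger performance
print(math_z, writing_z)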

Calculating percentiles from z


We can also use z-scores to calculate percentiles. Using Table A1 in this book, we can
find the percentile for any z-score. If we look up John’s mathematics z-score (−1.333),
for example, we see that his percentile score would be 9.18. That means that 9.18% of all
students scored lower than John on the mathematics test. On the other hand, if we look
up John’s writing standard score (−0.250), we find that his percentile score for writing is
40.13. So, 40.13% of all students scored lower than John in writing.

Another use for standard scores is in determining the proportion of scores that would
fall in a given range. Let us imagine a depression scale with a mean of 100 and a standard
deviation of 15, where a higher score indicates more depressive symptoms. What pro-
portion of all participants would we expect to have a score between 90 and 120? We can
answer this with standard scores. We’ll start by calculating the z-score for 90 and for 120:
$$z = \frac{X - \bar{X}}{s} = \frac{90 - 100}{15} = \frac{-10}{15} = -0.667$$

$$z = \frac{X - \bar{X}}{s} = \frac{120 - 100}{15} = \frac{20}{15} = 1.333$$
Using those standard scores, we find the percentile for z = −0.667 is 25.14, and for z =
1.333 is 90.86. So, what percent of scores will fall between 90 and 120? We simply sub-
tract (90.86 − 25.14) to find that 65.72% of all depression scores will fall between 90 and
120 on this test.
This procedure is how we determined what percentage of scores would fall within
±1 and ±2 standard deviations of the mean. At +1 standard deviation, z = 1.000, and
at −1 standard deviation, z = −1.000. Looking at Table A1, we find that the percentiles
would be 84.13 and 15.87, respectively. So, the area between −1 and +1 is 84.13 − 15.87 =
68.26%. This is why we say about 68% of the scores will fall within ±1 standard deviation
of the mean.
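
If Table A1 is not handy, the normal cumulative distribution function gives the same percentiles. The short Python sketch below uses scipy; the small differences from the table values reflect the table's rounding of z to two decimal places.

from scipy.stats import norm

# Percentile for John's mathematics z-score of -1.333
print(round(norm.cdf(-1.333) * 100, 2))    # about 9.13 (Table A1 gives 9.18 for z = -1.33)

# Proportion of depression scores between 90 and 120 (z-scores of -0.667 and 1.333)
proportion = norm.cdf(1.333) - norm.cdf(-0.667)
print(round(proportion * 100, 2))          # about 65.6%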

CALCULATING CENTRAL TENDENCY, VARIABILITY,


AND NORMALITY ESTIMATES IN JAMOVI
In this chapter, we have learned about central tendency, variability, and normality esti-
mates, as well as standard scores. In this last section of the chapter, we’ll demonstrate
how to get jamovi to produce some of those statistics. We will continue working with the
data we introduced earlier in the chapter, where we had test scores of 3, 6, 10, 5, 8, 6, 9,
and 4. We’ll start with a brief overview of how to use the jamovi software.
To begin, you can visit www.jamovi.org to install the software. Once installed, open the
program. When jamovi first opens, you will see a screen something like this.

By default, jamovi opens a blank, new data file. We can click “Data” on the top toolbar,
and then “Setup” to set up our variables in the dataset. In that dialogue box, we can
specify the nature of our variables. For this example, we will name the variable "Score" and
specify that it is a continuous variable.

Within the setup menu, there are various options we can set:

• Name. We must name our variables. There are some restrictions on what you
can use in a variable name. They must begin with a letter and cannot contain any
spaces. So, for our purposes, we’ll name the variable Score.
• Description. We can put a longer variable name here that is more descriptive. This
field does not have restrictions on the use of spaces, special characters, etc.
• Level of measurement. There are four radio buttons to select the data's level of
measurement. For interval or ratio data, we will select "Continuous." For ordinal
data, select "ordinal," and for nominal data, select "nominal." (The fourth option, ID,
is for identifier variables, such as participant numbers, that will not be analyzed.)
• Levels. For nominal or ordinal variables, this area can be used to name the groups
or categories after the data are entered. Once data are entered, this field will popu-
late with the numeric codes in the dataset. By clicking on those codes in this area,
you can enter a label for the groups/categories. We will return to this function in a
few chapters and clarify its use.
• Type. In this field, you can change from the default type of integer (which means
a number). Usually the only reason to change this will be if you have text varia-
bles (like free-text responses, names, or other information that cannot be coded
numerically).

We can then click the upward-facing arrow to close the variable setup. Then, we can
enter our data in the spreadsheet below.

Next, we will ask the software to produce the estimates of central tendency, variabil-
ity, and normality. To do so in jamovi, we will click Analyses, then Exploration, then
Descriptives. In the resulting menu, we can click on the variable we wish to analyze (in
this case, Score), and click the arrow button to select it for analysis. We can also check the
box to produce a frequency table, though it will only produce such a table for nominal
or ordinal data.

Then, under Statistics, we can check the boxes for various descriptive statistics we wish to
produce, including mean, median, mode, standard deviation, variance, range, standard
error of the mean, skewness, kurtosis, and other options.

Finally, under Plots, we can check the box to produce a histogram.

As you select the various options in the analysis settings on the left, the output will
populate on the right. It updates as you change options in the settings, meaning there
is no “run analysis” or other similar button to click—the analyses are running as
we choose them. The descriptive statistics we requested are all in the table under
Descriptives.

The histogram appears under the Plots heading.

Finally, notice that there are references at the bottom of the output, which may be useful
in writing for publication if references to the software or packages are requested.
The estimates in the table are slightly different from our calculations earlier in this
chapter because we consistently rounded to the thousandths place, and jamovi does not
round at all in its calculations. We can also note from this analysis that the distribution
appears to be normal because for skewness, .201 is less than two times .752, and for kur-
tosis, 1.141 (absolute value of −1.141) is less than two times 1.480.
Finally, to save your work: The entire project can be saved as a jamovi project file. In
addition, you can save the data in various formats by clicking File, then Export. Options
include saving as an SPSS file (.sav format), a comma-separated values format (.csv—this
format is widely used and would be compatible with almost any analysis software), or
other formats. To save the output only, you can right click in the output, then go to All,
then Export. This will allow you to save the output as a PDF or HTML format. Notice
that you can also export individual pieces of the analysis. You can also copy and paste any
or all of the analysis in this way.

CONCLUSION
In this chapter, we have explored ways to describe samples using central tendency and
variability estimates. We have also demonstrated how to evaluate whether a sample is
normally distributed, and the properties of the normal distribution. Then we explained
how to convert scores to standard scores (or z-scores) to use the normal distribution for
comparisons, calculating percentiles, and determining proportions of scores in a given
range. Finally, we demonstrated how to calculate most of these estimates using jamovi.
In the next chapter, we will work with these and similar concepts to understand the null
hypothesis significance test.

Note
1 The denominator has N − 1 when calculating the variance of a sample. If we were calculat-
ing variance for a population, the denominator would simply be N. However, researchers
in educational and behavioral research almost never work with population-level data, and
the sample formula will almost always be the correct choice. Some other texts and online
resources, though, will show the formula with N as the denominator, which is because they
are presenting the population formula.
Part II
Null hypothesis significance testing

4
Introducing the null
hypothesis significance test

Variables
Independent variables
Dependent variables
Confounding variables
Hypotheses
The null hypothesis
The alternative hypothesis
Overview of probability theory
Calculating individual probabilities
Probabilities of discrete events
Probability distributions
The sampling distribution
Calculating the sampling distribution
Central limit theorem and sampling distributions
Null hypothesis significance testing
Understanding the logic of NHST
Type I error
Type II error
Limitations of NHST
Looking ahead at one-sample tests
Notes

In the previous chapters, we have explored fundamental ideas and concepts in educa-
tional research, sampling methods and issues, and basic educational statistics. In this
chapter, we will work toward applying those concepts in statistical tests. The purposes
of this chapter are to introduce types of variables that might be part of a statistical test,
to introduce types of hypotheses, to give an overview of probability theory, discuss sam-
pling distributions, and finally to explore how those concepts are used in null hypothesis
significance testing.

VARIABLES
There are several types of variables that you might encounter in designing research or
reading about completed research. We will briefly define each and give some examples
of what sorts of things might fit in each category. All of the research designs in this text
require at least one independent variable and one dependent variable. However, some
tests can also include mediating, moderating, and confounding variables.

Independent variables
In the simplest terms, an independent variable is the variable we suspect is driving or
causing differences in outcomes. We have to be very cautious here because claiming a
variable causes outcomes takes very specific kinds of evidence. However, the logic of an
independent variable is that it would be a potential or possible cause of those outcomes.
The naming of these variables as independent is because the independent variable would
normally be manipulated by the researcher. This is accomplished through random assign-
ment, which we described in a previous chapter. By randomly assigning participants to
conditions on the independent variable, we make it independent of other variables like
demographic factors or prior experiences. Because it has been randomly assigned (and
was manipulated by the researchers), the only systematic difference between groups is the
independent variable. Examples of independent variables might be things like treatment
type (randomly assigned by researchers), the type of assignment a student completes
(again, randomly assigned by researchers), or other experimentally manipulated variables.
In a lot of educational research scenarios, though, random assignment is not possible,
is impractical, or is unethical. Often, researchers are interested in studying how out-
comes differ based on group memberships that cannot be experimentally manipulated.
For example, when we study racialized achievement gaps, it is impossible to assign race
randomly. If we want to study differences in outcomes between online and traditional
face-to-face courses, we can normally not randomly assign students as they self-select
the type of course they want to take. In these cases, we might still treat those things (race,
class type) as independent variables, even though they are not true independent varia-
bles because they have not been randomly assigned. In those cases, some researchers will
refer to these kinds of variables as pseudo-independent or quasi-independent variables.

Dependent variables
If the independent variable is the variable we suspect is driving or causing differences
in outcomes, the dependent variable is the outcome we are measuring. It is called the
dependent variable because we believe scores on this variable depend on the independent
variable. For example, if a researcher is studying reading achievement test score differ-
ences by race, the achievement test score is the dependent variable. It is possible to have
more than one dependent variable as well. In general, the tests in this text will allow only
one dependent variable at a time, but there are other more advanced analyses (called
multivariate tests) that will handle multiple dependent variables simultaneously. One
method that can help identify the independent versus dependent variable is to diagram
what variables might be leading to, influencing, driving, or causing the other variable.
For example, if a researcher models their variables like this:

Class type (online versus face-to-face) → Final exam scores

This diagram shows that the researcher believes class type influences or leads to differ-
ences in final exam scores. So, class type is the independent (or pseudo-independent)
variable, and final exam scores are the dependent variable. In this kind of diagram, the
variable the arrow points away from is independent, and the variable the arrow points
toward is dependent.

Confounding variables
Another kind of variable to consider are those that make a difference in the dependent
variable other than the independent variable. In other words—variables that change the
outcome other than the independent variable. There are a few ways this can happen,
but these variables are generally known as confounding variables. They are called con-
founding because they “confound” the relationship between independent and dependent
variables. Confounding variables might also be unmeasured. It could be that there are
variables that change the outcome that we have not considered or measured in our
research design, which would create confounded results. Ideally, though, we would iden-
tify and measure potential confounding variables as part of the design of the study.
One issue confounding variables create is what is known as the third variable prob-
lem. The third variable problem is the fact that just because our independent variable
and dependent variable vary together does not mean that one causes the other. It might
be that some other variable (the “third variable”) actually causes both. For example,
there is a strong, consistent relationship between ice cream consumption and deaths by
drowning. Does eating ice cream cause death by drowning? We might intuitively suspect
that this is not the case. However, how can we explain the fact that these two variables
co-vary? In this case, a third variable explains both: summer heat. When it gets hot out-
side, people become more likely to do two things: eat cold foods like ice cream and go
swimming to cool down. The more people swim, the more drowning deaths are likely
to occur. So, the relationship between ice cream consumption and drowning deaths is
not causal—it is an example of the third variable problem. Often in applied educational
research, the situation will not be so clear. There might be some logical reason to suspect
a causal relationship. However, good research design will involve identifying, measuring,
and excluding third-variable problems.
There are also potential confounding variables that serve as mediator or moderator
variables. These variables alter or take up some of the relationship between independent
and dependent variables. Moderators are usually grouping variables (categorical variables,
usually nominal) where the effect of the independent variable on the dependent differs
between groups. For example, we may find that an intervention aimed at increasing the
perceived value of science courses works better for third graders than it does for sixth
graders. There is a relationship between the intervention and perceived value of sci-
ence—but that relationship differs based on group membership (in this case, the grade
in school). Mediator variables are usually continuous (interval or ratio) and explain the
relationship between independent and dependent variable. For example, if our interven-
tion increases perceived value of science courses, we might wonder why it does so. Per-
haps the intervention helps students understand more about science courses (increases
their knowledge of science) and that increased knowledge leads to higher perceived
value for science courses. In that case, knowledge might be a mediator, and we might
find that the intervention increases knowledge, which in turn increases perceived value
(intervention → knowledge → perceived value). In some cases, the mediation is only
partial, meaning that the mediator doesn’t take up all of the relationship between the
independent and dependent variable, but does explain some of that relationship.

HYPOTHESES
In quantitative analysis, we must specify a hypothesis beforehand and then test our
hypotheses using probability analyses. The specific kind of testing used in most (but not
all) quantitative analysis is null hypothesis significance testing (NHST). We will return
to NHST later in this chapter, but first, we will talk about what hypotheses in these kinds
of analyses look like.

The null hypothesis


The null hypothesis is our hypothesis that the results are null. In other words, the null
hypothesis is that there is nothing to see here. In group comparisons, the null hypoth-
esis will be that there is no difference between groups, for example. Take an example
where we might compare the means of two groups. The null hypothesis would be that
the means of the two groups are equal, or put another way, that there is zero difference
in the group means. The null hypothesis is often notated as H0, and our example of two
group means might be written as:

$$H_0: \bar{X} = \bar{Y}$$

In other analyses, the null hypothesis might be that there is no relationship between
two variables. In any case, the null hypothesis will always be that there is no meaningful
difference or relationship.

The alternative hypothesis


The alternative hypothesis is the opposite of the null. Sometimes called the research
hypothesis, this hypothesis will be that there is some difference or relationship. For
example, if we are testing the difference in means of two groups, the alternative hypoth-
esis will be that those means are different. Some research designs might involve more
than one alternative hypothesis, but the first one is notated as H1. If there were multiple
alternative hypotheses, the next hypotheses would be H2, H3, and so on. For our example
involving two groups, the alternative hypothesis might be written as:

$$H_1: \bar{X} \neq \bar{Y}$$

Alternative hypotheses can also potentially be directional. We will explore this more
in the next few chapters, but it could be that our hypothesis is that group X will have a
higher mean than group Y. We can specify that directionality in the hypothesis. We’ll
return to this idea in a later chapter and give examples of when it might be appropriate.
We’ll evaluate our hypotheses using NHST. Those tests operate based on probabilities,
and our decision about these hypotheses will be based on probability values. Because of
that, we next briefly review the basics of probability theory.

Overview of probability theory


We’ll begin by exploring some basics in probability theory, and then we will work up to
applying it to statistical tests, specifically NHST. Probabilities are always expressed as a
value between 0.000 and 1.000. They can be converted to a percentage through multiplying
by 100. So, if an event had a probability of p = .250, then we’d expect that event to occur in
about 25% of cases. Stated another way, there is about a 25% chance of that event occurring.

Calculating individual probabilities


In order to calculate these probability values, we will divide the number of occurrences
over the total number of cases (or possible outcomes, often called the sample space). In a
simple example, imagine flipping a coin. There are two possible outcomes: heads or tails.
So, the sample space is two. If we want to calculate the probability of flipping a coin and
getting heads, we simply divide the number of cases that meet our criteria (only one side
is heads) out of the sample space (there are two sides). So, p(A) = A/N, where A is the
event whose probability we want to calculate, and N is the sample space. In the case of
calculating the probability of flipping a coin and getting heads, p(heads) = 1/2 = .500. In
other words, there’s about a 50% chance of getting heads on a coin flip.
Let us take another example: Imagine rolling a standard six-sided die. Such dice have
one number per side (1, 2, 3, 4, 5, and 6). There are six possible outcomes, making the
sample space (N) six. We can calculate the probability of rolling a die and getting a given
value. What is the probability of rolling the die and getting a 3? There is one 3 side (A)
and six total sides (N). p(3) = 1/6 = .167. There’s about a 17% chance on any given roll of
the die of rolling a 3.
In terms of educational research, the sample space is often the total number of people
in a sample, class, school, population, or another grouping. For example: Imagine a class
of 30 students in various degree plans. In the class, 10 are elementary education students,
5 are special education students, 7 are kinesiology students, and 8 are science education
students. If the instructor randomly draws a name, what is the probability the name will be
of a science education student? The total number of students is 30 (N), and there are 8 sci-
ence education students (A). p(science education) = 8/30 = .267. There is thus about a 27%
chance that a randomly drawn student from that class will be a science education student.
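
The arithmetic is simple enough to check in a line or two of Python (shown only to mirror the examples above):

# p(A) = A / N: favorable outcomes divided by the sample space
print(1 / 2)     # heads on a coin flip: .500
print(1 / 6)     # rolling a 3 on a six-sided die: about .167
print(8 / 30)    # drawing a science education student from the class of 30: about .267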

Probabilities of discrete events


We can also calculate the probability that two discrete events will both occur. Discrete
events are events that do not affect the probability of other events. For example, die roll
outcomes are discrete events because the probability of future rolls does not depend on
past rolls. In other words, regardless of which side of the die landed face up on the last
roll, the probability for the next roll stays the same. Similarly, the probability of getting
heads on a coin flip does not depend on which side the coin landed on before. The
probabilities are independent. In
such cases, there are two kinds of combinations we can calculate.
First, we can calculate the probability that two discrete events will both occur. For
example, what is the probability of flipping a coin twice and getting heads both times?
Stated another way, what is the probability of getting heads on one flip, and getting heads
on a second flip? To calculate this kind of probability, we will use the formula: p(A&B) =
(p(A))(p(B)). For our cases, the probability of getting heads on any one flip is p(heads) =
1/2 = .500. So, p(heads & heads) = (p(heads))(p(heads)) = (.500)(.500) = .250. The prob-
ability of flipping a coin twice in a row and getting heads both times is p = .250. There
is about a 25% chance of that happening. In another example, what is the probability
of rolling two standard six-sided dice and getting a 6 on both? As we found earlier, the
probability of getting a specific result on a die roll is .167. p(6 & 6) = (p(6))(p(6)) = (.167)
(.167) = .028. In other words, when rolling two standard six-sided dice, both would come
up 6 about 3% of the time. From the example about randomly calling names in a class,
what is the probability that the instructor would randomly draw two names, and one
would be a science education student while the second would be a kinesiology student?1
p(science education & kinesiology) = p(science education) * p(kinesiology) = (8/30) *
(7/30) = .267 * .233 = .062. We’ll apply this logic to thinking about the probability of
getting samples with certain combinations of scores.
The other way we can work with discrete events is to ask the probability of getting one
outcome or another. For example, what is the probability of rolling two dice and at least
one of them coming up 6?2 In such a case, the formula is p(A or B) = p(A) + p(B). So, the
probability of rolling two dice and at least one of them coming up 6 is p(6 or 6) = .167
+ .167 = .334. At least one of the two dice will come up 6 about 33% of the time. What
is the probability an instructor will draw two names at random, and at least one will be
either a science education student or a kinesiology student? p(science education or kine-
siology) = p(science education) + p(kinesiology) = .267 + .233 = .500.
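As an illustration of both rules, the short Python sketch below reproduces the calculations in this subsection. The variable names are ours; the probabilities come from the coin, dice, and classroom examples, and (as noted in footnote 2) the "or" rule here is the simple additive version used in the text.

    # Independent events: p(A and B) = p(A) * p(B); p(A or B) = p(A) + p(B).
    p_heads = 1 / 2
    p_six = 1 / 6
    p_science = 8 / 30
    p_kinesiology = 7 / 30

    print(p_heads * p_heads)           # two heads in a row: .250
    print(p_six * p_six)               # a 6 on both dice: about .028
    print(p_science * p_kinesiology)   # science education then kinesiology: about .062
    print(p_six + p_six)               # a 6 on at least one die (the text's additive rule): about .333
    print(p_science + p_kinesiology)   # science education or kinesiology: .500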

Probability distributions
We can also combine the probabilities of all possible outcomes into a single table. That
table is called a probability distribution. In a probability distribution, we calculate the
independent probabilities of all possible outcomes and put them in a table. For example:

    Elementary   Special      Kinesiology   Science      Total
    Education    Education                  Education

n   10           5            7             8            30
p   0.333        0.167        0.233         0.267        1.000

So, if we randomly draw a name from the example class of 30 we described before, this
table shows the probability that the student whose name we draw will be in a given
program. Notice that the total of those independent probabilities is 1.000. If we draw a
name, the chances that student will be in one of these four programs are 100%, because
those are the only programs represented in the class.
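A quick Python sketch shows how such a probability distribution can be built from raw counts; the dictionary and variable names are ours, and the counts come from the class example above.

    # Building the probability distribution table above from the raw counts.
    counts = {"Elementary Education": 10,
              "Special Education": 5,
              "Kinesiology": 7,
              "Science Education": 8}

    n_total = sum(counts.values())                        # sample space N = 30
    distribution = {major: n / n_total for major, n in counts.items()}

    for major, p in distribution.items():
        print(f"{major}: {p:.3f}")
    print("Total:", round(sum(distribution.values()), 3))  # independent probabilities sum to 1.000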

THE SAMPLING DISTRIBUTION


However, in calculating the statistical tests we will learn in this text, we will normally
not be interested in the probability of a single case but in combinations of cases. We will
calculate things like the probability of a sample with a given mean or the probability of
a mean difference of a certain amount. To do that, we’ll need to take one step beyond
probability distributions to sampling distributions.
Put simply, a sampling distribution is a distribution of all possible samples of a given
size from a given population. For the sake of illustration, imagine that the class of 30 we
described above is the population. Of course, in real research, populations tend to be
huge, but we will imagine these 30 students are our population to illustrate the idea of a
sampling distribution. Now imagine that, from this population of 30, we draw a sample
of two (in other words, we pull two names). What are the possible compositions of that
sample? We could have the following samples:

Name 1                  Name 2

Elementary Education    Elementary Education
Elementary Education    Special Education
Elementary Education    Kinesiology
Elementary Education    Science Education
Special Education       Elementary Education
Special Education       Special Education
Special Education       Kinesiology
Special Education       Science Education
Kinesiology             Elementary Education
Kinesiology             Special Education
Kinesiology             Kinesiology
Kinesiology             Science Education
Science Education       Elementary Education
Science Education       Special Education
Science Education       Kinesiology
Science Education       Science Education

In total, there are 16 possible samples we could get by drawing two names at random
from this class of 30. The distribution of those samples is the sampling distribution. In
applied research, we likely have populations that number in the millions, and samples
that number in the hundreds. In that case, the number of possible samples gets much
larger. The number of possible samples also increases when there are more possible out-
comes (for example, in this case, if we had five majors represented in the class, we would
have 25 possible samples).

Calculating the sampling distribution


Let us take another example, and this time we will use it to construct a full sampling
distribution. Imagine that we have collected data from a population of 1,150 high school
students on the number of times they were referred to the main office for disciplinary
reasons. The lowest number of disciplinary referrals was zero, and the highest was three.
Below is the frequency distribution of disciplinary referrals and the probability that a
randomly selected student would have been referred that number of times (calculated as
described above).

            Zero     One      Two      Three    Total

Frequency   450      300      180      220      1150
p           .391     .261     .157     .191     1.000

Now imagine we take a random sample of two students from this population. We will
learn later that our analyses usually require samples of 30 or more, but to keep the
process simpler and easier to follow, we’ll imagine sampling only two. What are the pos-
sible combinations we might get? We could get: 0, 0; 0, 1; 0, 2; 0, 3; 1, 0; 1, 1; 1, 2; 1, 3; 2,
0; 2, 1; 2, 2; 2, 3; 3, 0; 3, 1; 3, 2; 3, 3. Our next question might be: what is the probability
of obtaining each of these samples? We can calculate that, using the probability formula
we discussed earlier. For example, if the probability of the random student having no
referrals is .391, and the probability of the random student having one referral is .261,
what is the probability of drawing two students randomly and the first having zero refer-
rals and the second having one referral? We would calculate this as p(zero & one) =
(p(zero))(p(one)) = (.391)(.261) = .102. So, there is roughly a 10% chance of getting that
particular combination. We can use this same process to determine the probability of all
possible samples:

Possible Sample   p                      M

0, 0              (.391)(.391) = .153    (0 + 0)/2 = 0.0
0, 1              (.391)(.261) = .102    (0 + 1)/2 = 0.5
0, 2              (.391)(.157) = .061    (0 + 2)/2 = 1.0
0, 3              (.391)(.191) = .075    (0 + 3)/2 = 1.5
1, 0              (.261)(.391) = .102    (1 + 0)/2 = 0.5
1, 1              (.261)(.261) = .068    (1 + 1)/2 = 1.0
1, 2              (.261)(.157) = .041    (1 + 2)/2 = 1.5
1, 3              (.261)(.191) = .050    (1 + 3)/2 = 2.0
2, 0              (.157)(.391) = .061    (2 + 0)/2 = 1.0
2, 1              (.157)(.261) = .041    (2 + 1)/2 = 1.5
2, 2              (.157)(.157) = .025    (2 + 2)/2 = 2.0
2, 3              (.157)(.191) = .030    (2 + 3)/2 = 2.5
3, 0              (.191)(.391) = .075    (3 + 0)/2 = 1.5
3, 1              (.191)(.261) = .050    (3 + 1)/2 = 2.0
3, 2              (.191)(.157) = .030    (3 + 2)/2 = 2.5
3, 3              (.191)(.191) = .036    (3 + 3)/2 = 3.0

Notice that we can also calculate a mean for each sample. Next, we might want to know
the probability of getting a random sample of two from this population with a given
mean. For example, what is the probability of randomly selecting two students, and their
mean number of referrals being 1.0? To calculate this, we will use another formula intro-
duced earlier in this chapter. As we look at the sample means, there are three samples
that have a mean of 1.0 (0, 2; 2, 0; and 1, 1). So, we calculate p(0,2 OR 2,0 OR 1,1) = p(0,2)
+ p(2,0) + p(1,1) = .061 + .061 + .068 = .190. There is about a 19% chance that a random
sample of two from this population will have a mean of 1.0. Below are the calculations
for all possible means from a sample of two:

Mean   Samples                 p

0.0    0,0                     .153
0.5    0,1; 1,0                .102 + .102 = .204
1.0    0,2; 1,1; 2,0           .061 + .068 + .061 = .190
1.5    0,3; 1,2; 2,1; 3,0      .075 + .041 + .041 + .075 = .232
2.0    1,3; 2,2; 3,1           .050 + .025 + .050 = .125
2.5    2,3; 3,2                .030 + .030 = .060
3.0    3,3                     .036

Notice that if we add up all these probabilities, the total is 1.0. This set of possible means
represents all possible outcomes of a random sample of two, so the total of the probabil-
ities will add to 1.0. We can also display these probabilities as a histogram.
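For readers who would like to check these calculations, the following Python sketch enumerates the same sampling distribution. It uses the exact fractions (for example, 450/1150) rather than the rounded probabilities, so its results match the tables above up to rounding. This is purely illustrative; it is not something jamovi asks you to do.

    # Enumerate all samples of size two (with replacement) and collect the
    # probability of each possible sample mean.
    from itertools import product
    from collections import defaultdict

    referral_probs = {0: 450 / 1150, 1: 300 / 1150, 2: 180 / 1150, 3: 220 / 1150}

    mean_probs = defaultdict(float)
    for first, second in product(referral_probs, repeat=2):
        p_sample = referral_probs[first] * referral_probs[second]   # p(A and B)
        mean_probs[(first + second) / 2] += p_sample                # add up samples with the same mean

    for mean in sorted(mean_probs):
        print(f"M = {mean:.1f}: p = {mean_probs[mean]:.3f}")
    print("Total:", round(sum(mean_probs.values()), 3))             # all possible outcomes sum to 1.0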

Central limit theorem and sampling distributions


As we mentioned earlier, these sampling distributions become more complex as the
sample sizes increase and the number of response options increases. There might be
millions of possible samples in some situations. As the number of samples in a sampling
distribution increases, several things start to happen. These trends are described by
something called the Central Limit Theorem, and include that:

• As the sample size increases, the sampling distribution will become closer and
closer to a normal distribution.
• As the sample size increases, the mean of the sampling distribution (that is, the
mean of all possible samples, often called the “mean of means”) will become closer
and closer to the population mean.
• Sample sizes over 30 will tend to produce a normal sampling distribution and a
mean of means that approximates the population mean.

The practical importance of this is that, because all the tests included in this book require
normal distributions, the minimum sample size will generally be 30, and we prefer larger
sample sizes.
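A small simulation can make these claims tangible. The Python sketch below repeatedly draws random samples (with replacement) from the referral population above and shows that, as the sample size grows, the mean of the sample means settles on the population mean. The number of simulated samples (10,000) is an arbitrary choice for illustration.

    # Simulation sketch of the Central Limit Theorem using the referral population.
    import random

    random.seed(1)
    population = [0] * 450 + [1] * 300 + [2] * 180 + [3] * 220   # 1,150 students
    pop_mean = sum(population) / len(population)

    for n in (2, 10, 30, 100):
        sample_means = [sum(random.choices(population, k=n)) / n for _ in range(10_000)]
        mean_of_means = sum(sample_means) / len(sample_means)
        print(f"n = {n:3d}: mean of sample means = {mean_of_means:.3f} "
              f"(population mean = {pop_mean:.3f})")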
In practice, it is rarely necessary to hand-calculate a sampling distribution, as we have
done above. Instead, we usually use known or theoretical sampling distributions, like
the z, t, and F distributions we’ll encounter in later chapters. However, those known or
theoretical sampling distributions work on the same mathematical principles. They also
produce the same result: they let us calculate the probability of getting a sample with
certain characteristics (e.g., a sample with a certain mean).

NULL HYPOTHESIS SIGNIFICANCE TESTING


Now that we have explored types of variables, probability theory, and sampling distribu-
tions, we are ready to talk about null hypothesis significance testing (NHST). In NHST,
we test the probability of an observed score or observed difference against an assumed
null difference. This can be initially confusing to think about. One way of describing this
is: if we assume we live in a world where the null hypothesis is true, how likely are we
to observe a difference of this size? As such, NHST starts with a sampling distribution
where the mean of means (the mean of all samples in the sampling distribution) is zero.
Based on our example above, we know it is fairly likely that we will get a sample mean
that is different from that population mean. In the case of the NHST, we know that it is
fairly likely we will get a score or difference that is not zero. However, what is the proba-
bility associated with our outcome?

Understanding the logic of NHST


As mentioned above, for NHST, we work with known sampling distributions. Those dis-
tributions usually have a mean of means that is zero, and we use them to test differences
from zero. In our above example, we were able to determine the probability of getting a
sample with a given mean. In NHST, we will do the same but will typically test the prob-
ability of getting a sample with a given mean difference from some test value. In the next
chapter, we will use this to test the difference of a sample mean from a population mean.
Most of the tests covered in this book involve testing the difference between two or more
sample means. However, these tests all work on the same logic—given a sampling distri-
bution, what is the probability of a difference this large?
This part of NHST can initially be confusing. Many students want the NHST to be a
direct test of the null hypothesis. It is not. The NHST is not a direct test of any hypothe-
sis. Instead, the NHST assumes the null hypothesis is true (and thus assumes a sampling
distribution where the mean of means is zero) and tests the probability of the observed
value. It asks the question: if the true difference in the population is zero, how likely is a
difference of this size to occur? Put another way: in a world where there is no difference,
what is the probability of getting a sample with this large of a difference?
In general, in educational research, we set the threshold for deciding to reject the null
hypothesis at p < .050. So, if the probability of the observed difference is less than .050,
we reject the null hypothesis. If it is greater than or equal to .050, we fail to reject the
null hypothesis. Notice that these are the only two options in NHST: rejecting or failing
to reject the null hypothesis. When the probability of the observed outcome occurring
if the true difference were zero (null) is low (less than .050), we conclude that the null
hypothesis is not a good explanation for the data. It is unlikely we would see a difference
this big if there were no real difference, so we decide it is not correct to believe there is
no real difference. If the probability is relatively high (greater than or equal to .050) that
we would observe a difference of this size if the true difference were zero, we conclude
there is not enough evidence to reject the null as a plausible explanation (we fail to reject
the null).
We will return to this logic of the NHST again in the next chapter, where we will have
our first statistical test. For now, it is important to be clear that the NHST tests the proba-
bility of our data occurring if the null were true and is not a direct test of the null or alter-
native hypothesis. It tells us how likely our data are to occur in a world where the null
hypothesis is true. If p = .042, for example, we would expect to find a difference that large
about 4.2% of the time in a world where the null hypothesis was true. Another important
note: our decision to make the cutoff .050 is completely arbitrary. It’s become the norm
in educational research and in social sciences in general, but there is no reason for .050
versus any other number. Many scholars have pointed out that NHST has shortcomings,
in part because we work with an arbitrary cutoff of .050, and in part because testing for
differences greater than zero is a low bar. We know that very few things naturally exist in
states of zero difference, so testing for differences greater than zero means that in a large
enough sample, we almost always find non-zero differences.
One final note about language before we review the types of error that occur in NHST.
It is typical to describe a finding where we reject the null hypothesis as “significant” and
to call it “nonsignificant” when we fail to reject the null hypothesis. Because of this, the
word “significant” takes on a special meaning in quantitative research. In writing about
quantitative research, it is important to avoid using “significant” for anything other than
describing an NHST result. In common speech and writing, the word “significant” can
mean lots of other things, but in quantitative research, it takes on this very particular
meaning. As a general rule, the word “significant” should be followed by reporting a
probability in this kind of writing. In all other cases, default to synonyms like “impor-
tant,” “meaningful,” “substantial,” and “crucial.”

Type I error
As we discussed above, typically in educational and social research, we set the criterion
value at p < .050 and reject the null hypothesis when the probability of our data is
below that threshold. This value is sometimes referred to as α (or alpha), and it repre-
sents our Type I error rate. A Type I error occurs when we reject the null hypothesis
when it was actually correct. In other words, a Type I error is when we conclude there is
a significant difference, but there is no real difference. By setting α = .050, we are setting
the Type I error rate at 5%. We expect to make a Type I error about 5% of the time using
this criterion. Type I error is the more serious kind of error, as this kind of error means
we have claimed a difference that does not really exist. Type I error is also the only kind
of error that we directly control (by setting our criterion probability or α level).

Type II error
A Type II error occurs when we conclude there is no significant difference (fail to reject
the null hypothesis), but there is actually a difference. Perhaps the difference is small,
and our test led us to conclude it was too small to be significant. Sometimes researchers
refer to the Type II error rate as being 1 − α, so that when α = .050, the Type II error
rate would be .950. This is misleading because it is not as though we would expect to
make a Type II error 95% of the time; the Type II error rate (often labeled β) is not
determined by α alone, so that formula is more confusing than useful. The way to
decrease our chances of a Type II error is not to increase α, after all. Instead, we protect
against Type II error by having a sufficiently large sample size and a robust measurement
strategy.
To put it in the simplest terms, there are two kinds of errors we might make in an
NHST:

• Type I: Rejecting the null hypothesis when there is no real difference.


• Type II: Failing to reject the null hypothesis when there is a real difference.
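A simulation can also illustrate what a 5% Type I error rate means in practice. The Python sketch below (which assumes NumPy and SciPy are installed) draws pairs of groups from the same population, so the null hypothesis is true every time, and then counts how often an independent samples t-test (a test introduced later in this book, used here only as a convenient NHST) rejects the null at α = .05. The rejection rate comes out near .050.

    # Simulation sketch: with a true null and alpha = .05, we falsely reject
    # the null (a Type I error) in roughly 5% of samples.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_simulations = 10_000
    rejections = 0
    for _ in range(n_simulations):
        group_a = rng.normal(loc=50, scale=10, size=30)   # both groups come from
        group_b = rng.normal(loc=50, scale=10, size=30)   # the same population
        _, p = stats.ttest_ind(group_a, group_b)
        if p < .05:
            rejections += 1

    print(f"Observed Type I error rate: {rejections / n_simulations:.3f}")   # about .050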

Limitations of NHST
All of the tests presented in the remainder of this textbook are null hypothesis signif-
icance tests—most researchers in educational and social research who do quantitative
work conduct NHST. Still, NHST, and, in particular, the use of p thresholds or criterion
values, have been the subject of much debate. Cohen (1994) suggested that the use of
NHST can lead to false results and overconfidence in questionable results. The American
Statistical Association has long advocated against the use of probability cutoffs (Wass-
erstein & Lazar, 2016), and has continued to push for more use of measures of magni-
tude and effect size. In response, many educational and social researchers have become
more critical of the use of probability values and NHST. The approach is deeply limited:
NHST assumes a zero difference, does not produce a direct test of the null or alterna-
tive hypothesis, and (as we will discover in future chapters) p can be artificially reduced
simply by increasing sample size. In other words, NHST is easily misinterpreted and can
also be “gamed.” As a result, we advocate in this text, as many publishers and journals
do, for the interpretation of p values alongside other measures of magnitude and effect
size. In fact, the APA (2020) publication manual is explicit in requiring NHST be paired
with measures of effect size or magnitude. With each NHST, we will also use at least
one such effect size indicator, and we will discover that not all significant differences are
meaningful.

Looking ahead at one-sample tests


So far, we have reviewed the basics of probability theory and sampling distributions and
described how null hypothesis significance tests (NHST) use sampling distributions to
produce a probability value. We have also discussed null and alternative hypotheses, and
Type I and Type II error. In the next chapter, we will introduce the first applied statistical
test of this book. Specifically, we will introduce one-sample tests. One-sample tests test
for differences between a sample and a population or criterion value. These tests are
not particularly common in published research or applied use, but they do have some
practical applications. We begin with them because they serve as a useful way to learn
about null hypothesis tests and to understand the logic and mechanics of NHST.

Notes
1 For our purposes, all examples are given using sampling with replacement. We made this
choice because the sampling distribution logic, to which this section builds, uses sampling
with replacement calculations. It’s reasonable to think, though, about the classroom example
with a sampling without replacement calculation. When we draw the first name, if we don’t
replace that name before drawing the second name, the sample space would be reduced by
one. As a result, for our second probability calculation, we would calculate out of a sample
space of 29. However, for our purposes, we will always assume sampling with replacement,
so such adjustments are not needed.
2 For our purposes, we’re calculating “or” probabilities, including the probability of both
events occurring. In other words, the probability we calculate for rolling a 6 on at least one of
the two dice includes the probability that both dice will come up 6. So, we’ve phrased this as
getting 6 on at least one of the two rolls.
5
Comparing a single sample to the population using the one-sample Z-test and one-sample t-test

The one-sample Z-test
Introducing the one-sample Z-test
Design considerations
Assumptions of the test
Calculating the test statistic
Calculating and interpreting effect size estimates
Interpreting the pattern of results
The one-sample t-test
Introducing the one-sample t-test
Design considerations
Assumptions of the test
Calculating the test statistic
Calculating and interpreting effect size
Interpreting the pattern of results
Conclusion

In the previous chapter, we explored the basics of probability theory and introduced null
hypothesis significance testing (NHST). In this chapter, we will go further with NHST
and learn two one-sample NHSTs. Neither of these one-sample tests is especially com-
mon in applied research. We teach them here as they are easier entry points to learning
NHST that help us build toward more commonly used tests. There are, though, prac-
tical uses for each of these two tests, and we will present realistic scenarios for each.
One-sample tests, in general, compare a sample to the population or compare a sample
to a criterion value. That is why they are less common in applied work:
researchers rarely have access to population statistics and usually compare two or more
samples. However, in the event that a researcher had population statistics or criterion
values, the one-sample tests could be used for that comparison.


THE ONE-SAMPLE Z-TEST


The first one-sample test we will introduce is the one-sample Z-test. This test allows
a comparison of a sample to the population, given that we know means and standard
deviations for both the sample and the population. This test is uncommon in practical
research but is based on the formula for standard scores (Z scores), so it will feel familiar
from the previous chapter.

Introducing the one-sample Z-test


The one-sample Z-test answers the following question: is the sample mean different from
the population mean? As such, it has the following null and alternative hypotheses:
H0: M = μ
H1: M ≠ μ
In this set of hypotheses, the null is that the sample and population means are equal,
and the alternative is that they are not equal. These nondirectional hypotheses are called
two-tailed tests because we are not specifying if we expect Z to be positive or negative
(the sample or the population to be higher), and so are testing in both “tails” of the Z
distribution.
However, for this test, we can specify a directional hypothesis if it is appropriate. For
example, we could specify in the alternative hypothesis that we expect the sample mean
to be higher than the population mean:
H0: M ≤ μ
H1: M > μ
We could also specify the opposite directionality—suggesting we expect the sample
mean to be lower than the population mean:

H0: M ≥ μ
H1: M < μ

These directional hypotheses (hypotheses that specify a particular direction for the dif-
ference) are called one-tailed tests. That is because, by specifying a direction of the differ-
ence, we are only testing in the positive or the negative “tail” of the Z distribution.

Design considerations
This design is fairly rare in applied research. The reason is that the test requires that we
know the population mean and population standard deviation. This is fairly uncom-
mon—we almost always know or can calculate the sample mean and standard deviation.
But it’s trickier to do that for a population. In fact, much of the work that is done in
quantitative research is done because the population data are inaccessible. We very rarely
have access to the full population, which we would need in order to know its standard
deviation. But in situations where the population standard deviation is known, we can
calculate this test to compare a sample to the population.
Again, this test is not used frequently in applied research, but we will imagine a scenario
where it might be. Imagine that we gain, from a college entrance exam company, informa-
tion about all test-takers in a given year. They might report that, out of everyone who took
the college entrance exam that year, the mean was 21.00, with a standard deviation of 5.40.
The test scores range from 1 to 36. Imagine that we work in the institutional research office
for a small university that this year admitted 20 students. Those 20 admitted students had
an average entrance exam score of 23.45, with a standard deviation of 6.21. The university
might be interested in knowing if its group of incoming students had higher scores than
the population of test-takers. In this case, the university hopes to make a claim of a high-
er-than-normal test score for the purposes of marketing. Of course, there is considerable
debate among researchers about the value and importance of those scores, but it is also
true that many universities use them as a way to market the “quality” of their students.

Assumptions of the test


We will discover throughout this text that each test has its own set of assumptions. In
general, the assumptions should be met in order to use the test as it was intended. There
are two kinds of assumptions: statistical and design assumptions. Statistical assumptions
deal with the kinds of data we are testing and usually need to be more closely met in
order to use a test at all. Design assumptions usually bear on what kinds of inferences we
can draw from a test, so being further off from those will mean more limited or qualified
kinds of conclusions. For the one-sample Z-test, the assumptions are:

• Random sampling. The test assumes that the sample has been randomly drawn
from the population. As we’ve already explored in this text, this assumption is
rarely, if ever, met because true random sampling is nearly impossible to perform.
But if we hope to generalize about a result, then the adequacy of sampling methods
is important. For example, in the scenario we described above, generalizability is
of limited importance. In fact, the university in our scenario hopes to demonstrate
their students are not representative of the population. But in other cases, it might
be that the researchers need their sample to be representative of the population so
that they can generalize their results.
• Dependent variable at the interval or ratio level. The dependent variable must be
continuous. That means it needs to be measured at either the interval or ratio level.
In our scenario, entrance exam scores are measured at the interval level (the same
interval between each point, with no true absolute zero).
• Dependent variable is normally distributed. The dependent variable should also
be normally distributed. We have introduced the concept of normality, and how to
assess it using skewness and kurtosis statistics. In future chapters, we will practice
evaluating this assumption for each test, but in this chapter, we will focus on the
mechanics of the test itself.

Many of these assumptions will show up in future tests, as well.

Calculating the test statistic


The one-sample Z-test is relatively simple to calculate. It is calculated based on the fol-
lowing formula:
Z = (M − μ) / (σ / √N)

The numerator is the mean difference, and the denominator is the standard error of the
population mean. In our example, the sample mean (M ) was 23.45, population mean (μ)
was 21.00, population standard deviation (σ) was 5.40, and the sample size (N ) was 20.
So we can calculate Z for our example as follows:
Z = (M − μ) / (σ / √N) = (23.450 − 21.000) / (5.400 / √20) = 2.450 / (5.400 / 4.472) = 2.450 / 1.208 = 2.028

So, for our example, Z = 2.028. We can use that statistic to determine the associated
probability. In Table A1, the third column shows the one-tailed probability for each value
of Z. At Z = 2.02, p = .022. (Note that we always round the test statistic down when comparing
to the table; this slightly inflates p, which is the more conservative choice and guards against Type I error.) That is
the one-tailed probability, which in this scenario is what we need because the hypothesis
was directional. If it had not been a directional hypothesis, though, we would simply
double that probability, so the two-tailed probability would have been p = .044. Finally,
because p < .050 (.022 < .050), we reject the null hypothesis and conclude there is a sig-
nificant difference between these 20 incoming students and all test-takers nationwide.
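The same calculation, and the exact probabilities rather than the tabled values, can be reproduced with a few lines of Python (assuming SciPy is available for the normal distribution). The small differences from the values above reflect rounding the test statistic down to 2.02 for the printed table.

    # One-sample Z-test for the entrance exam example.
    from math import sqrt
    from scipy.stats import norm

    M, mu, sigma, N = 23.45, 21.00, 5.40, 20

    z = (M - mu) / (sigma / sqrt(N))
    p_one_tailed = norm.sf(z)            # area in the upper tail beyond z
    p_two_tailed = 2 * p_one_tailed

    print(f"Z = {z:.3f}")                         # about 2.029
    print(f"one-tailed p = {p_one_tailed:.3f}")   # about .021
    print(f"two-tailed p = {p_two_tailed:.3f}")   # about .042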

Calculating and interpreting effect size estimates


One of the problems with null hypothesis significance tests is that they only tell us
whether or not a difference exists. They do not provide any indication of whether that
difference is large or how large the difference is. As a result, we will always want to sup-
plement any null hypothesis significance test with an estimate of effect size. These esti-
mates give a sense of the magnitude of the difference. Our first effect size estimate will be
Cohen’s d, which is simple to calculate:
d = (M − μ) / σ

Note that this formula only works for the one-sample Z-test. Each test has its own for-
mula for effect size estimates. So, for our example:
d = (M − μ) / σ = (23.450 − 21.000) / 5.400 = 2.450 / 5.400 = 0.454

Cohen’s d is not bounded to any range—so the statistic can be any number. However,
in cases where d is negative, researchers typically report the absolute value (drop the
negative sign). It can also be above 1.0 or even 2.0, though those cases are relatively
uncommon. There are some general interpretative guidelines offered by Cohen (1977),
which suggest that d of .2 or so is a small effect, .5 or so is a medium effect, and .8 or
so is a large effect. It is important to know, though, that Cohen suggested those as start-
ing points to think about, not as any kind of rule or cutoff value. In fact, the best way to
interpret effect size estimates is in context with the prior research in an area. Is this effect
size larger or smaller than what others have found in this area of research? In our case,
we will likely call our effect size medium (d = .454).
Sometimes, researchers find extremely small effect sizes on significant differences.
That is especially likely if the sample size is very large. In those cases, researchers might
describe a difference as being significant but not meaningful. That is, a difference can be
more than zero (a significant difference) but not big enough for researchers to pay much
attention to it (not meaningful). One of the questions to ask about effect size is whether
a difference is large enough to care about it.
Finally, a note about the language of effect size: although researchers use language like
“medium effect” or “large effect,” they mean that there is a statistical effect of a given size.
This should not be confused with making a cause-effect kind of claim, which requires
other evidence, as we’ve discussed in this text elsewhere.

Interpreting the pattern of results


Finally, we would want to interpret the pattern of differences in our results. There are two
ways to do this in a one-sample Z-test. Based on the Z-test result, we already know the
sample and the population are significantly different. We could look at the means and see
that 23.45 is higher than 21.00, so the sample mean is higher than the population mean.
We could also determine that from the Z value itself—when it is positive, the sample
mean was higher, and when negative, the population mean was higher.

THE ONE-SAMPLE T-TEST


However, it is, as we have mentioned already in this chapter, very rare to actually know
the population standard deviation. That is where the one-sample t-test comes in—it
can compare a sample to the population even if the population standard deviation is
unknown. We will discover in future chapters that the t-test is a very versatile and useful
test. But, in this chapter, we introduce it in its simplest form—the one-sample t-test.

Introducing the one-sample t-test


This test has the same possible hypotheses as the one-sample Z-test. The main difference
is that it does not require the population standard deviation. Because the population
standard deviation is unknown, we cannot calculate the standard error of the population
mean for the denominator, as we did in the Z-test. Instead, we use an estimate of the
standard error of the mean. As a result, we cannot use the Z distribution, and will instead
use a new test distribution: the t distribution. We will discuss more about this distribu-
tion in the next chapter. For now, it is enough to know that the t distribution is shaped
much like the Z distribution in that it is a symmetric, bell-shaped sampling distribution. However, the t
distribution’s shape varies based on sample size. It uses something called degrees of free-
dom, which we’ll explore in more depth in the next chapter. However, for the one-sample
t-test, the degrees of freedom are sample size minus one (N − 1).
Notice in Table A2 that the t table includes only “critical values”—not exact probabili-
ties. The critical values listed are the values of t at which p is exactly .050. So, if the abso-
lute value of t (ignoring any negative/positive sign) is more than the value in the table,
p is less than .050 (p gets smaller as t gets bigger). As a result, if the value in the table is
less than the absolute value of the test statistic, we reject the null hypothesis. Researchers
often call this “beating the table”—if the calculated test statistic “beats” the table value,
then they reject the null hypothesis.
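For readers who prefer not to rely on a printed table, critical values like these can be computed directly. The Python sketch below (assuming SciPy is installed) prints the one-tailed critical value at α = .05 for a few degrees of freedom and shows how the value shrinks toward the Z-based critical value as the sample grows.

    # One-tailed critical values of t at alpha = .05 for several degrees of freedom.
    from scipy.stats import t

    for df in (5, 19, 30, 120):
        critical = t.ppf(0.95, df)   # the value of t with 5% of the distribution above it
        print(f"df = {df:3d}: critical t = {critical:.3f}")   # e.g., df = 19 gives about 1.729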

Design considerations
The design considerations are essentially the same for this design as compared with the
one-sample Z-test. However, there is one situation where this test can be used while Z
cannot. Because the t-test does not require population standard deviation, it can be used
to test a sample against some criterion or comparison value. For example, it could be
used to test if a group of students has significantly exceeded some minimum required
test score. It could also be used to compare against populations with a known mean but
not a known standard deviation.

Assumptions of the test


The assumptions of the one-sample t-test are identical to those of the one-sample Z-test
with no meaningful differences in their application. So, the same assumptions apply:
random sampling; a dependent variable at the interval or ratio level; and a normally
distributed dependent variable.

Calculating the test statistic


The test statistic formula has only one difference from Z: the population standard devia-
tion (σ) is swapped for the sample standard deviation (s):

t = (M − μ) / (s / √N)

Imagine that in our earlier example, we had not known the population standard devia-
tion. We could then use the t-test to compare our sample of students to all test-takers as
follows:

t = (M − μ) / (s / √N) = (23.450 − 21.000) / (6.210 / √20) = 2.450 / (6.210 / 4.472) = 2.450 / 1.389 = 1.764

Because the sample size was 20, there will be 19 degrees of freedom (N − 1 = 20 − 1 =
19). On the t table, we find that at 19 degrees of freedom, the critical value for a one-
tailed test is 1.73. Remember that our test is one-tailed because we specified a directional
hypothesis (that this group of students would have higher test scores than the popula-
tion of test-takers). Because our calculated value is higher than the tabled value (1.764
> 1.73), we know that p < .05, so we reject the null hypothesis and conclude there was a
significant difference.
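Because only summary statistics are given in this example (SciPy's ttest_1samp function expects raw scores), the Python sketch below simply computes t by hand and looks up its one-tailed probability from the t distribution. It assumes SciPy is installed.

    # One-sample t-test for the entrance exam example, from summary statistics.
    from math import sqrt
    from scipy.stats import t as t_dist

    M, mu, s, N = 23.45, 21.00, 6.21, 20
    df = N - 1

    t_value = (M - mu) / (s / sqrt(N))
    p_one_tailed = t_dist.sf(t_value, df)

    print(f"t({df}) = {t_value:.3f}")             # about 1.764
    print(f"one-tailed p = {p_one_tailed:.3f}")   # about .047, so p < .05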

Calculating and interpreting effect size


For this test, the formula for Cohen’s d changes slightly:

d = (M − μ) / s

This simply replaces the population standard deviation with sample standard deviation,
which is the same substitution as in the t-test formula. So, for our example:

d = (M − μ) / s = (23.450 − 21.000) / 6.210 = 2.450 / 6.210 = 0.395

Notice this is a slightly smaller effect size estimate than it was in the Z-test. This is com-
mon, as the sample standard deviation will usually be larger than the population standard deviation.

Interpreting the pattern of results


Finally, the t-test would be interpreted very similarly to the Z-test in this situation. A
positive t means the sample mean was higher, while a negative t means the sample mean
was lower than the population mean. It is also fine to simply compare the sample and
population means since the t-test result determines the difference is significant, and we can
plainly see that 23.45 is more than 21.00—the sample mean was higher than the popu-
lation mean.

CONCLUSION
Although neither of the tests we introduced in this chapter is very common in applied
research, they work the same way, conceptually, as all of the other tests we will explore in
this text. In all cases, these tests will take some form of between-groups variation (here,
it was the difference between the population and sample means) over the within-groups
variation or error (here, the standard error of the mean). How exactly we calculate those
two terms will change with each new design, but the basic idea remains the same. In our
next chapter, we will explore one of the most widely used statistical tests in educational
and psychological research: the independent samples t-test.
Part III
Between-subjects designs

6
Comparing two sample means

The independent samples t-test

Introducing the independent samples t-test
The t distribution
Research design and the independent samples t-test
Assumptions of the independent samples t-test
Level of measurement for the dependent variable is interval or ratio
Normality of the dependent variable
Observations are independent
Random sampling and assignment
Homogeneity of variance
Levene's test
Correcting for heterogeneous variance
Calculating the test statistic t
Calculating the independent samples t-test
Using the t critical value table
One-tailed and two-tailed t-tests
Interpreting the test statistic
Effect size for the independent samples t-test
Calculating Cohen's d
Calculating ω²
Interpreting the magnitude of difference
Determining how groups differ from one another and interpreting the pattern of group differences
Computing the test in jamovi
Writing Up the Results
Notes

INTRODUCING THE INDEPENDENT SAMPLES t-TEST


This chapter introduces the first group comparison test covered in this text: the independ-
ent samples t-test. This test allows for the comparison of two groups or samples. There are
many research scenarios that might fit this analysis. Imagine an instructor who teaches
the same course online and face-to-face. That instructor might wonder about student
achievement in each of those versions of the class. Perhaps the format of the course
(online versus face-to-face) makes a difference in how well students learn (as measured
by a final exam). While this scenario presents several design limitations (which we will
discuss later in this chapter), the instructor could use an independent samples t-test to
evaluate whether students in the two versions of the course differ in their exam scores.
A more typical example for the independent samples t-test involves an experimental
group and a control group, compared on some relevant outcome. Imagine that an educa-
tional consultant comes to a school, advertising that their new video-based modules can
improve mathematics exam performance dramatically. The system involves assigning
students to watch interactive video modules at home before coming to class. To test the
consultant’s claims, we might randomly assign students to either complete these new
video modules at home or to spend a similar amount of time completing mathematics
worksheets at home. After a few weeks, we could give a mathematics exam to the stu-
dents and compare their results using an independent samples t-test.

The t distribution
As we discovered when we explored the one-sample t-test in the previous chapter, t-test
values are distributed according to the t distribution. The t distribution is a sampling
distribution for t-test values and allows precise calculation of the probability associated
with any given t-test value. We also described how the shape of the t distribution changes
based on the number of degrees of freedom. When we explored the one-sample t-test,
we said that there would always be n − 1 degrees of freedom. In the independent samples
t-test, there will be n − 2 degrees of freedom. As with the one-sample test, we use the t
distribution table to look up the critical value at a given number of degrees of freedom
and alpha or Type I error level (usually .05 for social and educational research). If the
absolute value of our t-test value exceeds the critical value, then p < .05 and we can reject
the null hypothesis. Of course, it is also possible to calculate the exact probability, or p,
values, and software like jamovi will produce the exact probability value. It is typical to
report the exact probability value when writing up the results.

RESEARCH DESIGN AND THE INDEPENDENT SAMPLES t-TEST


As a way of thinking about design issues in the independent samples t-test, we will return
to the two examples offered at the beginning of this chapter and think through the design
limitations of each. For each example, we will think about design validity in terms of
both internal validity (that is, the trustworthiness of the results) and external validity
(that is, the generalizability of the results).
Our first example, with an instructor who teaches online and face-to-face versions
of the same class, presents us with several design challenges. One is related to the use of
intact groups. That is, in most institutions that offer both face-to-face and online classes,
students make decisions about which class they want to take. In other words, students
self-select into online and face-to-face instruction. A number of factors might drive that
decision, such as convenience, scheduling, and distance. An adult student who is work-
ing full-time, has children at home, and/or has other non-academic obligations might
choose the online class for the sake of convenience and ability to work around their
schedule. A traditional student, attending school full-time without outside employment,
might choose the face-to-face version of the class. Because of this self-selection, multiple
differences between the groups are built in from the start and cannot be attributed to the
course delivery mode. Students in the online class might be older, more likely to have
outside employment, and more likely to have multiple demands on their time. Students
in the face-to-face class might be younger (meaning less time has elapsed since prior
coursework, perhaps), have fewer employment and family obligations, and generally
have more free time to devote to coursework. If those differences exist, then a difference
in achievement cannot be fully attributed to the mode of course delivery.
In this case, we have an example where we cannot randomly assign participants to
groups. The groups are pre-existing, in this case by self-selection. Of course, many other
kinds of groups are intact groups we cannot randomly assign, like gender, in-state versus
out-of-state students, free-and-reduced-lunch status, and many others. A lot of those
intact categories are of interest for educational researchers. But because of the design
limitations inherent with intact groups, the inferences we can draw from such a com-
parison are limited. Specifically, we will not be able to claim causation (e.g., that online
courses cause lower achievement). We will only be able to claim association (e.g., that
online courses are associated with lower achievement). The distinction is important and
carries different implications for educational policy and practice.
In our second example described above, we have randomly assigned students to groups
(either to do video modules or the traditional worksheets). It is important to note that
the control group, in this case, is still being asked to do some sort of activity (in this case,
worksheets). That’s important because it could be that merely spending an hour a day
thinking about mathematics improves achievement and that it has nothing to do with
the video modules themselves. So, we assigned the control group to do an activity that is
typical, traditional, and should not result in gains beyond normal educational practice.
Because both groups of students will be doing something with mathematics for about the
same amount of time every day, we can more easily claim that any group differences are
related to the activity type (e.g., video or worksheet). This form of random assignment
will make it easier to make claims about the new videos and their potential impact on
student achievement than it was in the first example, where we had intact groups.
But there are still serious design challenges here. One issue is whether or not students
actually complete the worksheets or video modules. We would need a way to verify that
they completed those tasks (sometimes called a compliance check). Another possible
complication is with mortality, or dropout, rates. Specifically, in a case where we ran-
domly assigned participants to groups, we would be concerned if the dropout rates were
not equal between groups. That is, we might assume some students will transfer out of
the class or stop participating in the research study. But because of random assignment,
the characteristics related to that decision to drop out of the study should be roughly
evenly distributed in the groups, so the dropout rates should be about the same. But what
if about 20% of the video group leaves the study and only 5% of the worksheet group
leaves? That could indicate there is something about the video-based modules that is
increasing the rate of dropout and makes it harder to infer much from comparing the
two groups at the end of the study.
One final issue we will discuss, though there are many design considerations in both
examples, is experimenter bias. If teachers in the school know which students are doing
video modules and which are doing worksheets, that knowledge might change the way
they interact with students. If a teacher presumes the new content to be helpful, the
teacher might give more encouragement or praise to students in that group, perhaps
without being conscious of it. The opposite could be true, too, with a teacher assuming
the new video modules are no better than existing techniques and interacting with stu-
dents in a way that reflects that assumption. To be clear, in both cases, the teacher is likely
unaware they might be influencing the results. It is possible that someone involved with
a study might make more conscious, overt efforts to swing the results of a study, but that
is scientific misconduct and quite rare. On the other hand, unknowingly giving slight
nudges to study participants is more common and harder to account for.
Finally, before we move on to discuss the assumptions of the test, there are a few broad
considerations in designing research comparing two groups. One is that the independent
samples t-test is a large sample analysis. Many methodologists suggest a minimum of 30
participants per group (60 participants overall in a two-group comparison). Those groups
also need to be relatively evenly distributed—that is, we want about the same number of
people in both groups. This is built into a random assignment process, but when using
intact groups, it can be more challenging. The general rule is that the smaller group needs
to be at least half the size of the larger group. So, for example, if we have 30 students in a
face-to-face class and 45 in an online class, the samples are probably balanced enough.
If, however, we had 30 students face-to-face and 65 online, the imbalance would be too
great (30 is less than half of 65). However, we want the groups to be as close in size to
one another as possible. We will explore the reason for that a bit more in the section on
assumptions. As we discuss the assumptions of the independent samples t-test, some of
these design issues will become clearer, and we will introduce a few other issues to consider.

ASSUMPTIONS OF THE INDEPENDENT SAMPLES t-TEST


As we move into the assumptions of the t-test, it is important first to consider what we
mean by assumptions. In common speech, we might expect an assumption to be some-
thing that is probably true—and usually, we mean that the individual (such as a researcher)
has assumed something to be true. For the assumptions of a statistical test, though, it is
the test itself that is assuming certain things to be true. The tests were constructed with
a certain set of ideas about the data and the research design. If those things are not true
of our data or of our research design, we will have to make adjustments to our use and
interpretation of the test. Put simply: the test works as intended when these assumptions
are met and does not work quite so well when those assumptions are not met.

Level of measurement for the dependent variable is interval or ratio


The first assumption we will explore for the independent samples t-test relates to the
level of measurement for the dependent variable. The dependent variable must be meas-
ured at the interval or ratio level. In other words, the dependent variable has to be a
continuous variable. Hopefully, if we’ve been thoughtful in choosing a research design
and statistical test, we have considered the issue of levels of measurement long before
starting to evaluate the data. However, it is always worth pausing for a moment to ensure
this assumption is met. This is especially true because most statistical analysis software
will not alert you if there is a problem with the level of measurement and will produce
the test statistic regardless. However, the t-test is meaningless on a categorical (nominal
or ordinal) dependent variable. There is no test for this assumption—we simply have to
evaluate the research design and nature of the variables. This is also an assumption for
which we cannot correct—if the level of measurement for the dependent variables is
incorrect, the t-test simply cannot be used at all.

Normality of the dependent variable


Because this test, like all of the tests covered in this text, is a parametric test, it requires
that the dependent variable is normally distributed. Normally distributed variables are a
key requirement of all parametric tests, and deviations from normality can cause serious
issues in the t-test. We described testing for normality and evaluating indicators of nor-
mality, such as the skewness and kurtosis statistics, earlier in this text. The t-test is gener-
ally considered to be more sensitive to deviations from normality than some other tests
(it is less robust in this sense), but it is more sensitive to kurtosis, especially platykurtosis,
than to skew. The t-test can tolerate moderate deviations from normality without intro-
ducing much additional error. When evaluating the assumption of normality, the ideal
is that the normality statistics do not indicate any significant deviation from normality.
But if there is a slight deviation from normality on skewness (for example, the absolute
value of skewness is more than two times the standard error of skewness but less than
four times the standard error of skewness) or is slightly leptokurtic, it is probably safe to
proceed with the independent samples t-test. However, it will be important in the case of
any significant non-normality to note that deviation and the approach to thinking about
that deviation in the resulting manuscript/write-up.

Observations are independent


We have already discussed the requirement for groups or samples to be independent.
But the test also assumes that all observations are independent. To understand what this
means, let’s start with an example where observations are dependent. Imagine we give a
computerized test to third graders. Their school has fewer computers than students, so
the school has typically used a buddy system where children are paired up for computer
work. In this case, even if both children complete their own test, the fact that one child
might see the other’s answers or that they might discuss them creates the potential for
dependence. In another example, imagine we give surveys to college students that ask
about attitudes toward instruction. Several students, who sit near each other, discuss
their answers and talk about the questions on the survey. Their discussion has the poten-
tial to create dependence. This influence from one observation to the next is a violation
of the assumption of independence.
Another way this can happen is when data are nested. Imagine we want to assess stu-
dents’ perception of teachers who are men versus teachers who are women. We gather
data from 10 different classes, 5 taught by women, 5 by men. The students complete sur-
veys, but students are nested within teachers. That is, some of the variance in perception
of teachers is related to the individual teacher, and multiple students in the sample had
the same teacher. If we fail to account for the differences among the teachers, there is
systematic dependence in the data. These nested designs (such as in our case, where stu-
dents are nested within teachers who are nested within genders) violate the assumption
of independence in this statistical test. A more advanced test that accounts for nesting
would be needed to overcome the dependence among the observations.

Random sampling and assignment


As we have discussed in previous chapters, inferential tests like the independent samples
t-test arose from the positivist school of thought. These tests, then, were designed for use
in experimental research with the goal of determining cause–effect relationships. Because
of this, one of the assumptions of the independent samples t-test and many other tests
covered in this text is that participants have been randomly assigned to groups. As we
discussed earlier, random assignment is a strong practice that allows clearer inferences
about the nature of relationships. That is largely because random assignment, in theory,
randomizes participant differences so that the test conditions (e.g., experimental condi-
tion and control condition) should be the only systematic difference between groups. Of
course, much behavioral and educational research does not involve random assignment,
because of practical and ethical limitations.
When participants are not randomly assigned, and the research is not experimental
in nature, some of the language commonly used in the model becomes tricky. A clear
example is language about “effects.” People who use the t-test often write about treatment
effects, effect sizes, and the magnitude of effects. That language makes sense in a world
where the goal of research is to infer cause–effect relationships, and, as such, the design
is experimental with random assignment. But when the groups have not been randomly
assigned, as in the case of intact groups, that language no longer fits. As a result, we will
have to be very careful in writing about our results and making inferences about the
relationship between the independent and dependent variable.
Another issue in the randomness assumption is random sampling. The model here
assumes that we have not only randomly assigned our sample to groups, but also that the
sample represents a random sample from the population. As we discussed previously, it
is almost impossible to imagine a truly random sample because of issues like self-selec-
tion and sampling bias. Earlier chapters dealt with this issue more thoroughly, so here
we will simply reiterate that the degree to which the sample is biased limits the degree to
which results might be generalizable.

Homogeneity of variance
The independent samples t-test, and many of the other tests covered in this book, requires
that the groups have homogeneous variance. In other words, the variance of each group
is roughly the same. The idea here is that, because we will be testing for differences in the
means of the two groups, we need variances that are roughly equal. A mean difference is
less meaningful if the groups also differ widely in variance. Basically, we are suggesting
that the width of the two sample distributions is about the same—that they have similar
standard deviations. That similarity will mean the two samples are more comparable.

Levene’s test
The simple test for evaluating homogeneity of variance is Levene’s test. It can be pro-
duced by the jamovi software with the independent samples t-test. Levene’s test is dis-
tributed as an F statistic (a statistic we will learn more about in a later chapter). In the
jamovi output for the t-test, the software will produce the F statistic and a related prob-
ability (labeled p in the output). That probability value is evaluated the same way as
any other null hypothesis significance test. If p < .05, we will reject the null hypothesis. If
p ≥ .05, we will fail to reject the null hypothesis. However, it is very easy to be confused
by the interpretation of Levene’s test. The null hypothesis for Levene’s test is homogeneity
of variance, while the alternative hypothesis is heterogeneity of variance1:

$$H_0: s^2_{X_1} = s^2_{X_2}$$

$$H_1: s^2_{X_1} \neq s^2_{X_2}$$
Because of this, failing to reject the null hypothesis on Levene’s test means that the
assumption of homogeneity of variance was met. Rejecting the null hypothesis on Lev-
ene’s test means that the assumption of homogeneity of variance was not met. Put simply,
if p ≥ .05, the assumption of homogeneity of variance was met.
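As a rough cross-check outside of jamovi, Levene's test can also be run in R (which jamovi is built on) via the car package. This is a sketch, not jamovi's own code: it assumes the car package is installed, uses mean-centering to match the classic Levene's test (car's default, median-centering, is the Brown–Forsythe variant), and borrows the final exam data from the worked example later in this chapter.

```r
# Sketch: Levene's test via the car package (assumed installed)
library(car)

exam  <- c(85, 87, 83, 84, 81, 91, 89, 93, 94, 92)
class <- factor(rep(c("Online", "FaceToFace"), each = 5))

leveneTest(exam ~ class, center = mean)   # about F = 0.04, p = .84 for these data
```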

Correcting for heterogeneous variance


Luckily, in jamovi, there is a simple correction for heterogeneity of variance (i.e., if we
reject the null on Levene’s test). There is a simple checkbox to add this correction to the
output, called Welch’s test. We will discuss a bit more about how this correction works
later in this chapter. However, when we select this option, there will be two lines of out-
put in jamovi: one for Student's t-test (the standard, uncorrected t-test), and another
for Welch’s test. If our data fail the assumption of homogeneity of variance, we will check
the box for Welch’s test and use that line of output. The Welch’s correction works by
adjusting degrees of freedom so that the corrected output will feature degrees of freedom
that are not whole numbers, making it easy to spot the difference.
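For a sense of what that looks like outside of jamovi, here is a minimal base-R sketch using the final exam scores from the worked example later in this chapter. Note how the Welch version reports fractional degrees of freedom.

```r
# Sketch: Student's t-test versus Welch's correction in base R
online <- c(85, 87, 83, 84, 81)
f2f    <- c(91, 89, 93, 94, 92)

t.test(online, f2f, var.equal = TRUE)   # Student's t-test: df = 8
t.test(online, f2f, var.equal = FALSE)  # Welch's correction: fractional df (about 7.8);
                                        # this is actually R's default setting
```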
It is not uncommon for students, especially at first, to get a bit confused by the jamovi
output around the issue of homogeneity of variance. We will provide several examples of
how to read this output a bit later in this chapter.

CALCULATING THE TEST STATISTIC t


We’ll begin by presenting the independent samples t-test formula, and then we will
explain each element of the formula and how we get to it. As you’ll discover as we move
through the other tests covered in this text, most group comparisons function as a ratio
of between-groups variation to within-groups variation. Because of that, these tests are
essentially asking whether the difference between groups is greater than the differences
that exist within groups (a logic that will become clearer as we move through this and
other tests). For the t-test, that formulation is:
$$t = \frac{\bar{X} - \bar{Y}}{s_{\mathrm{diff}}}$$

In other words, t is equal to the mean difference between the two groups, over the stand-
ard error of the difference.

Calculating the independent samples t-test


The means are easy enough. We have two groups of participants, and we can calculate
the mean of each group. In the formulas, the first group will be labelled X and the second
group will be labelled Y. Let us return to our first example from earlier in this chap-
ter, with an instructor who teaches the same class online and face-to-face and wants to
know if there is a difference in final exam scores between the two versions of the course.
Imagine the instructor collects final exam scores and finds the following:2

Online Class Face-to-Face Class

Student Exam Score Student Exam Score

1 85 1 91
2 87 2 89
3 83 3 93
4 84 4 94
5 81 5 92

We can easily calculate a mean for each group. The online class will be group X, and the
face-to-face class will be group Y.

$$\bar{X} = \frac{\sum X}{N_X} = \frac{85 + 87 + 83 + 84 + 81}{5} = \frac{420}{5} = 84.00$$

$$\bar{Y} = \frac{\sum Y}{N_Y} = \frac{91 + 89 + 93 + 94 + 92}{5} = \frac{459}{5} = 91.80$$

In this case, there were five students in both classes, which is why the denominator is the
same in calculating both means. Returning to our t formula, we are already done with
the numerator!

$$t = \frac{\bar{X} - \bar{Y}}{s_{\mathrm{diff}}} = \frac{84.00 - 91.80}{s_{\mathrm{diff}}} = \frac{-7.80}{s_{\mathrm{diff}}}$$

Next, we will turn to the denominator.

Partitioning variance
In future chapters, the topic of partitioning variance will get more nuanced. Here, we have
only two sources of variance: between-groups (the mean difference), and within-groups
(standard error of the difference) variance. We discussed above the calculation of the mean
difference, which defines between-groups variance for the independent samples t-test. The
more complicated issue with this test is the within-groups variance, here defined by the
standard error of the difference. To understand where this number comes from, we will
start at the sdiff term, and work our way backward to a formula you already know. Then, to
actually calculate sdiff, we’ll work through this set of equations in the opposite order.
As we learned in a prior chapter, standard error and standard deviation are related
concepts, applied to different kinds of distributions. So, just as standard deviation was
the square root of sample variance, standard error is the square root of error variance.
We can express this for sdiff in the following way:
$$s_{\mathrm{diff}} = \sqrt{s^2_{\mathrm{diff}}}$$

That error variance (variance of the difference) is calculated by adding the partial vari-
ance associated with each group mean:

$$s^2_{\mathrm{diff}} = s^2_{M_X} + s^2_{M_Y}$$
That partial variance for each group mean is calculated by dividing the pooled variance
by the sample size of each group, so that:

$$s^2_{M_X} = \frac{s^2_{\mathrm{pooled}}}{N_X}$$

$$s^2_{M_Y} = \frac{s^2_{\mathrm{pooled}}}{N_Y}$$

The pooled variance is calculated based on the proportion of degrees of freedom coming
from each group multiplied by the variance of that group:

$$s^2_{\mathrm{pooled}} = \left(\frac{df_X}{df_{\mathrm{total}}}\right) s^2_X + \left(\frac{df_Y}{df_{\mathrm{total}}}\right) s^2_Y$$

And finally, we have already learned how to calculate the variance of each group:

$$s^2_X = \frac{\sum (X - \bar{X})^2}{N_X - 1}$$

$$s^2_Y = \frac{\sum (Y - \bar{Y})^2}{N_Y - 1}$$

Okay, that is a lot to take in all at once, and a lot of unfamiliar notation. So, we will pause
for a moment to explain a bit before reversing the order of the formulas and calculating
sdiff. If you follow the order of this from bottom-to-top, what is happening is we start with
the variance of each of the two groups. Those group variances represent within-groups
variation. We described this in an earlier chapter as giving a sense of the error associated
with the mean. However, for the t-test, we need a measure of overall within-groups vari-
ation, rather than a separate indicator for each group. To accomplish that, we go through
a series of steps to account for the proportion of participants coming from each group
and the variance of that group, to arrive at a pooled variance. That pooled variance then
gets adjusted again based on the sample size of each group (the first adjustment was for
degrees of freedom, not sample size), and finally gets combined as an indicator of with-
in-groups variation. Finally, we take the square root to get from variance to standard
error. Those concepts might become even clearer as we walk through the calculations
with our example.
Recall that, above, we calculated a mean final exam score of 84.00 for online students
and 91.80 for face-to-face students. We can use those means to calculate group variance,
using the same process we introduced in Chapter 3:

Online Class

Student    Score (X)    Deviation (X − X̄)    Squared Deviation (X − X̄)²
1          85           85 − 84 = 1           1² = 1
2          87           87 − 84 = 3           3² = 9
3          83           83 − 84 = −1          (−1)² = 1
4          84           84 − 84 = 0           0² = 0
5          81           81 − 84 = −3          (−3)² = 9
           Σ = 420      Σ = 0                 Σ = 20

So, for the online class (labelled X):

$$s^2_X = \frac{\sum (X - \bar{X})^2}{N_X - 1} = \frac{20}{5 - 1} = \frac{20}{4} = 5.00$$

Face-to-Face Class

Student    Score (Y)    Deviation (Y − Ȳ)      Squared Deviation (Y − Ȳ)²
1          91           91 − 91.8 = −0.8       (−0.8)² = 0.64
2          89           89 − 91.8 = −2.8       (−2.8)² = 7.84
3          93           93 − 91.8 = 1.2        1.2² = 1.44
4          94           94 − 91.8 = 2.2        2.2² = 4.84
5          92           92 − 91.8 = 0.2        0.2² = 0.04
           Σ = 459      Σ = 0                  Σ = 14.80

So, for the face-to-face class (labelled Y):

$$s^2_Y = \frac{\sum (Y - \bar{Y})^2}{N_Y - 1} = \frac{14.80}{5 - 1} = \frac{14.80}{4} = 3.70$$

Our next calculation requires us to provide the degrees of freedom from each group and
the total degrees of freedom. Recall from the previous chapter on the one-sample t-test
that we learned, for a single sample, df = n − 1. You might imagine, then, that if we want
to know how many degrees of freedom are contributed by group X (the online class),
we could use the same formula, finding the dfx = NX − 1 = 5 − 1 = 4. Similarly, for group
Y (the face-to-face class), we’d find the dfY = NY − 1 = 5 − 1 = 4. So, the total degrees of
freedom would be 4 + 4 = 8. You can also extrapolate from this that, for the independent
samples t-test, dftotal = ntotal – 2. Armed with that information, we are ready to calculate
the pooled variance:

$$s^2_{\mathrm{pooled}} = \left(\frac{df_X}{df_{\mathrm{total}}}\right) s^2_X + \left(\frac{df_Y}{df_{\mathrm{total}}}\right) s^2_Y = \left(\frac{4}{8}\right)(5) + \left(\frac{4}{8}\right)(3.7) = (.5)(5) + (.5)(3.7) = 2.5 + 1.85 = 4.35$$
Next, we partial the pooled variance into variance associated with each group mean
(a process through which we make a further adjustment for the size of each sample):

$$s^2_{M_X} = \frac{s^2_{\mathrm{pooled}}}{N_X} = \frac{4.35}{5} = 0.87$$

$$s^2_{M_Y} = \frac{s^2_{\mathrm{pooled}}}{N_Y} = \frac{4.35}{5} = 0.87$$

This step is not particularly dramatic when we have balanced samples. However, it is easy
to see how if we had different numbers of students in the two classes, this adjustment
would account for that unbalance.
Next, we will calculate the difference variance:

$$s^2_{\mathrm{diff}} = s^2_{M_X} + s^2_{M_Y} = 0.87 + 0.87 = 1.74$$

And finally, we’ll convert the difference variance to the standard error of the difference:

$$s_{\mathrm{diff}} = \sqrt{s^2_{\mathrm{diff}}} = \sqrt{1.74} = 1.32$$

As we mentioned at the start of this section, the process of getting the denominator
value, which represents within-groups variation, is more laborious. Many of the steps
in that calculation are designed to account for cases where we have unbalanced samples
by properly apportioning the variance based on the relative “weight” of the two samples.
But having drudged through those calculations, we are now ready to examine the ratio
of between-groups to within-groups variation.

Between-groups and within-groups variance


Recall from above that the t-test value, like most test statistics we’ll discuss, is a ratio of
between-groups variance to within-groups variance. Thus, the t statistic is calculated as:

$$t = \frac{\bar{X} - \bar{Y}}{s_{\mathrm{diff}}}$$

In our case, we have all the information we need to calculate the test statistic as follows:

$$t = \frac{\bar{X} - \bar{Y}}{s_{\mathrm{diff}}} = \frac{84.00 - 91.80}{1.32} = \frac{-7.80}{1.32} = -5.91$$

Here, the negative sign on the t-test value simply indicates that the second group (Y, or,
in our case, face-to-face instruction) had the higher score. If we had treated online
instruction as group Y and face-to-face instruction as group X, we’d have the same test
statistic, except it would be positive. So, the order of the groups doesn’t matter, but it will
affect the sign (+/−) on the t-test value.
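As a check on the hand calculations, here is a sketch of the same steps in base R (the variable names are our own); the built-in t.test() function should reproduce the same statistic.

```r
# Sketch: reproducing the hand calculation in base R
online <- c(85, 87, 83, 84, 81)   # group X
f2f    <- c(91, 89, 93, 94, 92)   # group Y

s2_x <- var(online)               # 5.00
s2_y <- var(f2f)                  # 3.70

df_x     <- length(online) - 1    # 4
df_y     <- length(f2f) - 1       # 4
df_total <- df_x + df_y           # 8

s2_pooled <- (df_x / df_total) * s2_x + (df_y / df_total) * s2_y    # 4.35
s2_diff   <- s2_pooled / length(online) + s2_pooled / length(f2f)   # 1.74
s_diff    <- sqrt(s2_diff)                                          # about 1.32

(mean(online) - mean(f2f)) / s_diff     # about -5.91
t.test(online, f2f, var.equal = TRUE)   # same t (to rounding), df = 8
```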

Using the t critical value table


Having calculated the t-test value, we know that t8 = − 5.91. Here, the subscripted number
8 signifies the number of degrees of freedom. To determine if that t-test value shows a
significant difference, we will compare it to the t critical value table. Looking at the critical
value table for the t distribution, we find that, at 8 degrees of freedom, the one-tailed criti-
cal value is 1.86, and the two-tailed critical value is 2.31. We’ll discuss a bit more about the
difference between those two values and determine which one we would use below.
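If you prefer not to read a printed table, base R's qt() function returns the same critical values; a quick sketch:

```r
# Critical t values at 8 degrees of freedom, alpha = .05
qt(0.95,  df = 8)   # one-tailed critical value, about 1.86
qt(0.975, df = 8)   # two-tailed critical value, about 2.31
```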

One-tailed and two-tailed t-tests


As discussed in the prior chapter, there are two kinds of hypotheses we might have, and
those will be tested against a slightly different critical value. As a reminder, one-tailed
tests involve directional hypotheses. That means we specify the direction of the differ-
ence in advance. For example, we might have hypothesized that students in face-to-face
classes would get higher exam scores than those in online classes. That would be a one-
tailed research or alternative hypothesis. In our imaginary research scenario, though, the
instructor simply wondered if there would be a difference in final exam scores between
the two course types. Asking if a difference exists, without specifying what kind of differ-
ence we expect, is a two-tailed hypothesis. So, in our case, we have a two-tailed test and
will use the two-tailed critical t-test value.

Interpreting the test statistic


Above, we calculated for our comparison of final exam scores between online and face-
to-face students that t8 = − 5.91, and tcritical = 2.31. To determine if this is a significant dif-
ference, we compare the absolute value of our calculated t-test (in other words, ignoring
the sign) to the critical value. If the absolute value of the calculated t-test is larger than
the critical value, then p < .05. Sometimes people refer to this comparison as “beating
the table.” They mean that, in order to reject the null hypothesis and conclude there was
a significant difference, our calculated value needs to “beat” (exceed) the critical value.
In our case, the absolute value of the calculated t-test was 5.91, which is more than the
critical value of 2.31. So, we know that p < .05, and we reject the null hypothesis. We will
conclude that there was a significant difference in final exam scores between online and
face-to-face students. Next, we want to determine how large the difference between the
two groups was, as well as which group performed better on the final exam.
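Although "beating the table" only tells us that p < .05, software reports the exact probability. A one-line sketch in base R:

```r
# Exact two-tailed probability for t = -5.91 at 8 degrees of freedom
2 * pt(-abs(-5.91), df = 8)   # well below .05 (and below .001), so we reject H0
```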

EFFECT SIZE FOR THE INDEPENDENT SAMPLES t-TEST


As we discussed in the previous chapter, knowing whether a difference is statistically signif-
icant is only one part of answering the research question. Differences that are miniscule can
be statistically significant under the right conditions. But most educational and behavioral
researchers are interested in finding differences that are not only statistically significant but
are also meaningful. We have previously discussed and calculated Cohen’s d as an effect size
estimate. We will demonstrate how to calculate and interpret Cohen’s d for the independent
samples t-test too. Then we’ll explore another effect size estimate, ω2 (or omega squared).

Calculating Cohen’s d
When we learned the one-sample t-test, we also learned to calculate d based on the mean
difference over the standard error. Cohen’s d will work basically the same way in the
independent samples t-test, just with different ways of getting at the mean difference and
standard error. For the independent samples t-test, Cohen’s d will be calculated as follows:

$$d = \frac{\bar{X} - \bar{Y}}{s_{\mathrm{pooled}}}$$

As part of the process of calculating t, we already have all of these terms. The numerator
is the mean difference, which, in the case of the independent samples t-test, is the differ-
ence between the two groups’ means. The denominator is the standard error, which in
our case will be the square root of the pooled variance. The reason it will be the square
root is that we want standard error (spooled), which is the square root of the pooled vari-
ance (s2pooled). In other words:

$$s_{\mathrm{pooled}} = \sqrt{s^2_{\mathrm{pooled}}}$$

So, in our case:

$$s_{\mathrm{pooled}} = \sqrt{s^2_{\mathrm{pooled}}} = \sqrt{4.35} = 2.09$$

We can take all of that information, plug it into the formula for d, and calculate effect
size:

$$d = \frac{\bar{X} - \bar{Y}}{s_{\mathrm{pooled}}} = \frac{84.00 - 91.80}{2.09} = \frac{-7.80}{2.09} = -3.73$$

Remember that, for d, we report the absolute value (dropping the sign), which is why we
reported it here as 3.73 rather than −3.73. That would be a very large effect, according
to the general rules for interpreting d we described in the previous chapter, where any d
larger than .8 is typically considered large. Checking the box for effect size in jamovi will
produce Cohen’s d.
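The same calculation is easy to verify in base R; in this sketch the weights of .5 work only because the two groups contribute equal degrees of freedom.

```r
# Sketch: Cohen's d for the exam-score example
online <- c(85, 87, 83, 84, 81)
f2f    <- c(91, 89, 93, 94, 92)

s2_pooled <- 0.5 * var(online) + 0.5 * var(f2f)     # 4.35 (equal group sizes)
d <- (mean(online) - mean(f2f)) / sqrt(s2_pooled)   # about -3.74; the hand result of
abs(d)                                              # 3.73 reflects rounding to 2.09
```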

Calculating ω2
One of the problems with Cohen’s d is that it can be difficult to interpret. Even the general
cutoff points we described in the previous chapter are, in practice, not particularly use-
ful. Cohen himself suggested that those cutoff points are arbitrary and may not be mean-
ingful in applied research. We also do not know what it means proportionally, especially
because d can theoretically range from zero to infinity. Because of those limitations, and
others, researchers often prefer to use omega squared as an effect size estimate. We will
discuss more about interpreting this effect size estimate below, but it ranges from zero to
one and represents a proportion of explained variance. Because it is a proportional esti-
mate, it is often easier to interpret and make sense of than unbounded estimates like d.
Like d, the formula for omega squared will vary based on the statistical test to which
we apply it. In the case of the independent samples t-test, it is calculated as follows:

$$\omega^2 = \frac{t^2 - 1}{t^2 + N_X + N_Y - 1}$$

In this formula, we already calculated t, and we know the sample size of each group (here
represented as NX and NY). Because all of this information is already known, we can cal-
culate omega squared for our example:

$$\omega^2 = \frac{t^2 - 1}{t^2 + N_X + N_Y - 1} = \frac{(-5.91)^2 - 1}{(-5.91)^2 + 5 + 5 - 1} = \frac{34.93 - 1}{34.93 + 9} = \frac{33.93}{43.93} = 0.77$$

There is one special case for omega squared that is worth pointing out where the formula
will not work: In the case of extremely small t values (where −1 < t < 1), this formula will
return a negative value. Omega squared, as we said above, is bounded between zero and
1—it cannot be a negative value. This is an artifact of the formula. When this happens,
we report omega squared as zero. In other words, when t is between −1 and +1, omega
squared will be zero. The formula will not work in those cases.
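Because this formula (including the clamp-to-zero rule) comes up repeatedly, it can be handy to wrap it in a small helper; this is our own sketch, not a built-in jamovi or R function.

```r
# Hypothetical helper: omega squared from t and the two group sizes
omega_sq <- function(t, n_x, n_y) {
  w2 <- (t^2 - 1) / (t^2 + n_x + n_y - 1)
  max(w2, 0)   # report 0 whenever -1 < t < 1 (the raw formula goes negative)
}

omega_sq(-5.91, 5, 5)   # about .77 for the exam-score example
omega_sq(0.50, 5, 5)    # 0: the special case in action
```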

Interpreting the magnitude of difference


In the previous chapter, we described rough cutoff scores for Cohen’s d, where values
around .2 are small, around .5 are medium, and around .8 are large. We also pointed out
that those cutoff values are of limited value in applied work and suggested interpreting it
in the context of the research literature. Based on the effect sizes being reported in your
area of research, you can judge whether the effect size in your study is smaller, about
the same, or larger than what others have found.
Omega squared is best interpreted in the context of the literature too. However, there
is a really plain, meaningful interpretation of the effect size estimate available. Omega
squared, as we stated above, represents a proportion of explained variance. In the case of
the independent samples t-test, it represents the proportion of variance on the dependent
variable that was explained by group membership on the independent variable. In our
example, that would mean that omega squared indicates the proportion of variance in
final exam scores that was explained by which version (online versus face-to-face) of the
class a student took. Specifically, we can convert omega squared to a percentage of variance
explained by multiplying by 100 (i.e., by moving the decimal point two places to the right).
So, in our example, we found that ω2 = .77. That would indicate that about 77% of the
variance in final exam scores was explained by which version of the class students took.
The wording here is actually fairly important. We know how much variance in exam
scores was explained by group membership. In other words, if we calculate the total
variance in exam scores (s2), 77% of that was explained by whether a student was in the
face-to-face or online version of the class.

A final note is that in our example, we are using fake data. We made the data up for
the purposes of the example. In real research, an effect size this large would be shocking.
By reading the published research in your area, you’ll get a feel for typical effect sizes,
but it is fairly uncommon for omega squared to exceed .20 in educational and behavioral
research. So, when you start working with real data, do not be discouraged to see some-
what smaller effect size estimates than we find in these contrived examples.

DETERMINING HOW GROUPS DIFFER FROM ONE ANOTHER AND INTERPRETING THE PATTERN OF GROUP DIFFERENCES
We have now arrived at the easiest step in the independent samples t-test: figuring out
how the groups differ and what that pattern of differences might mean. This will get a
bit more complicated in future chapters when we introduce larger numbers of groups.
But for the independent samples t-test, it is quite simple. Given that the difference
between groups is statistically significant, we can easily determine which group scored
higher and which scored lower. Perhaps the easiest way to do this is to look at the
group means. If they are significantly different, then the group with the higher mean
scored significantly higher than the other group. Conversely, the group with the lower
mean scored significantly lower than the other group. In our example, the mean for
students in the online class was 84.00, while the mean for students in the face-to-face
class was 91.80. We know the difference is statistically significant because our calcu-
lated t-test value exceeded the critical value. Based on the group means, we know that
students in the face-to-face version of the class scored significantly higher on the final
exam than students in the online class. We know that because 91.80 is a higher exam
score than is 84.00.
We could have made the same determination another way. Our t-test value was nega-
tive. Negative t values indicate that the second group (or group Y) scored higher, while pos-
itive t values indicate the first group (or group X) scored higher. So, the negative t-test value
indicates that group Y scored higher, and in our data, we labelled the face-to-face class as
group Y. As we mentioned above, if we had flipped those labels so that face-to-face was
group X, we would have gotten the same t-test value, except positive. For most students,
it is probably easier to interpret the significant t-test by looking to the group means rather
than interpreting the sign on the t value, but both approaches will yield the same result.

COMPUTING THE TEST IN JAMOVI


Before turning to writing up the results, we’ll demonstrate how to calculate the inde-
pendent samples t-test in the jamovi software. Because we’ve demonstrated how to test
the assumption of normality in a previous chapter, we’ll begin directly with the inde-
pendent samples t-test, as though we have already checked that assumption.
First, we will go to the Data tab at the top of the screen. By default, jamovi starts with
three blank variables, but we will need just two: (1) Final Exam Score and (2) Class Ver-
sion. To set up the first variable, we will click on the first column, and then click “Setup” at the top of the
screen. When naming the variable (the top field on the Setup screen), remember that
variable names cannot start with a number and cannot contain spaces. For example, we
might name the first variable FinalExamScore and the second ClassVersion. For Final-
ExamScore, we will also select the button for “Continuous” data type because the exam
scores are ratio data.

We could name them anything we want, so long as it starts with a letter and doesn’t
contain spaces, but it should be something we can easily identify later on. We can also
give the variable a better description on the “Description” line of the setup menu, which
can contain any kind of text so we can include a clearer description, if needed. Next, we
will set up the second column, which we might name ClassVersion. This variable will be
a Nominal data type, because the class versions are nominal data. We can also label the
groups using the “Levels” feature. Note that this will not work until after we have typed
in the data so the software knows what group numbers we will need to label. In our case,
the data will have two groups, which we will simply number 1 and 2. Group 1 will be the
Online Class group and group 2 will be the Face-to-Face Class group.

The step of adding group labels is optional—the analysis will run just fine without setting
up group labels. However, if we do take the time to go in and set up group labels (which,
again, needs to be done after data entry, so we are going just a little bit out of order to
show the data setup all together), the output will be labelled in the same way, making
it easier to interpret. One final step we may want to take is to delete the third variable
(automatically named “C”) that jamovi created by default. To do so, right click on the
column header “C” and click “Delete Variable.” It will prompt you to confirm you wish to
delete the variable. Simply click “OK” and the variable will be permanently deleted. This
step is also optional but results in a somewhat cleaner data file.
To enter the data, we simply type it into the spreadsheet. Note that the group mem-
bership will be entered as 1 for the Online Class and 2 for the Face-to-Face class, but if
you set up the group labels in jamovi, you will see the group names in the spreadsheet as
in this example. If you do not add the group labels, the spreadsheet will show the 1 or 2
in those cells instead.

Now that the data file is set up and the data entered, we are ready to run the independent
samples t-test. At the top of the window, click on the “Analyses” tab, then the “T-Tests”
button (note that due to the software formatting, it has a capital T although the t in t-test
is lowercase). Then choose the independent samples t-test.

In the resulting menu, click FinalExamScore and then the arrow next to the Depend-
ent Variables area to set that as the dependent variable. Then click on ClassVersion and
then the arrow next to Grouping Variable to set that as the independent variable. Notice
there are a number of options showing on the screen. By default, jamovi will have the
box for “Student’s” checked. This is the uncorrected t-test, which is what we will use if
the assumption of homogeneity of variance was met. To produce that test, we will check
the box next to “Equality of variances” under “Assumption Checks.” Another option
we will want to check is the “Descriptives” box under “Additional Statistics.” Note also
that there are options under “Hypothesis,” and by default it has a two-tailed hypothe-
sis checked. Probably the easiest way to conduct the test is to always leave that option
checked, and simply divide p by two if the test was actually one-tailed. Another option to
check is the “Confidence interval” under “Additional Statistics.” Finally, as we described
above, the “Effect size” option will produce Cohen’s d, but typically we would be more
interested in calculating omega squared (ω2).

One thing you may notice about jamovi as you select all the options is that the out-
put shows up immediately to the right, and updates in real time as you select different
options. This is a nice feature of jamovi compared to some other analysis software, as
it allows us to select different options without having to entirely redo the analysis. For
example, if we find that the assumption of homogeneity of variance was not met, we can
check the box for Welch’s correction and the output will update accordingly.
The output will start with the independent samples t-test results, then the assumption
tests, and finally the group descriptive statistics. However, we will discuss the output
starting with the assumption tests, because that output would inform how we approach
the main analysis. Note that for Levene’s test, jamovi produces an F ratio, degrees of free-
dom, and a p value. As discussed earlier in this chapter, if p > .050, then the assumption
was met. In this case, F1, 8 = .044, p = .839, so the assumption was met. As a result, we
can proceed with the Student's (or standard, uncorrected) t-test. If the data had not met
the assumption, we could choose the Welch's correction and interpret it instead of the
Student's test.

Next, we will look at the independent samples t-test output. We see that t at 8 degrees of
freedom is −5.913, and p < .001. Because p < .050, the difference in exam scores between
online and face-to-face students was statistically significant. The software will, by default,
produce two-tailed probabilities for independent samples t-test. In our example, that
works because our hypothesis was two-tailed. If we needed to produce a one-tailed test,
we could simply divide the probability value reported in jamovi in half. We are also given
the 95% confidence interval, which is based on the standard error. From this, we can
determine that, across repeated samples of this same size from the same population,
intervals constructed this way would capture the true mean difference 95% of the time;
here, that interval runs from −10.842 to −4.758.
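That interval can be reconstructed from the pieces we calculated by hand earlier in the chapter: the mean difference, the standard error of the difference, and the two-tailed critical t value. A sketch in base R:

```r
# Sketch: 95% CI for the mean difference, built from the hand-calculated pieces
mean_diff <- 84.00 - 91.80       # -7.80
s_diff    <- sqrt(1.74)          # standard error of the difference, about 1.32
t_crit    <- qt(0.975, df = 8)   # about 2.306

mean_diff + c(-1, 1) * t_crit * s_diff   # about -10.84 and -4.76, matching jamovi
```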

A note here about rounding: The default in jamovi is to round to the hundredths place,
or two places after the decimal. We described this in an earlier chapter, but typically we
will want to report statistical test results to the thousandths place (three after the dec-
imal). There is an easy setting for this in jamovi. Simply click the three vertical dots in
the upper right corner of the software, and then change the “Number format” to “3 dp”
and the “p-value format” to “3 dp”. In this same menu, if we click “Syntax mode”, jamovi
will display the code used to produce the output. The jamovi software is based on R, a
programming language commonly used for statistical analysis. Taking a look and getting
familiar with R coding can be very useful, especially because there are some advanced
analyses (beyond the scope of this text) that might require using R directly.
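For readers curious about what an equivalent analysis looks like in R itself, here is a base-R sketch that parallels the point-and-click steps above. It is an approximation, not a copy, of what jamovi's syntax mode shows (jamovi's own syntax calls its jmv R package, which we do not reproduce here).

```r
# Sketch: a base-R analysis paralleling the jamovi steps above
FinalExamScore <- c(85, 87, 83, 84, 81, 91, 89, 93, 94, 92)
ClassVersion   <- factor(rep(c("Online", "FaceToFace"), each = 5),
                         levels = c("Online", "FaceToFace"))  # keep Online first so
                                                              # the sign of t matches

t.test(FinalExamScore ~ ClassVersion, var.equal = TRUE)  # Student's t-test
```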

WRITING UP THE RESULTS


The conventions for writing up results will vary a bit based on the discipline and subdis-
cipline. However, we provide here a general guide for writing up the results that will be
helpful in most cases:

1. What test did we use, and why?


2. What was the result of that test?
3. If the test was significant, what is the effect size? (If the test was not significant,
simply report effect size in #2.)
4. What is the pattern of group differences?
5. What is your interpretation of that pattern?

You can see that we are suggesting a roughly five-sentence paragraph to describe the
results of an independent samples t-test. Here is an example of how we might answer
those questions for our example:

1 What test did we use, and why?


We used an independent samples t-test to determine if final exam scores differed
between students taking an online versus a face-to-face version of the class.
2 What was the result of that test?
Final exam scores were significantly different between students in the two ver-
sions of the class (t8 = −5.91, p < .001).
3 If the test was significant, what is the effect size? (If the test was not significant,
simply report effect size in #2.)
About 77% of the variance in final exam scores was explained by the version of
the course in which students were enrolled (ω2 = .77).
4 What is the pattern of group differences?
Students in the face-to-face version of the class (M = 91.80, SD = 2.24) scored
higher on the final exam than did those in the online version of the class
(M = 84.00, SD = 1.92).
5 What is your interpretation of that pattern?
Among the present sample, students performed better on the final exam in the
face-to-face version of the class.

We could pull this all together to create a results paragraph something like:

We used an independent samples t-test to determine if final exam scores dif-


fered between students taking an online versus a face-to-face version of the
class. Final exam scores were significantly different between students in the
two versions of the class (t8 = −5.91, p < .001). About 77% of the variance in
final exam scores was explained by the version of the course in which students
were enrolled (ω2 = .77). Students in the face-to-face version of the class
(M = 91.80, SD = 2.24) scored higher on the final exam than did those in
the online version of the class (M = 84.00, SD = 1.92). Among the present
sample, students performed better on the final exam in the face-to-face
version of the class.

Notes
1 While the assumption of homogeneity of variance actually refers to population variance,
Levene’s test only assesses sample variance. That is, the assumption of homogeneity of vari-
ance is that the population variances for the two groups are equal. But, because we do not
have data from the full population, Levene’s test uses sample variances. As a result, we have
expressed the null and alternative hypotheses for Levene’s test using Latin notation, rather
than Greek notation. Other texts will show Levene’s test hypotheses in Greek notation (trad-
ing sigma for s) because the assumption is actually about the population.
2 In this example, as in most computation examples in this text, the sample size is quite small.
This is to make it easier to follow the mechanics of how the tests work. In actuality, this
sample size is inadequate for a test like the independent samples t-test. But, for the purposes
of demonstrating the test, we’ve limited the sample size.
7 Independent samples t-test case studies

Case study 1: written versus oral explanations 105


Research questions 106
Hypotheses 106
Variables being measured 106
Conducting the analysis 107
Write-up 108
Case study 2: evaluation of implicit bias in graduate school applications 109
Research questions 109
Hypotheses 110
Variables being measured 110
Conducting the analysis 110
Write-up 112
Note 112

In the previous chapter, we explored the independent samples t-test using a made-up exam-
ple and some fabricated data. In this chapter, we will present several examples of published
research that used an independent samples t-test. For each case study, we encourage you to:

1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the t-test.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.

CASE STUDY 1: WRITTEN VERSUS ORAL EXPLANATIONS


Lachner, A., Ly, K., & Nückles, M. (2018). Providing written or oral explanations? Differential effects of the modality of explaining on students' conceptual learning and transfer. Journal of Experimental Education, 86(3), 344–361. https://doi.org/10.1080/00220973.2017.1363691
The first case study examined whether two methods of explanations (written versus oral)
of how combustion engines work would result in differences in two learning outcomes:
conceptual knowledge and transfer knowledge. The researchers randomly assigned the
participants to the two types of explanations and then scored them on the two outcomes.
The authors were interested in determining whether different explanation types would
result in differences in students’ quality of explanations and learning. They examined if
there were average differences in learning outcomes between students who provided writ-
ten explanations versus students who provided oral explanations to fellow students on a text
that described combustion engines. They conducted the study in two phases. During phase
one, they asked the students to read a text describing internal combustion engines. In phase
two, they randomly assigned students to generate a written or oral explanation for a fictitious
student who had no scientific knowledge about internal combustion engines. Then the two
groups of students took tests measuring conceptual knowledge and transfer knowledge.

Research questions
The researchers were interested in determining:

1. If there were differences in conceptual knowledge test scores between students


who generated a written explanation versus those students who generated an oral
explanation.
2. If there were differences in transfer knowledge test scores between students who gen-
erated a written explanation versus those students who generated an oral explanation.

Hypotheses
The authors hypothesized the following related to conceptual knowledge:

H0: There was no difference in conceptual knowledge scores between students gener-


ating written versus oral explanations.
H1: There was a difference in conceptual knowledge scores between students generat-
ing written versus oral explanations.

The authors hypothesized the following related to transfer knowledge:

H0: There was no difference in transfer knowledge scores between students generat-


ing written versus oral explanations.
H1: There was a difference in transfer knowledge scores between students generating
written versus oral explanations.

Variables being measured


There were two hypotheses that the researcher tested: conceptual knowledge and trans-
fer of procedural knowledge. For the conceptual knowledge hypothesis, the dependent
variable (DV) was conceptual knowledge test scores (12-item multiple-choice test). The
authors reported that they evaluated the content validity of the conceptual knowledge
test by using a subject-matter expert to check the correctness of the questions and the
possible answers. For the transfer hypothesis, the dependent variable was transfer test
scores (two open-ended questions) rated by two raters. The raters generated scores for
each student, and each question was worth 13 points, for a maximum of 26 points on the
transfer test. The interrater reliability of the two raters (ICC = .98) indicates that the
raters were consistent in scoring students' transfer knowledge.

Conducting the analysis


1. What test did they use, and why?
The researchers used two independent samples t-tests to determine if conceptual
knowledge and transfer knowledge would differ between students generating writ-
ten versus oral explanations. Because the authors used two independent samples
t-tests, they set Type I error at .025, using the Bonferroni inequality to adjust for
familywise error.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The two dependent variables for this study were conceptual knowledge and
transfer knowledge scores. Both scores are on interval scales.
b. Normality of the dependent variable
In most cases, when this assumption is met, the article will not report normality
statistics. Because the authors did not report about normality, we must infer
this assumption was met. However, normally researchers would evaluate this
assumption before running the analysis (even if they do not write about it in the
article).1 We would check for normality using skewness and kurtosis statistics.
c. Observations are independent
They met the independence assumption by the fact that each student in the
study responded to the two tests independent of one another. They did not
respond to the test questions in pairs or as a group. Also, the researchers did not
note any meaningful nested structure to the data.
d. Random sampling and assignment
The authors indicate that they randomly assigned the students to do written
or oral explanation. The study meets the random assignment assumption. The
sample is not random. It appears to be a convenience sample, which may raise
some issues about sampling bias and generalizability.
e. Homogeneity of variance
The assumption was met for transfer knowledge (F = 3.239, p = .078), but not
for conceptual knowledge (F = 6.587, p = .014). So the authors used Welch’s
correction for heterogeneous variances on conceptual knowledge.
3. What was the result of that test?
There was a significant difference in transfer knowledge between those generat-
ing written versus oral explanations (t46 = 2.317, p = .025). However, there was no
significant difference in conceptual knowledge scores between those generating
written versus oral explanations (t38.447 = −1.324, p = .192).
4. What was the effect size, and how is it interpreted?
The authors reported Cohen’s d. However, we could calculate omega squared for
each test:
$$\text{Transfer: } \omega^2 = \frac{t^2 - 1}{t^2 + N_X + N_Y - 1} = \frac{2.317^2 - 1}{2.317^2 + 24 + 24 - 1} = \frac{5.368 - 1}{5.368 + 47} = \frac{4.368}{52.368} = .083$$

$$\text{Conceptual: } \omega^2 = \frac{t^2 - 1}{t^2 + N_X + N_Y - 1} = \frac{1.324^2 - 1}{1.324^2 + 24 + 24 - 1} = \frac{1.753 - 1}{1.753 + 47} = \frac{0.753}{48.753} = .015$$
From these calculations, we can determine that about 8% of the variance in transfer
knowledge was explained by the type of explanation (ω2 = .083). We would not in-
terpret the effect size for conceptual knowledge because the test was nonsignificant.
5. What is the pattern of group differences?
Those producing oral explanations (M = 11.062, SD = 3.446) scored higher on
transfer knowledge than those generating written explanations (M = 9.146,
SD = 2.134). For conceptual knowledge, there was no significant difference between
the oral (M = 8.333, SD = 1.834) and written (M = 8.917, SD = 1.139) conditions.

Write-up

Results

We used independent samples t-tests to determine if conceptual knowl-


edge and transfer knowledge would differ between those producing writ-
ten explanations and oral explanations of internal combustion engines.
The assumption of homogeneity of variance was met for transfer knowledge (F = 3.239, p = .078),
but not for conceptual knowledge (F = 6.587, p = .014). As a result, we
applied the correction for heterogeneous variances to the test for concep-
tual knowledge. Because we used two t-tests, we adjusted the Type I
error rate to account for familywise error. Using the Bonferroni inequal-
ity, we set α = .025. There was a significant difference in transfer knowl-
edge between those generating written versus oral explanations (t46 =
2.317, p = .025). However, there was no significant difference in concep-
tual knowledge scores between those generating written versus oral
explanations (t38.447 = −1.324, p = .192, ω2 = .015). About 8% of the vari-


ance in transfer knowledge was explained by the type of explanation (ω2
= .083). Those producing oral explanations (M = 11.062, SD = 3.446)
scored higher on transfer knowledge than those generating written expla-
nations (M = 9.146, SD = 2.134). For conceptual knowledge, there was
no significant difference between the oral (M = 8.333, SD = 1.834) and
written (M = 8.917, SD = 1.139) conditions.

Now, compare this version, which follows the format we suggested in Chapter 6, to the
published version. What is different? Why is it different? Notice that, in the full article,
the t-tests are just one step among several analyses the authors used. Using the t-test in
conjunction with other analyses, as these authors have done, results in some changes in
how the test is explained and presented.

CASE STUDY 2: EVALUATION OF IMPLICIT BIAS IN GRADUATE SCHOOL APPLICATIONS
Strunk, K. K., & Bailey, L. E. (2015). The difference one word makes: Imagining sexual
orientation in graduate school application essays. Psychology of Sexual Orientation and
Gender Diversity, 2(4), 456–462. https://doi.org/10.1037/sgd0000136.
In this second case study, the researchers were interested in understanding how partic-
ipants might rate graduate school application essays differently based on slight changes in
the essays. Specifically, the authors sought to test whether participants would rate appli-
cants differently if they perceived them to be gay versus straight (the fictitious applicant
was a cisgender man). In the first part of this manuscript, the authors test whether one-
word changes to the essay were sufficient to induce participants to identify the fictitious
applicant as gay versus straight. They found that one-word changes in the essay, in par-
ticular changing the word the fictitious applicant used to refer to his significant other
(“wife” versus “partner” or “husband”), were sufficient to induce sexual identification. In
the second part of the study, which we will review in this case study, the authors examined
how those one-word differences in the essay might be related to differences in ratings of
the fictitious applicant’s essay. Participants were randomly assigned to view versions of the
same essay with differences of one word (“wife” versus “partner” or “husband”). The essays
were otherwise identical, and participants completed ratings forms about the applicants.

Research questions
The researchers wanted to know if participants would rate the essays using the words
partner or husband (implying the author was gay) differently than they rated the
essay using the word “wife” (implying the author was straight). Their literature review
suggested that participants might have an implicit bias against gay applicants, which
might result in lower ratings on some scales. In particular, they wanted to test differences
in perceived “fit” with the graduate program, because fit is a subjective quality where
implicit or unconscious bias would be more likely to manifest. They also tested differ-
ences in rating of preparedness for graduate school.

Hypotheses
The authors hypothesized the following related to ratings of fit:

H0: There was no difference in ratings of fit between participants reading the wife
essay version versus the husband or partner essay versions.
H1: Participants reading the wife version would provide higher fit ratings than those
reading the husband or partner versions.

The authors hypothesized the following related to ratings of preparedness:

H0: There was no difference in ratings of preparedness between participants reading


the wife essay version versus the husband or partner essay versions.
H1: Participants reading the wife version would provide higher preparedness ratings
than those reading the husband or partner versions.

Notice that these hypotheses are one-tailed. They specify a direction of difference—that
the wife essay will get higher ratings. By default, jamovi produces two-tailed probabilities for
t-tests, so we will have to divide those probabilities by two to get the one-tailed probabilities.

Variables being measured


The dependent variables were ratings of fit and ratings of preparedness. For prepared-
ness, they used a 7-point Likert-type scale ranging from “1 = not well prepared at
all” to “7 = extremely well prepared.” For ratings of fit, they used a 7-point Likert-type
scale ranging from “1 = not at all” to “7 = extremely well.” Because the authors used
single-item measures, they did not report reliability coefficients. The independent varia-
ble was randomly assigned groups, with the first group being those randomly assigned to
read the essay that used the word “wife,” and the second being those randomly assigned
to read the essays that used the word “partner” or “husband.”

Conducting the analysis


1. What test did they use, and why?
The authors used two independent samples t-tests, one for ratings of fit and the
other for ratings of preparedness. The authors argued a Type I error correction for
multiple comparisons was not needed in this case due to the nature of the research
question and the variables under analysis.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
Both of the dependent variables were measured on Likert-type scales. As we


discussed in Chapter 2, most educational and behavioral research will treat
Likert-type data as interval.
b. Normality of the dependent variable
As in the first case study, the authors did not report information on normality.
This is typical in published articles where the assumption of normality was met.
However, in the process of analyzing the data, researchers should check nor-
mality prior to conducting the main analysis, even if that analysis will not ulti-
mately be part of the publication. Normality could be assessed using skewness
and kurtosis statistics.
c. Observations are independent
The observations appear to be independent, and participants were not paired or
otherwise in any kind of nested structure, according to the information available.
d. Random sampling and assignment
Participants were randomly assigned to read one of the three versions of the
essay (wife, partner, or husband). However, the participants were not randomly
sampled. They appear to be a convenience sample from a single university.
e. Homogeneity of variance
The authors did not present Levene’s test in the published manuscript. However,
this is typical when the assumption was met. In the data on the online resources
for this case study, we can calculate Levene’s test, and see that the assumption
was met for both fit ratings (F = 1.656, p = .202), and for preparedness ratings
(F = 2.675, p = .107).
3. What was the result of that test?
There was a significant difference in ratings of fit (t68 = 2.178, p = .016), but not for
ratings of preparedness (t68 = .668, p = .253). For the probabilities, it is important
to remember these were one-tailed hypotheses, so we have divided the probability
value by 2 because jamovi produces two-tailed probabilities.
4. What was the effect size, and how is it interpreted?

$$\text{Fit: } \omega^2 = \frac{t^2 - 1}{t^2 + N_X + N_Y - 1} = \frac{2.178^2 - 1}{2.178^2 + 32 + 38 - 1} = \frac{4.744 - 1}{4.744 + 69} = \frac{3.744}{73.744} = .051$$

About 5% of the variance in ratings of fit was explained by whether participants


read the wife or the partner/husband version of the essay (ω2 = .051).
For preparedness, −1 < t < 1 so ω2 = .000, as we reviewed in Chapter 6. If we used the
formula for this t value, we would get a negative result but would report it as .000.
5. What is the pattern of group differences?
Participants reading the wife essay (M = 5.531, SD = 1.270) rated the applicant’s fit
higher than those reading the partner or husband essay (M = 4.911, SD = 1.075).
However, there were no differences in preparedness ratings between the group
reading the wife essay (M = 5.437, SD = 1.645) versus those reading the partner or
husband essay (M = 5.211, SD = 1.189).

Write-up

Results

To determine if participant ratings of fit with the program and prepared-


ness for graduate school would differ between those reading the wife
versus the partner or husband versions of the essays, we used two inde-
pendent samples t-tests. There was a significant difference in ratings of fit
(t68 = 2.178, p = .016), but not for ratings of preparedness (t68 = .668,
p  =  .253, ω2 = .000). About 5% of the variance in ratings of fit was
explained by which essay version participants read (ω2 = .051). Participants
reading the wife essay (M = 5.531, SD = 1.270) rated the applicant’s fit
higher than those reading the partner or husband essay (M = 4.921, SD =
1.075). However, there were no differences in preparedness ratings
between the group reading the wife essay (M = 5.437, SD = 1.645) versus
those reading the partner or husband essay (M = 5.211, SD = 1.189).

Again, compare this to the published study to see how they differ. In this case, we’ve
focused on Study 2 of the manuscript, but the authors have written about the t-test in a
rather different way than we have here because it was one of multiple analyses they used.
This is very typical in published work—to see multiple analyses in a single paper. This is
perhaps especially true of the use of the independent samples t-test, which is often used
as an additive or preliminary analysis to other tests. However, the independent samples
t-test can certainly stand on its own, especially in experimental research.
In the next chapter, we’ll move on to comparisons of more than two groups using the
one-way ANOVA. It functions in similar ways to the t-test but also has some key differ-
ences. The ANOVA is a more general form of the t-test, because it can test any number
of groups, while the t-test can only test two groups at a time.
For additional case studies, see the online eResources, which include dedicated
examples related to race and racism in education.

Note
1 The process by which we simulate data for these case studies results in data that are almost
perfectly normally distributed. Remember that the example datasets on the online resources
are not actual human subjects’ data, but simulated data to reproduce the outcomes of the case
study articles. If you decide to run the tests for normality for practice on these datasets, keep
in mind they will be nearly perfect due to the manner in which we have simulated those data.
8 Comparing more than two sample means: The one-way ANOVA

Introducing the one-way ANOVA 114


The F distribution 114
Familywise error and corrections 115
Research design and the one-way ANOVA 117
Assumptions of the one-way ANOVA 118
Level of measurement for the dependent variable is interval or ratio 118
Normality of the dependent variable 118
Observations are independent 118
Random sampling and assignment 119
Homogeneity of variance 119
Calculating the Test Statistic F 120
Calculating the one-way ANOVA 120
Using the F critical value table 125
F is always a one-tailed test 125
Interpreting the test statistic 125
Effect size for the one-way ANOVA 126
Calculating omega squared 126
Interpreting the magnitude of difference 126
Determining how groups differ from one another and interpreting the
pattern of group differences 126
Post-hoc tests 127
A priori comparisons 130
Computing the one-way ANOVA in jamovi 134
Computing the one-way ANOVA with post-hoc tests in jamovi 134
Computing the one-way ANOVA with a priori comparisons in jamovi 140
Writing Up the Results 140
Writing the one-way ANOVA with post-hoc tests 141


In the previous chapter, we explored the independent samples t-test as a way to compare
two group means. However, many research designs will involve more than two groups,
and the t-test is an inefficient way to conduct those comparisons. In this chapter, we will
encounter the one-way analysis of variance (ANOVA, for short) for comparing more
than two group means. We will also explore how it is related to the t-test, and why we
cannot just use multiple t-tests to do multiple group comparisons.

INTRODUCING THE ONE-WAY ANOVA


The one-way analysis of variance, or ANOVA (not italicized because it is an abbreviation,
not notation), is the first version of the ANOVA we will learn. As we move through future
chapters, we will discover that the ANOVA has a number of different iterations, and because
of that is applicable to a wide range of research designs. However, in this chapter we will
learn the simplest version of this test—the one-way ANOVA. In the most basic sense, the
fundamental difference between the t-test and the ANOVA is that the t-test requires two
groups for comparison, while the ANOVA can compare more than two groups.
It is perhaps worth pausing here to point out that the ANOVA can also handle com-
parisons of only two groups. In fact, the t-test is merely a special case of the ANOVA
where there are only two groups. Both the t-test and the ANOVA are derived from the
General Linear Model but have different applications. Both test the ratio of between-
groups variation (in the t-test, measured by the mean difference between two groups)
and within-groups variation or error (in the t-test, measured by the standard error of the
mean difference). The t-test is a special case of the ANOVA, and so has simpler calcula-
tions and interpretations, but accomplishes that simplification by limiting the number of
groups we can compare to two. The ANOVA is conceptually and computationally more
complex, but allows us to handle more than two groups.
Because the t-test is simpler than the ANOVA, if we have only two groups, we would
use the independent samples t-test. The ANOVA would produce a mathematically equiv-
alent result (in fact, we'll learn that t² = F, where F is the ANOVA test statistic), but is a
more complex test than we really need in that instance. In other words, the ANOVA is
overkill if we have only two groups, and we would opt for the simpler t-test. In the event
of more than two groups, we will need an ANOVA. But first, we will explore some of the
features of the ANOVA, research design issues in the one-way ANOVA, and discuss the
assumptions of this test.

The F distribution
The ANOVA produces an F statistic, unlike the t-test, which produces a t statistic. The F here stands for Fisher, but for our purposes we will typically refer to it as the F test or the F ratio. F has some characteristics that are a bit different from t, though. Both F and t have known sampling distributions (as we described in the prior section), so that given a test statistic and degrees of freedom, we can calculate the probability associated with that test result. The t distribution was a symmetric, bell-shaped distribution with a mean, median, and mode of zero (not unlike the z distribution). The F distribution takes on a different
shape, though. And it does so because F cannot be negative (we’ll discover why shortly).
Because of that, the F distribution is not normal, unlike z. While it won’t be particu-
larly important that you can visualize the F distribution, the graphic below shows how it
might be shaped in several different situations.

As illustrated by this figure, the shape of the F distribution is quite different from pre-
vious sampling distributions we have encountered. Its shape varies based on the degrees
of freedom, of which there are infinite possible combinations. However, our interaction
with the F distribution will be quite similar. For hand-calculated examples, we will look
up the critical value of F in a table, and if our calculated value exceeds the critical value
(if we “beat the table”), then we reject the null hypothesis.

Familywise error and corrections


One of the issues the one-way ANOVA corrects for is the problem of multiple compar-
isons and familywise error. The ANOVA can handle multiple groups, while the t-test
can only handle two groups at a time. In other words, one ANOVA can do the work of
multiple t-tests. Take a case where we might have three groups we want to compare. If all
we had available was the t-test, we could do a t-test to compare group 1 versus group 2,
another t-test to compare group 2 versus group 3, and a third t-test to compare group 1
versus group 3. So it would take three t-tests to do the work of one ANOVA. As we add more groups (more levels on the independent variable), the number of required t-tests grows rapidly: with k groups, we would need k(k − 1)/2 pairwise tests.

Why not use multiple t-tests?


Nevertheless, it is still natural to wonder why we cannot just do lots of t-tests. After all,
we already know that test, and it seems like it should work. There are problems with
using multiple t-tests, though. One is that, when we split up our groups into pairs so that
we can do t-tests, we might be missing larger patterns of difference. But the larger prob-
lem is error. By performing multiple tests, we can inflate the Type I error rate.

Familywise error
This problem is referred to as a familywise error. Here, we are thinking about the set
of data as a “family” in that the data are related to one another. When we perform
multiple tests on that same family of data, the Type I error piles up. We usually set
testwise Type I error (that is the Type I error rate for an individual test) at .05. But
if I do multiple tests in the same family of data, the error compounds. The formula for computing how much error we get is $\alpha_{fw} = 1 - (1 - \alpha_{tw})^c$, where c is the number of tests and $\alpha_{tw}$ is the testwise Type I error rate. It won't be important to learn this formula, but we'll use it briefly to illustrate how much familywise error we can accumulate.
In the case where we have set testwise Type I error at .05 and we are doing three tests (as in the example above, to compare three groups), $\alpha_{fw} = 1 - (1 - \alpha_{tw})^c = 1 - (1 - .050)^3 = 1 - .950^3 = 1 - .857 = .143$. In other words, I have a 14.3% chance of making a Type I error across the set of tests. This gets much worse as we expand the number of groups. For example, if we want to compare five groups using multiple independent samples t-tests, we would need to do ten t-tests (1 vs. 2, 1 vs. 3, 1 vs. 4, 1 vs. 5, 2 vs. 3, 2 vs. 4, 2 vs. 5, 3 vs. 4, 3 vs. 5, 4 vs. 5). Using our formula, $\alpha_{fw} = 1 - (1 - \alpha_{tw})^c = 1 - (1 - .050)^{10} = 1 - .950^{10} = 1 - .599 = .401$. That is a 40.1% chance of making a Type I error across the set of tests. This is unacceptably high and illustrates why we do not use multiple tests in the same family of data. As we do, it becomes increasingly likely that we are claiming differences exist that, in reality, do not exist.
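
If you would like to verify these familywise error figures, they are easy to reproduce outside of jamovi. The short Python sketch below is purely illustrative (jamovi does not require any programming), and the function name familywise_error is our own:

```python
# Illustrative check of the familywise error formula; not part of the jamovi workflow.
def familywise_error(alpha_testwise, number_of_tests):
    """Return the familywise Type I error rate for a set of tests."""
    return 1 - (1 - alpha_testwise) ** number_of_tests

print(round(familywise_error(0.05, 3), 3))   # 0.143 for three comparisons
print(round(familywise_error(0.05, 10), 3))  # 0.401 for ten comparisons
```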

The Bonferroni correction


One way of correcting for familywise error is the Bonferroni correction. It is a simple
method of correcting for familywise Type I error, in which we simply divide the testwise
Type I error rate by the number of comparisons to arrive at our adjusted α. For example, if we are going to do three tests, we would adjust alpha by dividing by three (.050/3 = .0167). In that case, we would require p < .0167 before rejecting the null hypothesis and concluding a difference was significant. Similarly, if we planned ten tests, we would divide alpha by ten (.050/10 = .005), and require p < .005 before rejecting the null and concluding a difference was significant. However, in doing so, we have lost some statistical power. In other words, a difference would need to be quite large (due to the smaller α) before I could conclude it was significant.
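
The Bonferroni adjustment is equally easy to check by hand or with a few lines of code; the Python sketch below (again illustrative only, not part of the jamovi workflow) simply divides the testwise alpha by the number of planned tests:

```python
# Bonferroni-adjusted alpha for different numbers of planned tests.
alpha = 0.05
for number_of_tests in (3, 10):
    print(number_of_tests, round(alpha / number_of_tests, 4))  # 3 -> 0.0167, 10 -> 0.005
```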

Omnibus tests and familywise error


Because of the issues of a loss of power if we adjust alpha, and the problem that multiple
pairwise tests might not give a clear indication of overall patterns, we instead prefer to
use an omnibus test. Omnibus tests are simply overall tests that evaluate the entire set
of data all at once. In the case of the ANOVA, this will mean evaluating how different
all of the groups are, taken together. Because this test includes all of the groups, we only
need one, and thus avoid the problem of familywise error. So rather than needing lots of
independent samples t-tests, we just need one ANOVA.
It is a good general rule that if we find ourselves thinking we need to run the same test multiple times on the same set of data, there is probably a better test available.
Our need for corrections like the Bonferroni correction should be infrequent, because
we will prefer higher-level tests where only one test is needed. As an added benefit, those
tests usually offer us some additional options and information versus their lower-level
counterparts. As we will discover in this chapter, that is the case here. The ANOVA offers
us some more interpretive options versus the t-test, which is another reason we prefer it
when there are more than two groups to compare.

RESEARCH DESIGN AND THE ONE-WAY ANOVA


We have talked so far about the ANOVA as an omnibus generalization of the t-test. We
have discussed that it can handle more than two groups at a time and does so with a
single test, avoiding issues with familywise error. Now we will discuss some research
designs where the ANOVA would be appropriate and the design considerations for
working with ANOVAs.
Fundamentally, the one-way ANOVA will be used for designs where there are more
than two independent groups. In Chapter 6, we used an example of an observational
study where we might track course performance of online versus face-to-face students.
If we also had a blended or hybrid version of that class (one that involves both online
and face-to-face components), we might want to compare all three versions of the class.
Because there would then be three groups, we would use a one-way ANOVA. As we
discussed in the prior chapter, one of the design limitations of that study would be that
students self-select into courses. Students who want or need the online course might
differ in a variety of ways from students who want or need the face-to-face version, and
students choosing the hybrid version might have other unique characteristics. Because
of this self-selection, it is difficult to claim that group differences are because of the ver-
sion of the class, rather than attributable to other between-groups differences.
Now let us imagine a different research scenario altogether, and we’ll follow this
through most of the rest of this chapter. Let us say we are interested in finding ways
to reduce racial stereotypes among grade school-aged children. We take three classes
of third-grade students. In the first class, students complete an assignment where they
produce a short documentary-style film about a country outside the United States. In
the second class, students write letters back and forth (become “pen pals”) with a student
of another race at a different school district. In the third class, students complete their
normal coursework. We administer a test of implicit bias after the school year ends to
assess the degree to which students still hold racial biases.
In this example, we no longer have the problem of self-selection, because the students
did not choose which teacher/class to take. However, we still have a potential challenge.
Assuming the three classes have the same teacher and curriculum, we still cannot account
for differences in the backgrounds of students in the three classes. We did not randomly
assign individual students to treatment conditions but instead assigned the entire class.
Practically, that is the only option. It would not be feasible to have different students in
the same class doing very different assignments. Moreover, even if we could get random
assignment at the student level, we then introduce the possibility that students will share
their assignments and experiences with their classmates. If they do, we potentially violate
the assumption of independence. What this means for our consideration of the research
design is that we will be very careful about attributing any between-groups differences
to the three treatments. Instead, we will make claims about an association between the
treatment type and the differences.
Design options do exist to deal with this limitation. Most notably, we could use either
mechanical or mathematical matching (discussed in the first section of this book) to create
a quasi-experimental design. For that approach to work, we would need a large number of
students in each of the three conditions, because we will lose some students from each con-
dition when they don’t have a close match in the other conditions. Often, perhaps especially
in school-based research, getting a large sample is a big challenge, so we might decide to use
the classes as they are, with the caveat that our inferences will be more constrained.

ASSUMPTIONS OF THE ONE-WAY ANOVA


Because the ANOVA and the t-test are both part of the general linear model, they share
similar sets of assumptions. In fact, in the case of the one-way ANOVA, the assumptions
are nearly identical. We will discuss each one briefly here but will focus on how the
ANOVA design might change our evaluation of these assumptions.

Level of measurement for the dependent variable is interval or ratio


As we discussed in the previous chapter, this assumption has to do with the type of data
we use as a dependent variable. We must use continuous data, which includes interval
or ratio-level data. In the prior chapter, we discussed some of the issues around this
assumption, especially as it applies to Likert-type data. The assumption stays essentially
unchanged in the ANOVA, where we require continuous dependent variables.

Normality of the dependent variable


Again, this assumption is almost entirely the same as it was in the independent samples
t-test. The ANOVA, like t, assumes that the dependent variable is normally distributed.
As we illustrated in the previous chapter, we can test this by examining skewness and
kurtosis statistics, compared to their standard error. The ANOVA, like t, is relatively
robust against violations of this assumption, and is more robust against skewness than
it is against kurtosis. In other words, moderate deviations from normality on skew will
typically not affect the ANOVA. As is our usual course of action when we see violations of normality (that is, an absolute value of skewness or kurtosis that is more than twice the standard error of skewness or kurtosis, as illustrated in the prior chapter), we will note the violation in our write-up of the results, but it will typically not dissuade us from applying the
ANOVA. In the case of more extreme violations of normality, we would probably select
a different test, such as a nonparametric test (though this book does not cover nonpar-
ametric tests). To protect ourselves from violating this assumption, we would try to use
scales and measures that have been well researched and used with a successful track
record in the literature and would try to maximize our sample sizes.

Observations are independent


As we discussed in the prior chapter, this assumption requires that all observations be
independent of one another. We discussed some cases where this might be questioned
and highlighted the importance of the groups being independent. One challenge in
ensuring independence becomes even more pronounced in the ANOVA. We want to be
sure that there is no crossover influence between the groups. In the example we have used
so far in this chapter, where students in different classes are doing different assignments,
with the goal of reducing racial bias, we have an illustration of this challenge. We would
need to find a way of ensuring that participants are not sharing their experiences across
the treatment groups. In other words, we do not want a student doing the pen pal assign-
ment to talk about that experience to a student doing no special assignments. The reason
is that we might get some confounding effects. The student not doing special assignments
might experience a reduction in bias simply by hearing about the other student’s experi-
ence, for example. We typically try to instruct participants not to share their experiences
until after the final testing is complete, but we might also look for structural safeguards
(like assigning the three classes to three different lunch periods) as well.

Random sampling and assignment


This assumption is also carried over from what we learned with the t-test. Because the gen-
eral linear model (under which all of these tests fall) was built with certain kinds of research
in mind, it makes the assumption that our data are randomly sampled from the population
and randomly assigned to groups. As we previously discovered, the reason for those assump-
tions is because those conditions help us make strong inferences (via random assignment
to groups) and to generalize those inferences (via random sampling). We discussed before that, realistically, random sampling is rarely a feature of educational research. However, as we discussed in prior chapters, we would work to minimize sampling bias.
As pointed out above, in the research design we are contemplating as an example in
this chapter, random assignment to groups is not a possibility because we are giving an
entire classroom of students the same assignment. That will limit our inference some-
what—it will be more difficult to attribute any between-groups differences to the treat-
ment itself, as other systematic differences might exist between the classes.

Homogeneity of variance
We also encountered the assumption of homogeneity of variance in Chapter 6. Here,
we have the assumption that the variances will be equal across all groups. This relates to
the idea that we are expecting our group means to differ but with relatively constant vari-
ance across the groups. This is especially important in the ANOVA design, because when
we calculate the test statistic, we will calculate a within-groups variation. For that cal-
culation to work properly, we need relatively consistent variation in each of our groups.
However, it is worth noting that it is far less common to violate this assumption in the
ANOVA. When we do violate this assumption, it is often due to unbalanced sample
sizes. As we discussed in Chapter 6, a good guideline is that no group should be more
than twice as large as any other group. Because variance is, in part, related to sample
size, groups with very different sample sizes will have different variances. The strongest
protection against failing this assumption is to have balanced group sizes or as close to
balance as possible.

Levene’s test for equality of error variances


Much like with the independent samples t-test, we will use Levene’s test to evaluate the
assumption of homogeneity of variance. It will function similarly to the independent
samples t-test. The null hypothesis for Levene’s test is that the variances are equal across
groups. So, when p < .05, we reject the null and conclude the variances are not equal across
groups—in other words, we’d conclude that we violated the assumption of homogeneity of
variances. When p is greater than or equal to .05, we fail to reject the null, conclude that the
variances are equal (homogeneous) across groups, and that we have met the assumption.

Correcting for heterogeneous variances


Unlike in the independent samples t-test, the correction for heterogeneous variances
(for when we fail to meet the assumption) is not automatically included in our output
and is not quite as simple. We will have to request the correction specifically and have
a few options to choose from. We’ll briefly describe the corrections for heterogeneous
variances in the ANOVA. First, in many cases, even if the Levene’s test statistic is signif-
icant (p < .05), it may be possible to support the assumption of homogeneity of variance
through other means. In order to proceed with an uncorrected F statistic, even though
Levene’s test is significant, the following three conditions must all be met:

1. The sample size is the same for all groups (slight deviations of a few participants are
acceptable).
2. The dependent variable is normally distributed.
3. The largest variance of any group divided by the smallest variance of any group
is less than three. Or, put another way, the smallest variance of any group is more
than 1/3 the largest variance of any group.
If all three conditions are met, there is no need for a correction. However, if one
or more of these conditions was not met, jamovi has a correction available known
as the Welch correction. In fact, in jamovi, it defaults to the Welch correction,
and we must choose the uncorrected (or exact, or Fisher’s) test if it is appropriate.
Note, though, that this option is only available in the One-Way ANOVA menu,
and we normally will choose to use the ANOVA menu as it is more versatile, so
if the correction is needed, we would need to change which part of the program
we use.
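
As a rough illustration of the third condition, the Python sketch below (using made-up group variances rather than data from any real study) checks whether the largest variance is less than three times the smallest:

```python
# Hypothetical sample variances for three groups; condition 3 asks whether
# the largest variance is less than three times the smallest.
variances = [1.10, 0.85, 2.40]
ratio = max(variances) / min(variances)
print(round(ratio, 2), ratio < 3)  # 2.82 True, so condition 3 would be satisfied
```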

CALCULATING THE TEST STATISTIC F


When we learned the independent samples t-test, the test statistic was t. That made some
intuitive sense because the statistic was in the name of the test. As we discussed earlier in
this chapter, ANOVA is not a test statistic. It’s a kind of abbreviation for “analysis of vari-
ance.” In the ANOVA, the test statistic is F. Knowing the test statistic is F is really enough,
but just to satisfy any curiosity, it is F because it’s named after Fisher, who co-created the
test. The F statistic is, like t, a ratio of between-groups variation to within-groups
variation. In the t-test, we just calculated between-groups variation as the mean difference
between the two groups and calculated a standard error statistic as within-groups variation.
In the F test, though, things are a little more complicated. Because we potentially have
more than two groups, we cannot use a mean difference as the numerator like we did
in t. Instead, we will calculate a series of sums of squares to estimate between-groups,
within-groups, and total variation.

Calculating the one-way ANOVA


The ANOVA is calculated from what is called a source table. In the source table (as illus-
trated below), there are several “sources” of variance, degrees of freedom associated with
each “source,” as well as mean squares, and finally the F test statistic:

Source     SS     df     MS     F
Between    SS_B   df_B   MS_B   F
Within     SS_W   df_W   MS_W
Total      SS_T   df_T

As we move forward, we’ll explore how to calculate each of these and the logic behind
the test statistics.

Partitioning variance
The ANOVA is called the “analysis of variance” because it involves partitioning total
variation into different sources of variance. In the one-way ANOVA, those sources are
between-groups and within-groups variance. But what is being partitioned into between
and within variation is the total variance. In the ANOVA, we determine how much total
variation is in the data based on deviations from the grand mean.
The grand mean is simply the mean of all the scores—that is, the overall mean regard-
less of which group a participant belongs to, often written as X̿ (an X with a double bar over it). You might recall that X̄ normally notates a group mean. The second bar indicates this is the grand mean, with
some texts using the phrase “mean of means.” That terminology really only works in
perfectly balanced samples, though, where the grand mean and the mean of the group
means will be equal. However, sometimes knowing that background can make it easier
to remember that the double bar over a variable indicates the grand mean.
How then do we calculate variation from the grand mean? For the purposes of the
ANOVA source table, we’ll be calculating the sum of squares (SS), or sum of the squared
deviation scores. You might recall this in previous chapters where the numerator of the
variance formula was called the sum of squares or sum of the squared deviation scores.
The SST will be calculated based on deviations from the grand mean (just like the SS in
the numerator of variance was calculated from the group means). So, for the ANOVA:

$$SS_T = \sum \left(X - \bar{\bar{X}}\right)^2$$

Returning to our example, where we have students doing three different kinds of course-
work and hope to evaluate if there are any differences in racial stereotypes between students
doing these different kinds of work, imagine that each group has five children. For research
design purposes, five participants per group would not be sufficient (we would normally
want at least 30 per group), but for the purposes of illustrating the calculations, we will stick
to five per group. Below, we illustrate the calculations involved in getting the SST .

Group                 Score    X − X̿      (X − X̿)²
Film Project          3.50     −0.46      0.21
                      3.70     −0.26      0.07
                      4.20      0.24      0.06
                      4.10      0.14      0.02
                      3.80     −0.16      0.03
Pen Pal Assignment    3.60     −0.36      0.13
                      3.40     −0.56      0.31
                      3.10     −0.86      0.74
                      3.90     −0.06      0.00
                      3.30     −0.66      0.44
Normal Coursework     4.30      0.34      0.12
                      4.50      0.54      0.29
                      4.70      0.74      0.55
                      4.40      0.44      0.19
                      4.90      0.94      0.88
                      ∑ = 59.40  ∑ = 0.00   ∑ = 4.04

We would start by calculating the grand mean, which is the total of all scores divided by
the number of participants—in our case 59.4/15 = 3.96. We then take each score minus
the grand mean of 3.96 to get the deviation scores (which will sum to zero as we discov-
ered in prior chapters). Finally, we square the deviation scores and take the sum of the
squared deviation scores. The sum of the squared deviation scores from this procedure
is SST, which is 4.04 in this case.
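
If you want to double-check this arithmetic, the short Python sketch below reproduces the grand mean and SS_T from the fifteen example scores. It is only a check on the hand calculation; nothing like this is required to run the analysis in jamovi.

```python
# The fifteen example scores, in the order they appear in the table above.
scores = [3.50, 3.70, 4.20, 4.10, 3.80,   # Film Project
          3.60, 3.40, 3.10, 3.90, 3.30,   # Pen Pal Assignment
          4.30, 4.50, 4.70, 4.40, 4.90]   # Normal Coursework

grand_mean = sum(scores) / len(scores)                 # 59.4 / 15 = 3.96
ss_total = sum((x - grand_mean) ** 2 for x in scores)  # sum of squared deviations
print(f"grand mean = {grand_mean:.2f}, SS_T = {ss_total:.2f}")
# grand mean = 3.96, SS_T = 4.04
```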

Between-groups and within-groups variance


Both between-groups and within-groups variance are also calculated as sums of squared
deviation scores (SS). The difference in those calculations is what deviations we’re inter-
ested in. Let’s start with the within-groups variance, as it will use a familiar formula:
$$SS_W = \sum \left(X - \bar{X}\right)^2$$

That formula calls for us to sum the squared deviations of scores from their group mean.
In other words, for children in the film project group, we’ll take their scores on the racial
stereotype measure minus the mean score for all children in the film project group. Then,
we will take scores for children in the pen pal group minus the mean of all children in
the pen pal group. Finally, we will take the scores of children in the normal coursework
group minus the mean of all children in the normal coursework group. That means we
will need to calculate a mean for each group, which again will simply be the total of the
scores in that group divided by the number of participants in the group. Below, we have illus-
trated how we would do these calculations for this example:

Group                      Score    X − X̄      (X − X̄)²
Film Project               3.50     −0.36      0.13
  ∑ = 19.30, M = 3.86      3.70     −0.16      0.03
                           4.20      0.34      0.12
                           4.10      0.24      0.06
                           3.80     −0.06      0.00
Pen Pal Assignment         3.60      0.14      0.02
  ∑ = 17.30, M = 3.46      3.40     −0.06      0.00
                           3.10     −0.36      0.13
                           3.90      0.44      0.19
                           3.30     −0.16      0.03
Normal Coursework          4.30     −0.26      0.07
  ∑ = 22.80, M = 4.56      4.50     −0.06      0.00
                           4.70      0.14      0.02
                           4.40     −0.16      0.03
                           4.90      0.34      0.12
                           ∑ = 59.40  ∑ = 0.00   ∑ = 0.95

In this example, then, the SSW = 0.95. We are now 2/3 of the way through calculating the
sums of squares, after which we’ll move to complete the source table.

The final SS term we need to calculate is SSB, which measures between-groups varia-
tion. The formula for this term is:

$$SS_B = \sum \left(\bar{X} - \bar{\bar{X}}\right)^2$$

That is, the sum of the squared deviations of the group mean minus the grand mean.
This point can be confusing at first, because there is only one group mean per group,
but there are multiple participants per group. We will repeat the process for every par-
ticipant in every group, as illustrated below (remember from above that the grand mean
is 3.96):

Group                      Score    X̄ − X̿      (X̄ − X̿)²
Film Project               3.50     −0.10      0.01
  ∑ = 19.30, M = 3.86      3.70     −0.10      0.01
                           4.20     −0.10      0.01
                           4.10     −0.10      0.01
                           3.80     −0.10      0.01
Pen Pal Assignment         3.60     −0.50      0.25
  ∑ = 17.30, M = 3.46      3.40     −0.50      0.25
                           3.10     −0.50      0.25
                           3.90     −0.50      0.25
                           3.30     −0.50      0.25
Normal Coursework          4.30      0.60      0.36
  ∑ = 22.80, M = 4.56      4.50      0.60      0.36
                           4.70      0.60      0.36
                           4.40      0.60      0.36
                           4.90      0.60      0.36
                           ∑ = 59.40  ∑ = 0.00   ∑ = 3.10

As shown above, we took the group mean for each participant minus the grand mean,
giving the deviation score. Then we squared those deviation scores and took the sum of
the squared deviation scores. So, for our example, SSB = 3.10. Let’s go ahead and drop
those SS terms into the source table:

Source     SS            df     MS     F
Between    SS_B = 3.10   df_B   MS_B   F
Within     SS_W = 0.95   df_W   MS_W
Total      SS_T = 4.04   df_T

Because the total variance is partitioned into between and within, we expect SSB + SSW
= SST. In this illustration, we are off by 0.01 because we rounded throughout to the
hundredths place. If we had not rounded (the software will not round), this would be
exactly equal. Next, we’ll move on to fill in the rest of the source table.

Completing the source table


To complete the source table, we need to calculate the degrees of freedom, mean square,
and finally F. The degree of freedom between will be calculated as dfB = k − 1, where k
is the number of groups. For us, there are three groups, so dfB = k − 1 = 3 − 1 = 2. Next,
the degrees of freedom within will be calculated as dfW = n − k, where n is the number of
total participants, and k is the number of groups. So, for our example, where we had 15
total participants across three groups, dfW = n − k = 15 − 3 = 12. Finally, the degrees of
freedom total will be calculated as dfT = n − 1, where n is the total number of participants.
For our example, then, dfT = n − 1 = 15 − 1 = 14. Just like with the SS terms, we expect
between plus within to equal total, or dfB + dfW = dfT. If we try that out on our example,
2 + 12 = 14, so we can see that all of our calculations were correct.
Next up, we need to calculate the mean square (MS) terms. We will have only two: between (MS_B) and within (MS_W). There is no mean square total. Each will be calculated as the SS term divided by the degrees of freedom. So:

$$MS_B = \frac{SS_B}{df_B} \qquad MS_W = \frac{SS_W}{df_W}$$

For our example, this means that MS_B = 3.10/2 = 1.55, and MS_W = 0.95/12 = 0.08.
The final piece of the source table is the F statistic itself. F is equal to the ratio of
between-groups variation (measured as MSB) to within-groups variation (measured as
MSW), so that:

$$F = \frac{MS_B}{MS_W}$$
In our case, that would mean that F = 1.55/0.08 = 19.38. The completed source table for
this example would be:

Source     SS            df          MS            F
Between    SS_B = 3.10   df_B = 2    MS_B = 1.55   F = 19.38
Within     SS_W = 0.95   df_W = 12   MS_W = 0.08
Total      SS_T = 4.04   df_T = 14

Notice that the bottom three cells in the right-hand corner are empty. That pattern will
always be present in every ANOVA design we learn. Finally, we supply below for refer-
ence a source table with the formulas for each term:

Source     SS                    df             MS                   F
Between    SS_B = Σ(X̄ − X̿)²      df_B = k − 1   MS_B = SS_B / df_B   F = MS_B / MS_W
Within     SS_W = Σ(X − X̄)²      df_W = n − k   MS_W = SS_W / df_W
Total      SS_T = Σ(X − X̿)²      df_T = n − 1
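
To pull the whole source table together, the Python sketch below repeats the calculations for the example data without rounding any intermediate values. Because nothing is rounded along the way, SS_W comes out near 0.94 and F near 19.87, which matches the jamovi output shown later in this chapter (the hand calculations above, which round to the hundredths place, give 0.95 and 19.38). As before, this is only an illustrative check, not part of the jamovi workflow.

```python
# One-way ANOVA source table for the example data, computed without rounding.
groups = {
    "Film Project":       [3.50, 3.70, 4.20, 4.10, 3.80],
    "Pen Pal Assignment": [3.60, 3.40, 3.10, 3.90, 3.30],
    "Normal Coursework":  [4.30, 4.50, 4.70, 4.40, 4.90],
}
all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)
k, n = len(groups), len(all_scores)

ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups.values())
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

df_between, df_within = k - 1, n - k
ms_between, ms_within = ss_between / df_between, ss_within / df_within
f_ratio = ms_between / ms_within

print(f"SS_B = {ss_between:.2f}, SS_W = {ss_within:.2f}, SS_T = {ss_total:.2f}")
print(f"df_B = {df_between}, df_W = {df_within}, F = {f_ratio:.2f}")
# SS_B = 3.10, SS_W = 0.94, SS_T = 4.04
# df_B = 2, df_W = 12, F = 19.87
```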

Using the F critical value table


Now, by working through the source table, we have our F test statistic for the example.
However, we do not yet know whether that statistic represents a significant difference or
not. To determine that, we will consult the F critical value table. The F critical value table
is a bit different from the t critical value table in that we now have numerator (between)
degrees of freedom, as well as denominator (within) degrees of freedom. We will look
for the point where the correct column (between df) crosses the correct row (within df),
where we will find our critical value. In our case, at 2 and 12 degrees of freedom, the
critical value for F is 3.89.
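
If a printed critical value table is not handy, the same value can be obtained from a statistics library. For example, in Python (assuming SciPy is installed):

```python
# Critical value of F at alpha = .05 with 2 and 12 degrees of freedom.
from scipy import stats

print(round(stats.f.ppf(0.95, 2, 12), 2))  # about 3.89
```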

F is always a one-tailed test


It is worth pausing here to point out a feature of the ANOVA. The F statistic is always
positive. It would have to be based on the calculations. We cannot have negative
SS terms, so we cannot have a negative F statistic. Here we also see a feature of the
ANOVA that is different from the independent samples t-test: the ANOVA is always
one-tailed. Mathematically, that is because F is always positive. But theoretically, the
F test is not capable of testing directionality. It is only telling us whether there is a dif-
ference among the groups, not testing particular patterns of group differences. As we
discussed earlier in this chapter, the ANOVA is an omnibus test, and as such is always
one-tailed.

Interpreting the test statistic


Interpreting the F test statistic from here is similar to what we have already learned.
If our calculated value (in the example, 19.38) “beats” the critical value (in the exam-
ple, 3.89) we can conclude there was a significant difference. In other words, because
the critical value is the value of F when p is exactly .050, when our calculated value
exceeds that critical value, we know that p < .050 and we can reject the null hypoth-
esis. In our example, because 19.38 is more than 3.89, we conclude that p < .050 and
reject the null hypothesis. We will conclude that there was a significant difference in
racial stereotypes between children getting these three different kinds of assigned
coursework.

EFFECT SIZE FOR THE ONE-WAY ANOVA


As we discussed in the previous chapter, simply knowing a difference is significant is not
enough—we also want to know about the magnitude of the difference. In other words,
we need to calculate and report effect size. In the case of the one-way ANOVA, we will
calculate and report omega squared as the effect size estimate. It is the same statistic we
used for effect size in the independent samples t-test, and it will be interpreted in a sim-
ilar manner but is calculated differently.

Calculating omega squared


In the one-way ANOVA, omega squared will give us the proportion of variance explained
by the grouping variable. In the case of our example, how much variance in racial stere-
otypes is explained by which type of coursework students received? To do that, omega
squared calculates a ratio of between-groups variation adjusted for error, and divides it
by the total variation, again adjusted for error, with the formula:

$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W}$$
All of the necessary information is on our source table, so we can just drop those terms
into the formula and calculate omega squared:

$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W} = \frac{3.10 - (2)(0.08)}{4.04 + 0.08} = \frac{3.10 - .16}{4.12} = \frac{2.94}{4.12} = 0.714$$
Helpfully, as well, jamovi will produce the omega squared statistic in the output, saving
us a step in hand calculations.
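
As one more arithmetic check, dropping the rounded source-table values into the formula gives approximately the same answer; the Python sketch below is again only illustrative:

```python
# Omega squared from the (rounded) source-table values above.
ss_between, ss_total = 3.10, 4.04
df_between, ms_within = 2, 0.08
omega_squared = (ss_between - df_between * ms_within) / (ss_total + ms_within)
print(round(omega_squared, 3))  # about 0.714; jamovi reports .716 because it does not round
```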

Interpreting the magnitude of difference


As with the independent samples t-test, omega squared is interpreted as the proportion
of variance explained by the grouping variable. In our example, ω² = 0.714, so we would
interpret that about 71% of the variance in racial stereotypes was explained by which
type of coursework children were assigned. Just like we discussed in an earlier chapter,
whether 71% is a lot of explained variance or not very much depends on the previous
research on this topic. We would need to read other studies of racial stereotypes in chil-
dren to see what kinds of effect sizes are typical, from which we could judge whether ours
was larger, smaller, or about average compared to prior research. However, it’s worth not-
ing that this would be a very large effect size in most studies, and is probably unrealisti-
cally large based on our contrived data. However, because we interpret it as a percentage
of variance explained, omega squared is fairly directly interpretable and will make some
amount of intuitive sense to most audiences.

DETERMINING HOW GROUPS DIFFER FROM ONE ANOTHER AND INTERPRETING THE PATTERN OF GROUP DIFFERENCES

Given the omnibus test finding (from the F statistic) we know the groups differ. But we do
not yet know how the groups differ or what the pattern of differences looks like. That is
because an omnibus test, while it protects against Type I error, is looking at overall variation,
so it cannot evaluate group-by-group differences. Because of that, we will need a follow-up
analysis to determine how the groups differ and what the patterns are in the data. There are
two ways we can approach this: post-hoc tests or a priori comparisons. Post-hoc (which
literally means after the fact) tests will include comparisons of all pairs of groups. Because
of that, they are sometimes called pairwise comparisons. By contrast, a priori comparisons
test specific combinations of groups or specific patterns of differences. We would want to
use a post-hoc test in a case where we have no clear theoretical model of how the groups
might differ. In other words, we use post-hoc tests when we have no real hypothesis about
how the groups will differ. By contrast, when we have a theory or hypothesis about how the
groups will differ, we would prefer a priori comparisons (sometimes called planned con-
trasts because they are planned ahead of time based on the theory or hypotheses).

Post-hoc tests
We will start by exploring post-hoc tests, or pairwise comparisons. In published research,
post-hoc tests tend to be more common and are many researchers’ default choice. We
will talk more later about why that might not be the right default choice, but post-hoc
tests are quite common. In a post-hoc test, we will test all possible combinations of
groups and interpret the pattern of those comparisons to determine how the groups
differ from one another. They are also called pairwise comparisons because we compare
all possible pairs of groups. In our example, that would mean comparing students doing
the film project to those doing the pen pal project, then comparing those doing the film
project to those doing normal coursework, and finally comparing those doing the pen
pal project to those doing normal coursework. Of course, when we have more than three
groups, we get many more possible pairs, and thus will have far more pairwise compar-
isons. No matter how many groups we have, though, the basic process will be the same.

Calculating Tukey’s HSD


We will start by learning one version of the post-hoc test: Tukey’s HSD. Here, HSD
stands for Honestly Significant Differences, but most people just call it the Tukey test. It
is among the simpler post-hoc tests to calculate, so we will use it to illustrate the way the
post-hoc tests operate. Then we’ll discuss some other post-hoc tests that are available and
how they differ from Tukey’s HSD.
Like other group comparison statistics, Tukey’s HSD (and all of the other available
post-hoc tests) is a ratio of between-groups variation to within-groups variation. As we
have discovered so far, the difference for each test is just how they define and quantify
those terms. In the case of Tukey’s HSD, the numerator, which is between-groups vari-
ation, is simply the mean difference between the groups being compared. The denom-
inator is a standard error term, which is an estimate of within-groups variation. So the
Tukey’s HSD formula is:

$$HSD = \frac{\bar{X}_1 - \bar{X}_2}{s_m}$$

The standard error term, $s_m$, is calculated as:

$$s_m = \sqrt{\frac{MS_W}{N}}$$

In this formula, N is the number of people per group. Because of that, the formula works
as written only if there are the same number of people in each group. In the event the
groups are unbalanced (that is, they do not each have an equal number of participants),
we replace N with an estimate calculated as:
$$N' = \frac{k}{\sum \frac{1}{N}}$$
In this formula, k is the number of groups, and N is the number of people in each group.
In other words, we would divide 1 by the number of people in each group (one at a time),
and sum the result from each group, which becomes the denominator of the equation.
If that feels a little confusing, no need to worry. It is probably sufficient to know that,
when there are unbalanced group sizes, we make an adjustment to the standard error
calculation to account for that. In practice, we’ll usually run this analysis with software,
which will do that correction automatically. For our example data, we have five people
in each group, so no adjustment will be needed. Let us walk through calculat-
ing the Tukey post-hoc test for our example and determine how our three groups differ.
First, we’ll calculate sm for our example, taking MSW from our source table, and replac-
ing N with the number of people per group (we had 5 in all groups):

$$s_m = \sqrt{\frac{MS_W}{N}} = \sqrt{\frac{.08}{5}} = \sqrt{.02} = .14$$

(Here .08/5 = .016 has been rounded to .02 before taking the square root.)
So, the denominator for all of our Tukey post-hoc tests will be .14. Next, we will calculate
our three comparisons.
First, let’s compare students doing the film project (M = 3.86) to those doing the pen
pal assignment (M = 3.46):
$$HSD = \frac{\bar{X}_1 - \bar{X}_2}{s_m} = \frac{3.86 - 3.46}{.14} = \frac{.40}{.14} = 2.86$$
Next, we will compare those doing the film project (M = 3.86) to those doing normal
coursework (M = 4.56):

$$HSD = \frac{\bar{X}_1 - \bar{X}_2}{s_m} = \frac{3.86 - 4.56}{.14} = \frac{-.70}{.14} = -5.00$$
Finally, we will compare those doing the pen pal assignment (M = 3.46) to those doing
normal coursework (M = 4.56):

$$HSD = \frac{\bar{X}_1 - \bar{X}_2}{s_m} = \frac{3.46 - 4.56}{.14} = \frac{-1.10}{.14} = -7.86$$
We can then use the critical value table to determine the critical value for the HSD statis-
tic. Notice that the HSD critical value table has columns based on the number of means
being compared (here, three), and rows based on the dfW (here, 12). Based on that, we
can see the critical value for HSD given three comparisons and 12 dfW is 3.77. Compar-
ing our calculated values to the critical value, we determine that the first comparison is
not significant (because 2.86 is less than 3.77), but the second is (because 5.00 [ignoring
the sign for this comparison] is more than 3.77), and so is the third (because 7.86 [again
ignoring the sign for this comparison] is more than 3.77).
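
The three HSD statistics are easy to reproduce from the group means and the rounded s_m of .14. The Python sketch below simply repeats the hand calculations above:

```python
# Tukey HSD statistics for the three pairwise comparisons, using s_m = .14.
means = {"Film": 3.86, "Pen Pal": 3.46, "Normal": 4.56}
s_m = 0.14
pairs = [("Film", "Pen Pal"), ("Film", "Normal"), ("Pen Pal", "Normal")]
for a, b in pairs:
    hsd = (means[a] - means[b]) / s_m
    print(f"{a} vs. {b}: HSD = {hsd:.2f}")
# Film vs. Pen Pal: 2.86; Film vs. Normal: -5.00; Pen Pal vs. Normal: -7.86
# Compare the absolute values against the critical value of 3.77.
```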

Based on that, we can conclude that there is a significant difference in racial bias between
those doing the film project and the normal coursework, there is a significant difference in
racial bias between those doing the pen pal assignment and normal coursework, but there
is no significant difference in racial bias between those doing the film project and the pen
pal assignment. We can take the final step in this analysis by examining the means to see
that those doing normal coursework (M = 4.56) had significantly higher racial bias than
those doing the pen pal assignment (M = 3.46). Similarly, those doing normal coursework
(M = 4.56) had significantly higher racial bias scores than those doing the film project
(M = 3.86). We know the difference is significant based on the Tukey test result, and we
know that their bias scores are higher by examining the group means. Finally, there was
no significant difference between those doing the pen pal assignment and the film project.
The Tukey HSD post-hoc is only one of the available post-hoc tests. There are many
more available to us. All of them operate on the same basic logic and mathematics as
HSD, but estimate error somewhat differently, so will yield somewhat different test
results. We should also point out that software output for these post-hoc tests will typically report only p values, rather than the test statistics themselves.

Comparison of available Post-hoc tests


There are many other post-hoc tests available, but here we’ll focus on three of them and
how they differ from the Tukey HSD test. We will discuss the Scheffe, Bonferroni, and
LSD (Least Significant Difference) post-hoc tests. They differ in how conservative they
are about error. Each test offers a different level of protection against Type I error. The
trade-off will be that more protection against Type I error means less power and some-
what higher p values. These tests compare as illustrated below:

Test Name     Type I Error Protection    Power
LSD           Lowest                     Highest (produces the lowest p values)
Tukey HSD     Low                        High (produces low p values)
Bonferroni    Moderate                   Moderate (produces moderate p values)
Scheffe       High                       Lower (produces higher p values)

In practical terms, the differences in p values between these tests in most applied research
will be very small, perhaps .020 or less. Of course, with small differences between groups,
that can be the difference between rejecting the null and failing to reject the null (between
determining a difference is significant or not significant). In the table below are the p-val-
ues for our three comparisons for each of the three tests. This illustrates nicely how they
line up in terms of power and error:

Test          Film vs. Pen Pal    Film vs. Normal    Pen Pal vs. Normal
LSD           p = .043            p = .002           p < .001
Tukey         p = .100            p = .005           p < .001
Bonferroni    p = .129            p = .006           p < .001
Scheffe       p = .118            p = .007           p < .001

In this case, our selection of post-hoc test can potentially change our interpretation of
the results. For example, if we had used LSD post-hoc tests, we would conclude
there is a significant difference in bias among students doing the film project vs. those
doing the pen pal assignment. All of the other tests would lead us to conclude there was
no significant difference in those two groups. It is also worth noting that, generally, as the
sample sizes increase, the difference between the test results will become smaller.
The general rule will be to prefer more conservative tests when possible. That is espe-
cially true for research in an established area or confirmatory research. However, when
our sample sizes are smaller, the area of research is more novel, or our work is more
exploratory in nature, we might prefer to use a less conservative test. There are probably
few situations that justify the use of the LSD post-hoc, as it is quite liberal and provides
minimal Type I error protection, for example. But we could select among other tests
based on the research question and other factors. As a final note on selecting the appro-
priate post-hoc test, it is never acceptable to “try out” various post-hoc tests in the same
sample to see which one gives the desired result. We should select the post-hoc test in
advance and apply it to our data whether it gives us the answer we want or not.
There are many more post-hoc tests available than these, too. In jamovi, we will find
several options. The four we outlined here are seen most commonly in educational and
behavioral research, though, and are good general purpose post-hoc tests that will be
appropriate for the vast majority of situations.

Making sense of a pattern of results on the post-hoc tests


Interpreting the individual post-hoc results is relatively simple. If a pairwise comparison
is significant, we can simply examine the group means to determine which group scored
higher. However, we want to take a step beyond that kind of pair-by-pair interpretation
to look for a pattern of results. In our example, we might interpret the pattern as being
that those receiving either racial bias intervention (pen pal assignment or film project)
had lower racial bias scores than those who received normal coursework (i.e., no inter-
vention). Further, we could say that the difference was similar regardless of which racial
bias intervention students received.
What we are attempting to do in this process is to take the pairwise differences and
make sense of them as a pattern. We know that racial bias was lower in the pen pal
assignment group than the normal coursework group, and that racial bias was lower in
the film project group compared to the normal coursework group. Taken together, it is
fair to say that racial bias was lower in those getting either intervention than it was in
those getting no intervention at all.

A priori comparisons
Of course, we might have expected that would be the case. In all likelihood, the reason
we were testing the interventions was because we figured they would reduce racial bias as
compared with normal coursework. If so, we had a theoretical model in mind before we ran
our analysis. However, post-hoc tests do not directly evaluate theoretical models—they are
more like casting a wide net and seeing what comes up. That strategy might be appropriate
when we do not have a theoretical model going in. But when we have a theory beforehand,
we can instead specify a priori comparisons, otherwise known as planned contrasts. As we
introduce this concept, we wish to note that a priori comparisons are tricky to do in jamovi
and easier in some other software packages. Still, we will introduce this type of comparison
conceptually and provide information on how to specify and calculate planned contrasts,
while noting that only certain sets of contrasts are possible in jamovi.

Introducing planned contrasts


Planned contrasts or a priori comparisons allow us to specify how we think the groups
will differ beforehand. For example, we might specify beforehand that we think scores
will be different in the normal coursework group as compared to the two intervention
groups. We could specify a second planned contrast to determine whether there was
any difference between the two types of intervention. Because this is done via statistical
analysis, we will have to quantify our planned differences.

Setting coefficients for orthogonal contrasts


The way we quantify our planned contrasts is by setting coefficients for them. We will
assign negative values to one side of the comparison, and positive values to the other
side. The coefficients should sum to zero, because our null hypothesis is a zero difference. For
example, I might specify a coefficient of 1.0 for the film group, 1.0 for the pen pal group,
and −2.0 for the normal coursework group. Doing so sets up a contrast to test if there is a
difference in normal coursework as compared to the combination of the two intervention
types. I could specify a second set of contrasts to compare the two kinds of interventions,
where I might give a coefficient of 1.0 to the film group, and −1.0 to the pen pal group.
One additional consideration in creating the coefficients is that they need to be
orthogonal. That means that we want sets of coefficients that are not mathematically
related. One way to check this quickly and relatively easily is to multiply them across the
set of coefficients. The products should come to zero. This concept might make more
sense in an example, so taking our planned contrasts for our racial bias example, we
could put them in a table as:
Group               Contrast 1    Contrast 2    Contrast 1 × Contrast 2
Film                1.0           1.0           1.0
Pen Pal             1.0           −1.0          −1.0
Normal Classroom    −2.0          0.0           0.0
Sum                 0.0           0.0           0.0

For individual contrasts to work, as discussed earlier, the coefficients should sum to zero.
But for the coefficients to be orthogonal, we want the product of the coefficients to sum
to zero as well. This can take a little careful planning, but the set of contrasts here are
fairly common for three groups when one group is a control condition (a group that gets
no intervention).
Finally, how many contrasts should we specify? The answer is k − 1. For the contrasts
to be orthogonal (which we need them to be), we need to specify one fewer contrast than
we have number of groups. So, for our case, where there are three groups, we should
specify two contrasts, as we’ve done in the above example.
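
A quick way to confirm that a set of contrasts is orthogonal is to check both conditions directly. The brief Python sketch below (illustrative only) uses the coefficients from the table above:

```python
# Each contrast should sum to zero, and the products of the coefficients
# should also sum to zero if the two contrasts are orthogonal.
contrast_1 = [1.0, 1.0, -2.0]   # Film, Pen Pal, Normal Coursework
contrast_2 = [1.0, -1.0, 0.0]
print(sum(contrast_1), sum(contrast_2))                    # 0.0 0.0
print(sum(a * b for a, b in zip(contrast_1, contrast_2)))  # 0.0
```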

Calculating planned contrasts in the ANOVA model


To understand the calculation of these planned contrasts, it will be helpful to understand
their null and alternative hypotheses. For our first planned contrast as specified above, the hypotheses would be:

$$H_0: (1)(M_{Film}) + (1)(M_{Pen\ Pal}) = (2)(M_{Normal\ Coursework})$$

$$H_1: (1)(M_{Film}) + (1)(M_{Pen\ Pal}) \neq (2)(M_{Normal\ Coursework})$$

And for our second contrast, the hypotheses would be:

$$H_0: (1)(M_{Film}) = (1)(M_{Pen\ Pal})$$

$$H_1: (1)(M_{Film}) \neq (1)(M_{Pen\ Pal})$$
In calculating these comparisons, we will use an ANOVA source table, where the between
variation is broken down into variation attributable to the two contrasts. That is why it is
so important the contrasts be orthogonal, because otherwise the between variation will
not be completely partitioned into the two contrasts. So, our new source table looks like:

Source        SS                  df             MS                      F
Contrast 1    SS_C1 = nψ²/Σc²     df_C1 = 1      MS_C1 = SS_C1 / df_C1   F = MS_C1 / MS_W
Contrast 2    SS_C2 = nψ²/Σc²     df_C2 = 1      MS_C2 = SS_C2 / df_C2   F = MS_C2 / MS_W
Within        SS_W = Σ(X − X̄)²    df_W = n − k   MS_W = SS_W / df_W
Total         SS_T = Σ(X − X̿)²    df_T = n − 1

While we have reproduced the formulas for the within and total lines of the source table,
we already calculated them for the omnibus test. Those terms do not change. All that is
happening in the a priori comparisons is we’re breaking down the between variation into
variance attributable to our planned contrasts.
The only new bit of calculation here is in the Sum of Squares column for our contrasts.
For each contrast, we will calculate the SS term using the formula, which has some unfa-
miliar elements in it. But it is a relatively simple calculation. The ψ term in the numer-
ator is calculated by multiplying all group means by their coefficients and adding them
together. For our contrasts:

$$\psi_{C1} = (1)(3.86) + (1)(3.46) + (-2)(4.56) = 3.86 + 3.46 - 9.12 = -1.80$$

$$\psi_{C2} = (1)(3.86) + (-1)(3.46) + (0)(4.56) = 3.86 - 3.46 + 0 = 0.40$$

The other term that is new for us is Σc², but it is simply the sum of the squared coefficients. For our two contrasts:

$$\sum c^2_{C1} = (1)^2 + (1)^2 + (-2)^2 = 1 + 1 + 4 = 6$$

$$\sum c^2_{C2} = (1)^2 + (-1)^2 + (0)^2 = 1 + 1 + 0 = 2$$

So, then, using the full formula, we can determine the SS for our two contrasts:

$$SS_{C1} = \frac{n\psi^2}{\sum c^2} = \frac{5(-1.80)^2}{6} = \frac{5(3.24)}{6} = \frac{16.20}{6} = 2.70$$

$$SS_{C2} = \frac{n\psi^2}{\sum c^2} = \frac{5(.40)^2}{2} = \frac{5(.16)}{2} = \frac{.80}{2} = .40$$
Finally, we can complete our source table and calculate F statistics for the two contrasts:

Source        SS             df           MS                      F
Contrast 1    SS_C1 = 2.70   df_C1 = 1    MS_C1 = 2.70/1 = 2.70   F = 2.70/.08 = 33.75
Contrast 2    SS_C2 = 0.40   df_C2 = 1    MS_C2 = 0.40/1 = 0.40   F = 0.40/.08 = 5.00
Within        SS_W = 0.95    df_W = 12    MS_W = 0.08
Total         SS_T = 4.04    df_T = 14

At 1 numerator df (the df for the contrast) and 12 denominator df, the critical value is
4.75. In both cases, our calculated value exceeds the critical value, so we can conclude
that p < .05, we can reject the null hypothesis, and conclude that there is a significant
difference based on these two contrasts.
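
The contrast calculations can also be checked quickly outside of jamovi. The Python sketch below reproduces ψ, the SS terms, and the F statistics for both contrasts from the rounded group means and MS_W used above:

```python
# Planned-contrast F statistics from the rounded values in the text.
means = [3.86, 3.46, 4.56]                # Film, Pen Pal, Normal Coursework
n_per_group, ms_within = 5, 0.08
contrasts = {"Contrast 1": [1, 1, -2], "Contrast 2": [1, -1, 0]}

for name, coefficients in contrasts.items():
    psi = sum(c * m for c, m in zip(coefficients, means))
    ss_contrast = n_per_group * psi ** 2 / sum(c ** 2 for c in coefficients)
    f_statistic = ss_contrast / ms_within   # each contrast has 1 degree of freedom
    print(f"{name}: psi = {psi:.2f}, SS = {ss_contrast:.2f}, F = {f_statistic:.2f}")
# Contrast 1: psi = -1.80, SS = 2.70, F = 33.75
# Contrast 2: psi = 0.40, SS = 0.40, F = 5.00
```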

Interpreting planned contrast results


We interpret the results of this follow-up analysis based on our planned contrast coeffi-
cients. In Contrast 1 above, we specified a difference between the film group and pen pal
group as compared to the normal coursework group. We found that such a difference
exists and is statistically significant. We can take one step further, though. Notice that the
coefficients for the pen pal and film groups were positive, and for the normal coursework
group the coefficient was negative. Because ψ (our weighted group mean difference) is
negative, we know that the normal coursework group had the higher score. Of course, we
could also examine the means of the groups and come to the same conclusion. Similarly,
in Contrast 2, we had the film group with a positive coefficient and the pen pal group with
a negative coefficient. In that comparison, ψ was positive, meaning the film project group had
higher scores. Again, though, the simpler path for interpretation will be to simply look at
the group means for significant comparisons, and interpret the pattern directly from those.

COMPUTING THE ONE-WAY ANOVA IN JAMOVI


Now that we have explored the mechanics and use of the ANOVA, we will follow our
example through in jamovi as well. First, we’ll show how to calculate the test with post-
hoc comparisons. Then we will learn how to do the same test with a priori comparisons.
Finally, we will write up the results of our example scenario.

Computing the one-way ANOVA with post-hoc tests in jamovi


We will begin by creating a new data file for our example data. In jamovi, we will simply
open a new window. Alternatively, you can click the symbol in the upper left corner, then
“New” to create a blank data file. Remember that the initial view has data tab and analy-
sis tab. We’ll start in the data tab and set up our variables. We’ll need two in this case: one
for the racial bias test score (we will call RacialBias) and one for the group membership
(we will call Group). We can also add labels to those variables to make them easier to
remember later on.

After typing in our data, we can also assign group labels in the Setup window for the
variable Group by adding a group label for each of the three groups.

Our data file now shows all of the scores with group labels.

Next, to run the ANOVA, we will go to the Analysis tab at the top of the screen, then
select ANOVA, and then on the sub-menu, click ANOVA. Note that there is also a spe-
cialized menu for the one-way ANOVA, which would work for this case. However, we’ll
demonstrate using the ANOVA sub-menu because it is a bit more versatile and has some
options that are useful.

In this case, racial bias scores are the dependent variable, so we will click on that variable,
and then click the arrow to move it to the Dependent Variable spot. Group is the inde-
pendent variable, which jamovi labels the Fixed Factor. So, we will click on Group, then
click the arrow to move it to Fixed Factors (plural because later designs will allow more
than one independent variable). In the same setup window, under Effect Sizes, we can
check the box for ω2 to produce omega squared.

Next, under Assumption Checks, we can check the box for Equality of Variances to pro-
duce Levene’s test. Then, under Post-Hoc Tests, we select Group on the left column, and
move it using the arrow button to the right column (which sets it as a variable to com-
pare using a post-hoc test). We can then choose the error correction, with the options
being None (LSD comparison), Tukey, Scheffe, Bonferroni, and one we haven’t discussed
called Holm. Earlier in the chapter, we provided a comparison of the most popular post-
hoc tests to help with deciding which test to use. For this example, we will use the more
conservative Scheffe test. So, we’ll check the Scheffe box and uncheck the Tukey box.

On the right side of the screen, the output has updated in real time as we choose our
options and settings. The first piece of output is the ANOVA summary table.

Just below that is the Levene’s test for homogeneity of variance.



On Levene’s test, we see that F2, 12 = .150, p = .862. Because p > .05, we fail to reject
the null hypothesis on Levene’s test. Recall that the null hypothesis for Levene’s test
is that the variances are equal (or homogeneous), so this means that the data met the
assumption. In other words, we have met the assumption of homogeneity of variance.
Because of that, we’re good to use the standard ANOVA F ratio, and will not need any
correction.
Notice that this summary table includes all of the information (SS, df, MS, and F) that
we calculated by hand, plus the probability (labelled “p” in the output). Because jamovi
provides the exact probability (here, p < .001), we do not need to use the critical value.
Instead, if p < .05, we reject the null hypothesis and conclude there is a significant differ-
ence. Notice, too, that jamovi labels the “within” or “error” term as “residuals,” and it does
not supply the “total” terms. However, we could easily calculate the total sum of squares
and degrees of freedom by adding together the between and within terms (here labelled
Group and Residuals). Looking at the ANOVA summary table, we see that F2, 12 = 19.872,
p < .001, so there was a significant difference between groups. The output also has omega
squared, from which we can determine that about 72% of the variance in racial bias was
explained by which project group students were assigned to (ω2 = .716). Because the
ANOVA is an omnibus test, we then need to evaluate how the three groups differed, in
this case using a post-hoc test, the results of which appear next in the output.

This table shows all possible comparisons, and gives the mean difference, SE (standard
error of the difference), df (degrees of freedom), a t statistic, and p (the probability,
which is followed by a suffix for which correction was used, so in our case, it reads
p_scheffe). For each comparison, the two groups are significantly different if p < .05. While
the table provides a t statistic with degrees of freedom, it is not uncommon to see
researchers report only the probability values for this statistic. In fact, in other software
packages (such as the commonly used SPSS package), no test statistic is even provided
in the output (Strunk & Mwavita, 2020). For practical purposes, we’ll interpret these
comparisons based on the probabilities, interpreting those with p < .05 as significant
differences. So, for our purposes, here there is no significant difference when compar-
ing those doing the film project to those doing the pen pal assignment (p = .118), a
significant difference when comparing those doing the film project with those doing
normal coursework (p = .007) and a significant difference between those doing the
pen pal project and those doing normal coursework (p < .001). Knowing that there is
a difference when comparing normal coursework to either the film project or the pen
pal assignment, we can examine the means to determine what that pattern of difference
is. As we discovered above, the pattern is that those doing normal coursework have
higher bias scores than those doing either of the assignments designed to reduce bias
(film or pen pal).
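For readers who prefer to verify jamovi’s output with code, the sketch below (ours, not the book’s) shows how the same omnibus F, Levene’s test, and post-hoc comparisons could be produced with Python’s scipy. The five scores per group are placeholders, since the raw data appear earlier in the chapter; substitute the actual values.

```python
# Minimal sketch (ours): reproducing the omnibus ANOVA and Levene's test with
# scipy. The score lists are placeholders -- substitute the chapter's actual data.
from scipy import stats

film    = [3.5, 3.9, 4.0, 3.8, 4.1]   # placeholder scores
pen_pal = [3.2, 3.4, 3.6, 3.5, 3.6]   # placeholder scores
normal  = [4.3, 4.5, 4.6, 4.7, 4.7]   # placeholder scores

# Levene's test for homogeneity of variance (p > .05 means the assumption holds);
# center="mean" gives the classic Levene test (scipy's default uses the median).
lev_f, lev_p = stats.levene(film, pen_pal, normal, center="mean")
print(f"Levene: F = {lev_f:.3f}, p = {lev_p:.3f}")

# One-way omnibus ANOVA
f_stat, p_val = stats.f_oneway(film, pen_pal, normal)
print(f"ANOVA: F(2, 12) = {f_stat:.3f}, p = {p_val:.4f}")

# Post-hoc comparisons: scipy offers Tukey's HSD (the Scheffe correction used in
# the jamovi example is not built in, so the probabilities will differ somewhat).
print(stats.tukey_hsd(film, pen_pal, normal))
```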
We likely also want to produce descriptive statistics by group (we demonstrated pro-
ducing descriptive statistics overall, including normality tests, in previous chapters). To
do so, we will select the Analyses tab, then Exploration, and then Descriptives. We’ll
select RacialBias and click the arrow to move it to the Variables box. Then we will select
Group, and click the arrow to move it to the Split by box. That will produce output split
by group, so we’ll get descriptive statistics for each of the three groups. Under Statistics,
we will uncheck most of the boxes, while checking the boxes for “N” (which gives the
number of people per group), Mean, and Std. deviation. We could also check any other
boxes for statistics that are relevant.

These descriptive statistics will be helpful in interpreting the pattern of differences. For
example, here we previously found that there was a significant difference between the
Normal Coursework group and both the Film Project and Pen Pal Assignment group.
Here we see that the Normal Coursework group had a higher mean (M = 4.560, SD =
.241) than either the Film Project group (M = 3.860, SD = .288) or the Pen Pal Assignment
group (M = 3.460, SD = .305). We know that difference is statistically significant from the
post-hoc tests, so we can support an inference that the pen pal assignment and film project
were associated with significantly lower racial bias scores than normal coursework.
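If we wanted the same split-by-group descriptives outside jamovi, a short pandas sketch (ours; the scores below are placeholders for the chapter’s data) would look like this:

```python
# Minimal sketch (ours): descriptive statistics split by group, mirroring
# jamovi's Exploration > Descriptives with a "Split by" variable.
import pandas as pd

df = pd.DataFrame({
    "Group": ["Film Project"] * 5 + ["Pen Pal Assignment"] * 5 + ["Normal Coursework"] * 5,
    "RacialBias": [3.5, 3.9, 4.0, 3.8, 4.1,    # placeholder scores --
                   3.2, 3.4, 3.6, 3.5, 3.6,    # substitute the chapter's data
                   4.3, 4.5, 4.6, 4.7, 4.7],
})

print(df.groupby("Group")["RacialBias"].agg(["count", "mean", "std"]).round(3))
```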

Computing the one-way ANOVA with a priori comparisons in jamovi


If, instead of post-hoc tests, we had decided to use a priori comparisons, the entire pro-
cess of producing the test would be the same, except that instead of clicking Post Hoc
in the main ANOVA menu, we’d click Contrasts. One of jamovi’s limitations (we use the
term limitation lightly here as the software is free to use) is that it does not allow custom
contrasts. To do fully custom contrasts, we’d need to use another program like SPSS
(which we cover in our other textbook; Strunk & Mwavita, 2020) or R. However, jamovi
does have a set of pre-determined contrasts:

• Deviation: Compares each group to the grand mean, omitting the first group. In
our example, it will produce a comparison of Pen Pal versus the grand mean, and
Normal Coursework versus the grand mean.
• Simple: Compares each group to the first group. In our example, it will compare
Pen Pal versus Film, and Normal Coursework versus Film.
• Difference: Compares each group with the mean of previous groups. In this case,
it will compare Pen Pal versus Film, and then will compare Normal Coursework
versus the average of Pen Pal and Film.
• Helmert: Compares each group with the average of subsequent groups. In this case,
it will compare Film versus the average of Pen Pal and Normal Coursework, and
will then compare Pen Pal and Normal Coursework.
• Repeated: Compares each group to the subsequent groups. In this case, it will com-
pare Film versus Pen Pal, and then Pen Pal versus Normal Coursework.

Unfortunately, none of these options is particularly adept at producing the kinds of


contrasts we described earlier in the chapter, though ordering the groups in particular
ways may make those comparisons possible. Because of that, we here simply note how
jamovi handles the various possible planned contrasts and suggest that jamovi users will
typically find the post-hoc tests most useful. However, in designs that clearly call for an
a priori hypothesis about the differences between groups, it may be appropriate to use
another software package to produce the comparisons, or to calculate them by hand.
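To make the preset options above more concrete, the short sketch below (ours, not jamovi output) computes the comparison each option describes from the example group means, with the groups ordered Film, Pen Pal, Normal Coursework; sign conventions and scaling may differ from jamovi’s own output.

```python
# Minimal sketch (ours): the comparisons the preset contrast options describe,
# computed from the example group means (Film, Pen Pal, Normal Coursework).
import numpy as np

film, penpal, normal = 3.86, 3.46, 4.56
grand = np.mean([film, penpal, normal])

comparisons = {
    "Deviation: Pen Pal vs. grand mean":          penpal - grand,
    "Deviation: Normal vs. grand mean":           normal - grand,
    "Simple: Pen Pal vs. Film":                   penpal - film,
    "Simple: Normal vs. Film":                    normal - film,
    "Difference: Pen Pal vs. Film":               penpal - film,
    "Difference: Normal vs. mean(Film, Pen Pal)": normal - np.mean([film, penpal]),
    "Helmert: Film vs. mean(Pen Pal, Normal)":    film - np.mean([penpal, normal]),
    "Helmert: Pen Pal vs. Normal":                penpal - normal,
    "Repeated: Film vs. Pen Pal":                 film - penpal,
    "Repeated: Pen Pal vs. Normal":               penpal - normal,
}

for label, value in comparisons.items():
    print(f"{label}: {value:.3f}")
```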

WRITING UP THE RESULTS


Finally, we need to write up our ANOVA results. As we did with prior tests, we will first
provide a general outline for the results write-up, and then walk through that process
with our example. The general form for an ANOVA results section will be:

1. What test did we use, and why?


2. If there are issues with the assumptions, report them and any appropriate
corrections.
3. What was the result of that test?

4. If the test was significant, what is the effect size? (If the test was not significant,
simply report effect size in #3.)
5. If the test was significant, report your follow-up analysis (post-hoc or a priori).
6. What is the pattern of group differences?
7. What is your interpretation of that pattern?

Compared with our suggestions for the independent samples t-test, we are suggesting
a slightly longer results section for the one-way ANOVA. That’s because the one-way
ANOVA has more information we can glean and involves the added layer of follow-up
analysis. The write-up will be slightly different depending on whether we use a post-hoc
test or a priori comparisons. Because jamovi is limited in conducting a priori compari-
sons, we will demonstrate only the post-hoc test writing process:

Writing the one-way ANOVA with post-hoc tests


For our example, we will walk through these pieces:
1. What test did we use, and why?
We used a one-way ANOVA to determine if students’ racial bias levels would differ
based on completing normal coursework, a film project, or a pen pal project.
2. If there are issues with the assumptions, report them and any appropriate
corrections.
In our case, we meet the statistical assumptions. As described in prior chapters,
we can evaluate the dependent variable for normality, and find that it is normally
distributed (skew = .143, SE = .580; kurtosis = −.963, SE = 1.121) because both skew
and kurtosis are less than twice their standard errors. Based on Levene’s test, we also
conclude that the group variances are homogeneous (F2, 12 = .150, p = .862). As we
discussed above, there are some concerns about the design assumptions. However,
typically we’ll focus more on design issues in the discussion section (specifically in
limitations) and only report on statistical assumptions in the results section.

Results

We used a one-way ANOVA to determine if students’ racial bias levels


would differ based on completing normal coursework, a film project, or
a pen pal project. There was a significant difference in racial bias scores
based on which assignment students completed (F2, 12
= 19.872,
p < .001). About 72% of the variance in racial bias scores was explained


by which assignment students completed (ω2 = .716). We used Scheffe


post-hoc tests to determine how the groups differed. There was a signifi-
cant difference between those doing the film project and those doing
normal coursework (p = .007) and between those doing the pen pal
assignment and those doing normal coursework (p < .001). However,
there was no significant difference between those doing the film project
and those doing the pen pal assignment (p = .118). Those doing normal
coursework had higher racial bias scores as compared to those doing
either the film project or the pen pal assignment, although racial bias
scores were similar among those doing the film project and pen pal
assignment. See Table 8.1 for descriptive statistics by group. Among
this sample of students, both of the interventions aimed at reducing
racial bias were associated with lower racial bias scores as compared
with normal coursework, but there was no meaningful difference
between the two interventions.

Table 8.1
Descriptive Statistics by Group

Group                 N      M       SD      SE
Film Project          5      3.860   .288    .129
Pen Pal Assignment    5      3.460   .305    .136
Normal Coursework     5      4.560   .241    .108
Total                 15     3.960   .537    .139

3. What was the result of that test?


There was a significant difference in racial bias scores based on which assignment
students completed (F2, 12 = 19.872, p < .001).
4. If the test was significant, what is the effect size? (If the test was not significant,
simply report effect size in #3.)
About 72% of the variance in racial bias scores was explained by which assignment
students completed (ω2 = .716).
5. If the test was significant, report your follow-up analysis (post-hoc or a priori).
We used Scheffe post-hoc tests to determine how the groups differed. There was a
significant difference between those doing the film project and those doing normal
coursework (p = .007) and between those doing the pen pal assignment and those
doing normal coursework (p < .001). However, there was no significant difference be-
tween those doing the film project and those doing the pen pal assignment (p = .118).
6. What is the pattern of group differences?
Those doing normal coursework had higher racial bias scores as compared to those
doing either the film project or the pen pal assignment, although racial bias scores
were similar among those doing the film project and pen pal assignment.
7. What is your interpretation of that pattern?
Among this sample of students, both of the interventions aimed at reducing racial
bias were associated with lower racial bias scores as compared with normal course-
work, but there was no meaningful difference between the two interventions.
Finally, we can put these pieces together into the short results section shown in the box
above, and add the table of descriptive statistics (Table 8.1). We could also opt to include
means and standard deviations in the text rather than make a table.
9
One-way ANOVA case studies

Case study 1: first-generation students’ academic success 145


Research questions 146
Hypotheses 146
Variables being measured 146
Conducting the analysis 147
Write-up 148
Case study 2: income and high-stakes testing 150
Research questions 150
Hypotheses 150
Variables being measured 151
Conducting the analysis 151
Write-up 152
Notes 154

In the previous chapter, we explored the one-way ANOVA using a made-up example
and some fabricated data. In this chapter, we will present several examples of published
research that used the one-way ANOVA. For each case study, we encourage you to:

1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the one-way ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects’ data but have
been simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.

CASE STUDY 1: FIRST-GENERATION STUDENTS’ ACADEMIC SUCCESS


Kim, A. S., Choi, S., & Park, S. (2018). Heterogeneity in first-generation college students
influencing academic success and adjustment to higher education. Social Science Journal.
Advance online publication. https://doi.org/10.1016/j.soscij.2018.12.002.


In this paper, the authors investigated how several variables might differ among three
groups of first-generation college students: (1) first-generation college students with an
older sibling who attended college (FGCS-OS); (2) continuing-generation college stu-
dents (that is, first-generation students with at least one parent who completed some col-
lege but not a degree; CSGS); and (3) first-generation college students who are the first
in their family to attend college at all (F-FGCS). The researchers tested five outcomes, of
which we will highlight two: peer support and institutional support.
Of special note in this case: The authors used five one-way ANOVAs to test difference
in five dependent variables. Doing so, as we’ve highlighted in prior chapters, will inflate
the Type I error rate. At a minimum, the use of multiple univariate tests like the ANOVA
should result in reducing the critical probability value. We normally make that adjust-
ment using the Bonferroni inequality, which involves dividing the critical probability
value by the number of tests. In this case, we would divide .05 (critical probability value,
or alpha) by the number of tests, five, resulting in an adjusted alpha of .01. However, we
also wish to stress that when a design calls for multiple univariate tests, like the ANOVA,
researchers should consider whether a multivariate analysis might be more appropriate.

Research questions
Because we are focusing on only two of the ANOVAs these authors used, we will focus
on these two research questions:

1. Was there a mean difference in peer support among the three groups of first-gen-
eration college students (FGCS-OS, CSGS, and F-FGCS)?
2. Was there a mean difference in institutional support among the three groups of
first-generation college students (FGCS-OS, CSGS, and F-FGCS)?

Hypotheses
The authors hypothesized the following related to peer support:

H0: There was no statistically significant difference among FGCS-OS, CSGS, and


F-FGCS in peer support. (MFGCS-OS = MCSGS = MF-FGCS)
H1: There was a statistically significant difference among FGCS-OS, CSGS, and
F-FGCS in peer support. (MFGCS-OS ≠ MCSGS ≠ MF-FGCS)

The authors hypothesized the following related to institutional support:

H0: There was no statistically significant difference among FGCS-OS, CSGS, and


F-FGCS in institutional support. (MFGCS-OS = MCSGS = MF-FGCS)
H1: There was a statistically significant difference among FGCS-OS, CSGS, and
F-FGCS in institutional support. (MFGCS-OS ≠ MCSGS ≠ MF-FGCS)

Variables being measured


Both dependent variables were measured using Likert-type scales. For peer support,
there were five items, and for institutional support, there were ten items. In both cases,
the item scores were summated to create the scale score. The authors reported
coefficient alpha reliabilities of .85 for peer support and .89 for institutional
support, both of which are in the acceptable range. The authors also provided evidence of
content validity in describing how the items were developed and their content assessed
by experts. The independent variable was measured based on questions about who in the
participants’ families had attended college.

Conducting the analysis


1. What test did they use, and why?
The authors used two one-way ANOVAs to determine whether peer support and
institutional support would significantly differ among three groups of first-gener-
ation college students: (1) first-generation college students with an older sibling
who attended college (FGCS-OS); (2) continuing-generation college students (that
is, first-generation students with at least one parent who completed some college
but not a degree; CSGS); and (3) first-generation college students who are the first
in their family to attend college at all (F-FGCS). Because we used two ANOVAs,
we set alpha at .025 using the Bonferroni inequality to control for familywise error.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variables for both tests were summated Likert-type data.
Those data are normally treated as interval-level data.
b. Normality of the dependent variable
The authors in this manuscript did not report information on the normality
of the dependent variable. In the simulated dataset in the online resources, the
data will be perfectly normal. However, in most cases, published papers will
not report information on normality if this assumption was met. It is more
typical to see a discussion of normality only when the data were not normally dis-
tributed. So, it is probably safe to infer from the fact the authors did not report
normality that the data were normally distributed. To test for normality, we
would use skewness and kurtosis statistics as compared to their standard error,
as described in Chapter 3.
c. Observations are independent
The authors did not note any concerns with independence. They surveyed only
one student per family, so there is no concern for family nesting effects. There
might be potential nesting factors in things like residence halls, academic
majors, or social organizations, but those are not accounted for in this design.
d. Random sampling and assignment
The sample is not random. All participants were from a single regional uni-
versity in the midwestern United States. The authors do not specify, but this
is likely a convenience sample. Participants were also not randomly assigned
to groups—first-generation college status is an intact grouping factor. That
is, researchers did not assign students to first-generation status, so it will be
difficult to attribute any differences between groups as being caused by the
first-generation groups.
e. Homogeneity of variance
Using Levene’s test, we can determine that this assumption was met for peer sup-
port (F2, 354 = .424, p = .655) and for institutional support (F2, 354 = .260, p = .771).

3. What was the result of that test?


There was a significant difference between the three groups of students in peer
support (F2, 354 = 6.406, p = .002), but no significant difference in institutional sup-
port (F2, 354 = 2.747, p = .065).1
4. What was the effect size, and how is it interpreted?
For peer support:

$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W} = \frac{500.122 - (2)(39.036)}{14318.693 + 39.036} = \frac{500.122 - 78.072}{14357.792} = \frac{422.050}{14357.792} = .029$$

For institutional support:


$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W} = \frac{604.356 - (2)(109.990)}{39540.821 + 109.990} = \frac{604.356 - 219.980}{39650.811} = \frac{384.376}{39650.811} = .010$$
About 3% of the variance in peer support was explained by the groups of first-
generation students. Because there was no significant difference in institutional
support, we would not interpret that effect size estimate; instead we would
report it alongside F and p, noting that it would indicate about 1% of the variance in
institutional support was explained.2 (A short scripted version of this calculation
appears after this list.)
5. What is the appropriate follow-up test, if any?
To determine how peer support varied among the three groups of first-generation
students, we used Scheffe post-hoc tests.
6. What is the pattern of group differences?
CSGS scored significantly higher than F-FGCS students (p = .002). However,
there was no significant difference between CSGS and FGCS-OS (p = .650) or
between FGCS-OS and F-FGCS (p = .144).
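The omega-squared formula used in step 4 is easy to script; the helper below is our own sketch (not from the published study) and simply plugs in the summary values reported above.

```python
# Minimal sketch (ours): omega squared from ANOVA summary values,
# using the formula shown in step 4 above.
def omega_squared(ss_between: float, df_between: int,
                  ms_within: float, ss_total: float) -> float:
    """omega^2 = (SS_B - df_B * MS_W) / (SS_T + MS_W)"""
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(round(omega_squared(500.122, 2, 39.036, 14318.693), 3))   # peer support, ~.029
print(round(omega_squared(604.356, 2, 109.990, 39540.821), 3))  # institutional support, ~.010
```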

Write-up

Results

We used two one-way ANOVAs to determine whether peer support and


institutional support would significantly differ among three groups of
first-generation college students: (1) first-generation college students
with an older sibling who attended college (FGCS-OS); (2) continuing-
generation college students (that is, first-generation students with at least


one parent who completed some college but not a degree; CSGS); and (3)
first-generation college students who are the first in their family to attend
college at all (F-FGCS). Because we used two ANOVAs, we set alpha at
.025 using the Bonferroni inequality to control for familywise error.
There was a significant difference between the three groups of students in
peer support (F2, 354 = 6.406, p = .002), but no significant difference in
institutional support (F2, 354 = 2.747, p = .065, ω2 = .010). About 3% of the
variance in peer support was explained by the groups of first-generation
students. To determine how peer support varied among the three groups
of first-generation students, we used Scheffe post-hoc tests. CSGS scored
significantly higher than F-FGCS students (p = .002). However, there
was no significant difference between CSGS and FGCS-OS (p = .650) or
between FGCS-OS and F-FGCS (p = .144). See Table 9.1 for descriptive
statistics by group. Among the present sample, CSGS students scored
significantly higher in peer support than F-FGCS students, but otherwise
there were no differences in peer or institutional support.

Table 9.1
Descriptive Statistics by Group

                     Peer Support          Institutional Support
Group        N       M         SD          M         SD
CSGS         218     25.990    6.090       51.600    10.390
FGCS-OS      61      25.150    6.490       51.800    11.250
F-FGCS       78      23.040    6.490       48.500    10.140
Total        357     25.202    6.342       50.957    10.539

In APA style, tables go after the reference page, with each table starting on a new page.
For this example, we might make a table such as Table 9.1, shown above.

CASE STUDY 2: INCOME AND HIGH-STAKES TESTING


Fischer, C., Fishman, B., Levy, A., Dede, C., Lawrenze, F., Jia, Y., Kook, K., &
McCoy, A. (2016). When do students in low-SES schools perform better-than-ex-
pected on high-stakes tests? Analyzing school, teacher, teaching, and profes-
sional development. Urban Education. Advance online publication. https://doi.
org/10.1177/0042085916668953.
In this article, as in the prior case study in this chapter, the ANOVA was part of a
larger study, but we focus our attention on the part of the study that used a one-way
ANOVA. The authors wanted to understand how students in low-income schools per-
form on high-stakes standardized tests. They explored how school characteristics, teach-
ing and teacher variables, and professional development were associated with students’
performance on advanced placement (AP) tests in biology and chemistry among a sam-
ple of low-income schools. They used one-way ANOVAs to compare some school-level
variables as a part of their larger study. The groups they compared were schools where
students scored lower than expected, as expected, and higher than expected. They used
one-way ANOVAs to determine if there were mean differences in these groups on school
funding and administrative support, among other variables.

Research questions
The authors had two research questions that are reviewed here, although they had mul-
tiple other questions in the study:

1. Were there mean differences in school funding among the three groups of schools?
2. Were there mean differences in administrative support among the three groups of
schools?

Hypotheses
The authors hypothesized the following related to school funding:

H0: There were no significant differences in school funding between the three groups
of schools. (Mlower-than-expected = Mas-expected = Mhigher-than-expected)
H1: There were significant differences in school funding between the three groups of
schools. (Mlower-than-expected ≠ Mas-expected ≠ Mhigher-than-expected)

The authors hypothesized the following related to administrative support:

H0: There were no significant differences in administrative support between the three


groups of schools. (Mlower-than-expected = Mas-expected = Mhigher-than-expected)
H1: There were significant differences in administrative support between the three
groups of schools. (Mlower-than-expected ≠ Mas-expected ≠ Mhigher-than-expected)

Variables being measured


The researchers collected data on the two dependent variables we focus on in this review
in two ways. First, for school funding, they collected data on funding in dollars from
public datasets. They did not evaluate reliability or validity evidence for school fund-
ing because it was collected in actual dollars. Then, for administrative support, they
collected eight indicators of administrative support and created a combined variable
from those indicators. For that scale, they reported a coefficient alpha of .73 to measure
internal consistency, which is in the acceptable range. They also reported the results of
exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) to establish
structural validity. They did not report on other aspects of validity in this article.

Conducting the analysis


1. What test did they use, and why?
The authors used two one-way ANOVAs to determine if there were significant differ-
ences in school funding and administrator support between low-income schools that
performed better than expected, as expected, and lower than expected.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
School funding was measured in real dollars, so it was a ratio variable. It has
a potential true, absolute zero, where $0 would indicate an absolute lack of
funding. The administrator support ratings were interval and were based on a
combination of eight indicators.
b. Normality of the dependent variable
In this article, the authors used a graphical analysis to support the claim that
the data were normally distributed, and they noted no deviations from nor-
mality. We could also test using skewness and kurtosis statistics as described in
Chapter 3. As we have noted elsewhere, doing that test in the data in the online
resources will produce a near perfect normal distribution because of how we
have simulated those data.
c. Observations are independent
The authors note no issues with independence. There may be some questions
about nesting within a state or district, but the authors argue these observa-
tions are independent.
d. Random sampling and assignment
The sample was not random but was instead a sample of a large number of
schools. The authors’ sampling strategy was robust, although it was not a ran-
dom sample. Schools were not randomly assigned to groups—instead groups
were determined based on school test scores.
e. Homogeneity of variance
The assumption of homogeneity of variance was met, based on Levene’s test,
for school district funding (F2, 635 = 2.060, p = .128) and for effective adminis-
trative support (F2, 635 = .728, p = .483).
3. What was the result of that test?
There was a significant difference between the three groups of schools in school
district funding (F2, 635 = 4.499, p = .011) and in effective administrator support
(F2, 635 = 3.556, p = .029).

4. What was the effect size, and how is it interpreted?


For institutional funding:

$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W} = \frac{49.228 - (2)(5.471)}{3523.451 + 5.471} = \frac{49.228 - 10.942}{3528.922} = \frac{38.286}{3528.922} = .011$$

For effective administrative support:

$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W} = \frac{68.566 - (2)(9.642)}{6191.249 + 9.642} = \frac{68.566 - 19.284}{6200.891} = \frac{49.282}{6200.891} = .008$$

About 1% of the variance in institutional funding was explained by the school


groups (ω2 = .011), while less than 1% of the variance in effective administrator
support was explained by the school groups (ω2 = .008).
5. What is the appropriate follow-up test, if any?
We used Scheffe post-hoc tests to determine how institutional funding and effec-
tive administrative support differed between the school groups.
6. What is the pattern of group differences?
For school district funding, there was a significant difference between lower-
than-expected schools and as-expected (p = .049) and better-than-expected (p =
.045) schools. There was no significant difference in funding between as-expected
and better-than-expected (p = .593) schools. For effective administrative support,
there was a significant difference between lower-than-expected and as-expected
schools (p = .050).3 However, there were no significant differences in effective
administrator support between lower-than-expected versus better-than-expected
(p = .998) schools, nor between as-expected and better-than-expected schools
(p = .262). Schools performing lower-than-expected had lower funding than as-
expected or better-than-expected schools, and had lower effective administrative
support than those performing as-expected.

Write-up

Results

We used two one-way ANOVAs to determine if there were significant dif-


ferences in school funding and administrator support between low-income


schools that performed better than expected, as expected, and lower than
expected. There was a significant difference between the three groups of
schools in school district funding (F2, 635 = 4.499, p = .011) and in effective
administrator support (F2, 635 = 3.556, p = .029). About 1% of the variance
in institutional funding was explained by the school groups (ω2 = .011),
while less than 1% of the variance in effective administrator support was
explained by the school groups (ω2 = .008). We used Scheffe post-hoc tests
to determine how institutional funding and effective administrative support
differed between the school groups. For school district funding, there was a
significant difference between lower-than-expected schools and as-expected
(p = .049) and better-than-expected (p = .045) schools. There was no sig-
nificant difference in funding between as-expected and better-than-expected
(p = .593) schools. For effective administrative support, there was a signifi-
cant difference between lower-than-expected and as-expected schools
(p  =  .050). However, there were no significant differences in effective
administrator support between lower-than-expected versus better-than-
expected (p  = .998) schools, nor between as-expected and better-than-
expected schools (p = .262). See Table 9.2 for descriptive statistics by
group. Schools performing lower-than-expected had lower funding than
as-expected or better-than-expected schools and had lower effective admin-
istrative support than those performing as-expected. These differences were
small, though statistically significant.


Table 9.2
Descriptive Statistics by Group

                          School Funding       Effective Administrative Support
Group                  N       M        SD      M        SD
Lower-than-expected    232     8.650    2.400   3.520    2.990
As-expected            339     9.140    2.250   4.170    3.190
Better-than-expected   67      9.460    2.560   3.490    3.060
Total                  638     8.995    2.352   3.862    3.118

In APA style, tables go after the reference page, with each table starting on a new page.
For this example, we might make a table such as Table 9.2.
Again, we encourage you to look up the original studies these cases highlight. Read
those articles and think about how and why what they have written might be different.
Doing so will also help you to see how these analyses get used in published work in the
field of educational research.
For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.

Notes
1 In this case, the authors call this difference significant in the published manuscript. They
include a footnote that it is “significant at the p < .1 level.” We would argue against setting the
alpha criterion this high, particularly when there are multiple ANOVAs being used, and thus
in our version of this Results section, we call that difference nonsignificant.
2 In the published version of the manuscript the authors report eta squared (η2) as their effect
size estimate. We will discuss this effect size estimate in later chapters. However, for now it is
sufficient to know that eta squared tends to overestimate effect sizes. That difference is clear
in this example as the published eta squared estimates are larger than our calculated omega
squared estimates. That is a feature of eta squared—it usually overestimates effect sizes.
3 This is a tricky probability to interpret at first glance. Initially, jamovi will say that p = .050,
which if exactly true would mean the difference was not significant (because only when
p < .05 do we reject the null, so at p = .05, we would fail to reject the null). However, if we click
the settings (three vertical dots in the upper right corner) and change to 4 decimal places (4
dp, the ten thousandths place), we will see the exact value is .0496, which is less than .050, so
we reject the null and interpret this difference as statistically significant.
10
Comparing means across two
independent variables

The factorial ANOVA

Introducing the factorial ANOVA 156


Interactions in the factorial ANOVA 157
Research design and the factorial ANOVA 158
Assumptions of the factorial ANOVA 158
Level of measurement for the dependent variable is interval or ratio 158
Normality of the dependent variable 159
Observations are independent 159
Random sampling and assignment 160
Homogeneity of variance 160
Calculating the test statistic F 160
Calculating the factorial ANOVA 160
Partitioning variance 161
Calculating the factorial ANOVA source table 161
Using the F critical value table 166
Interpreting the test statistics 166
Effect size for the factorial ANOVA 167
Calculating omega squared 167
Computing the test in jamovi 168
Computing the factorial ANOVA in jamovi 169
Determining how cells differ from one another and interpreting the
pattern of cell differences 173
Simple effects analysis for significant interactions 173
Interpreting the main effects for nonsignificant interactions 175
Writing up the results 175
Note 178

In the previous chapters, we explored how to compare more than two groups in the
one-way ANOVA. That design required one categorical independent variable with two
or more levels. Usually, as we noted in the previous chapters, the one-way ANOVA is
used only when there are more than two levels on the independent variable (e.g., more


than two groups) because if there were only two, the independent samples t-test would
be the simpler alternative. However, we often have more than one independent variable,
and testing them separately does not allow us to get at the possible interactions between
our independent variables. For example, what if a treatment has different effects based
on gender? We often wonder about the interactions among more than one variable, and
have a statistical design available to test for those effects: the factorial ANOVA.

INTRODUCING THE FACTORIAL ANOVA


The factorial ANOVA requires a single dependent variable, measured at the interval or
ratio level, just like the one-way ANOVA and the independent samples t-test. However,
the factorial ANOVA also requires two independent variables, with two or more groups
each. Often, researchers describe the factorial ANOVA based on how many groups there
are on each variable. For example, if a researcher measured two treatment conditions
(experimental and control, for example) across two levels of school (undergraduate
and graduate, for example), they might refer to this as a 2×2 ANOVA or 2×2 factorial
ANOVA. If they had three groups on one independent variable, and four on the second
independent variable, they might call this a 3×4 or 4×3 ANOVA. The numbers simply
describe the number of groups on each of the independent variables.
Because we will have two independent variables, we also have combinations of inde-
pendent variables possible, creating what are called cells. The table below illustrates this
in the case of a 2×2 factorial ANOVA:

                        Variable 1, Group 1                  Variable 1, Group 2
Variable 2, Group 1     People in group 1 on both            People in group 2 on variable 1
                        variables                            and group 1 on variable 2
Variable 2, Group 2     People in group 1 on variable 1      People in group 2 on both
                        and group 2 on variable 2            variables

These cells will be under analysis in the factorial ANOVA. Each cell has a mean on the
dependent variable, and those means can be compared. However, the factorial ANOVA
will also analyze two other kinds of means, in addition to these cell means. Those are mar-
ginal means and the grand mean. We have encountered the idea of a grand mean before:
it is the mean of all the scores, regardless of group membership. Marginal means are the
means across only one group membership at a time (ignoring the second independent
variable). To illustrate, let’s imagine the above table, but with three people per group:

                        Variable 1, Group 1       Variable 1, Group 2       Marginal means
Variable 2, Group 1     3, 5, 7; M = 5.000        2, 4, 7; M = 4.333        M = 4.667
Variable 2, Group 2     1, 2, 4; M = 2.333        2, 3, 5; M = 3.333        M = 2.833
Marginal means          M = 3.667                 M = 3.833                 GM = 3.750

We get the cell means by taking the mean of the three scores in each cell. Then the
marginal means for Variable 2 (shown in the first two rows of the right-hand column)
are the means of the six scores in each group (ignoring Variable 1 groups). Next, the
marginal means for Variable 1 (shown in the second and third columns of the bottom
row) are the means of the six scores in each group on Variable 1 (ignoring Variable 2
groups). Finally, the grand mean (shown in the bottom right cell) is the mean of all
twelve scores.
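A few lines of code can make the three kinds of means concrete. The sketch below (ours, not part of the text) recomputes the cell means, marginal means, and grand mean for the twelve scores in the table above.

```python
# Minimal sketch (ours): cell, marginal, and grand means for the 2x2 example above.
import numpy as np

# scores[v2_group][v1_group] holds the three scores in that cell
scores = np.array([
    [[3, 5, 7], [2, 4, 7]],   # Variable 2, Group 1
    [[1, 2, 4], [2, 3, 5]],   # Variable 2, Group 2
])

print("cell means:\n", scores.mean(axis=2))                    # [[5.000, 4.333], [2.333, 3.333]]
print("Variable 2 marginal means:", scores.mean(axis=(1, 2)))  # [4.667, 2.833]
print("Variable 1 marginal means:", scores.mean(axis=(0, 2)))  # [3.667, 3.833]
print("grand mean:", scores.mean())                            # 3.750
```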

Interactions in the factorial ANOVA


We mentioned above that the primary focus of a factorial ANOVA design is on the inter-
actions between two independent variables. Imagine the group mean plot below, where
we have tested two kinds of psychotherapy among patients diagnosed with mood dis-
orders and those diagnosed with anxiety disorders, where the means are a treatment
outcome measure.

[Figure: plot of treatment outcome cell means for Treatment A and Treatment B across the Mood Disorder and Anxiety Disorder groups; the lines cross.]

We see in this graph an interaction between disorder type and treatment. Those diag-
nosed with a mood disorder showed better outcomes with treatment B, but those diag-
nosed with an anxiety disorder showed better outcomes with treatment A. Without
getting into the specifics of the analysis just yet, there is little difference in the means of
the two treatments if we disregard disorder type. However, by looking at disorder type
and treatment together, we see a dramatic pattern reversal. This kind of interaction
is called a disordinal interaction. In a disordinal interaction, if we plot the means (as
we have done above), the lines will cross, indicating a pattern reversal. What we mean
by a pattern reversal is that, for example, those diagnosed with a mood disorder did
better with treatment B and worse with treatment A, but for those diagnosed with an
anxiety disorder, the pattern is reversed: they did better with treatment A and worse
with treatment B.

The graph below illustrates the other sort of interaction we might find.
[Figure: plot of treatment outcome cell means for Treatment A and Treatment B across the Mood Disorder and Anxiety Disorder groups; the lines do not cross.]

In this example, we see an ordinal interaction. In an ordinal interaction, the lines do not
cross. That means we do not see a pattern reversal. In our example, participants getting
treatment A had worse outcomes for both disorders. However, we see a bigger difference
in outcomes among those with anxiety disorders. This is not a pattern reversal, as in a
disordinal interaction, but there is an interaction between the two independent variables.
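Since both figures are simply line plots of cell means, they are easy to reproduce. The sketch below is ours; it draws an interaction plot using the cell means from the worked example later in this chapter, which show the crossing, disordinal pattern. Swapping in other means would show an ordinal pattern.

```python
# Minimal sketch (ours): an interaction plot of cell means. The means below are
# the cell means from the worked example later in this chapter.
import matplotlib.pyplot as plt

diagnoses = ["Mood Disorder", "Anxiety Disorder"]
treatment_a_means = [2.0, 6.0]
treatment_b_means = [5.0, 4.0]

plt.plot(diagnoses, treatment_a_means, marker="o", label="Treatment A")
plt.plot(diagnoses, treatment_b_means, marker="o", label="Treatment B")
plt.ylabel("Treatment outcome (cell mean)")
plt.legend()
plt.show()
# Crossing lines indicate a disordinal interaction; lines that differ in slope
# but do not cross indicate an ordinal interaction.
```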

RESEARCH DESIGN AND THE FACTORIAL ANOVA


One design issue that we’ve discussed before is that of balanced sample sizes. In previous
chapters, we have discussed the idea that there should be no group that is more than
twice the size of any other group. We still have the issue of balanced sample sizes in the
factorial ANOVA, but now we need to think about balanced cell sizes. In other words,
the total number of cases in any one cell should not be more than twice the number of
cases in any other cell. Similarly, we previously said we need at least 30 per group in prior
analyses. For the factorial ANOVA, we’ll need at least 30 people per cell.

ASSUMPTIONS OF THE FACTORIAL ANOVA


The factorial ANOVA has all of the same assumptions as the one-way ANOVA, but some
apply a bit differently in this design. We will briefly review each of the assumptions, and
point out those that differ in application from the chapter on the one-way ANOVA.

Level of measurement for the dependent variable is interval or ratio


As with the other designs we have learned, the factorial ANOVA requires the dependent
variable to be at the interval or ratio level. In other words, this test requires a continuous
dependent variable. This assumption does not differ from the one-way ANOVA or the
independent samples t-test.

Normality of the dependent variable


The factorial ANOVA also assumes a normally distributed dependent variable. This
assumption also does not differ from either the one-way ANOVA or the independent
samples t-test. As with those designs, we will evaluate the normality of the dependent
variable using skewness and kurtosis statistics.

Observations are independent


This assumption is similar to the assumption of independence in prior designs but with
an added wrinkle. As with the one-way ANOVA or independent samples t-test, the fac-
torial ANOVA requires that the observations be independent. In other words, we expect
each individual test score or observation or survey to be independent of all others. We
have described this assumption in more detail in the prior chapters. The added wrinkle
with the factorial ANOVA is that the design also requires that group membership on
the two independent variables not be dependent. In other words, the group in which a
participant is placed on the first variable should not be related to the group in which they
are placed on the second variable.
This assumption gets particularly tricky when working with pseudo-independent
variables (variables we are treating as independent, but that are actually comprised of
intact groups). Say, for example, we are testing a 2×2 design where we have students
who are enrolled in an online course versus those enrolled in a face-to-face course as
one independent variable, and whether or not they have worked on a previous research
study as the other independent variable. We want to know if there is an interaction
between those two variables in terms of end-of-course exam scores. In such a scenario,
we are likely to have an issue with independence. It may be that online students are
less likely to have worked on a research study in the past, perhaps because of distance
or the difficulty of joining a research team from off-campus. That would also mean,
then, that face-to-face students might be more likely to have been involved in running
a research study by virtue of proximity and access. In that case, group membership
on the two variables (class type and prior work on a research study) are dependent.
There is a relationship between group membership on one variable and on the second
variable.
Thankfully, this is relatively simple to spot in most cases. In the above scenario, we
would see a disproportionately high number of cases in the online, no prior experi-
ence cell, and in the face-to-face, prior experience cell. Those combinations are more
likely to occur, so the number of participants in those cells will be higher. As we noted
above, this design requires balanced cell sizes, so that situation will create an additional
problem with unbalanced cells. As a final note on this issue, if one or both variables are
true independent variables where participants are randomly assigned to groups, this
issue cannot occur. It can only occur when neither independent variable is randomly
assigned.
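One quick, informal way to spot this kind of dependence (our suggestion, not a procedure from the text) is to cross-tabulate the cell counts before running the analysis. The sketch below uses hypothetical data.

```python
# Minimal sketch (ours): cross-tabulating cell counts to spot dependence between
# two grouping variables. The data are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "course_format":  ["online"] * 40 + ["face_to_face"] * 40,
    "prior_research": ["no"] * 35 + ["yes"] * 5 + ["no"] * 10 + ["yes"] * 30,
})

print(pd.crosstab(df["course_format"], df["prior_research"]))
# Heavily lopsided cells (for example, almost all online students in the "no"
# column) suggest that group membership on the two variables is not independent,
# and the cell sizes will be unbalanced as well.
```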

Random sampling and assignment


As with the independent samples t-test and one-way ANOVA, the factorial ANOVA
assumes that the sample under analysis was randomly sampled from the population,
and that participants were randomly assigned to groups on the independent variables.
We have discussed more completely the implications of this assumption in the prior
chapters, so will not repeat it entirely here. The one note in factorial ANOVA is that
this design assumes random assignment on both independent variables. As with prior
designs, it’s quite common to have one or both variables be pseudo-independent varia-
bles that use intact groups. This is especially common in the factorial design where even
in cases where one independent variable has been randomly assigned, the second one
is often comprised of intact groups (e.g., testing for differences across the interaction of
randomly assigned treatment types and gender).

Homogeneity of variance
As in the one-way ANOVA, the factorial ANOVA assumes homogeneity of variance. In
the case of the factorial ANOVA, the assumption is that the variances are equal across all
cells. The most common reason for a violation of this assumption is unbalanced sample
sizes. As we explored in the prior chapter on the one-way ANOVA, variance is related to
sample size, and all else being equal, smaller samples have larger variance. So, when the
cell sizes become unbalanced, we will typically see the smaller cells have higher variance
and larger cells have lower variance. This means that one way to protect against viola-
tions of this assumption is to strive for balanced cell sizes.

Levene’s test for equality of error variances


As with the one-way ANOVA and the independent samples t-test, the test for homoge-
neity of variance will be Levene’s test. In the factorial ANOVA, Levene’s test will evaluate
the equality of the variances across all cells. The null hypothesis is that the variances are
equal, so when p > .05, the assumption is met. This assumption is basically the same as
it has been in the prior designs.

CALCULATING THE TEST STATISTIC F


As with all ANOVA designs, the test statistic will be F. However, there is a major differ-
ence in the factorial ANOVA versus the one-way ANOVA, which is that the factorial
ANOVA will produce three F statistics. It will produce one F statistic for the first inde-
pendent variable, one for the second independent variable, and one for the interaction.
Although we are primarily interested in the interaction, we will sometimes also examine
the independent variables one at a time. The test for each independent variable by itself
is sometimes referred to as the main effect. So, we have two main effects and one inter-
action to test in the factorial ANOVA.

Calculating the factorial ANOVA


You should recall that in the one-way ANOVA, our source table had three sources:
between, within, and total variation. In the factorial ANOVA, we will still have within
and total variation, which are conceptually the same as in the one-way ANOVA (though
calculated slightly differently). However, the between-groups variation will be split up, or
partitioned, into three sources: the main effect of the first independent variable, the main
effect of the second independent variable, and the interaction.

Partitioning variance
In order to better understand how variance is partitioned in the factorial ANOVA, we
present below a source table:

Source                    SS           df            MS            F
Independent Variable 1    SS_IV1       df_IV1        MS_IV1        F_IV1
Independent Variable 2    SS_IV2       df_IV2        MS_IV2        F_IV2
Interaction               SS_IV1*IV2   df_IV1*IV2    MS_IV1*IV2    F_IV1*IV2
Within                    SS_within    df_within     MS_within
Total                     SS_total     df_total      MS_total

As we described earlier, the source table for this design has within and total variation
(just like the one-way ANOVA did), but now partitions “between” variation into the var-
iation on the first independent variable, on the second independent variable, and then
on the interaction. The test also produces three F statistics—one for each of the two main
effects and one for the interaction.

Calculating the factorial ANOVA source table


In this section, we return to an earlier example from this chapter. Imagine, as we did
earlier, that we have two groups of patients: those diagnosed with mood disorders,
and those diagnosed with anxiety disorders. There are two psychotherapy treatment
protocols we wish to test, which we will call Treatment A and Treatment B. Both treat-
ments are used fairly widely for both mood and anxiety disorders. That point is impor-
tant for ethical reasons—it would not be ethical to provide a treatment we already
know will not work or to withhold a particular treatment if we already have reason to
believe it will be better. In this case, both treatments see use in practice and there is
widespread disagreement on which is more effective, so there is no ethical dilemma.
We mentioned above that in practice, we would need to have at least 30 participants
per cell, and this is a 2 (diagnosis type) × 2 (treatment protocol) design, so we would
need at least 120 participants because there are four cells. For the purposes of illus-
tration, we will imagine only three participants per cell to make the calculations more
manageable. Our dependent variable will be treatment outcome, as measured by a
standardized treatment outcome scale. In the table below are the scores for partici-
pants from each cell.

Mood Disorder Anxiety Disorder

Treatment A 1, 3, 2 7, 5, 6
Treatment B 6, 4, 5 3, 5, 4

As with the one-way ANOVA, we’ll need to calculate the between-groups and within-groups
variation. However, in the case of the factorial ANOVA, the overall between variation
is partitioned into variance from the first IV (disorder type in our case), the second IV
(treatment type, in our case), and the interaction of those two independent variables. We
will still calculate within and total variation, however. In the table below are the formulas
for calculating the source table:

Source     SS                                                df                           MS                                     F
IV1        SS_IV1 = ∑(X̄_IV1 − X̄)²                            k_IV1 − 1                    MS_IV1 = SS_IV1 / df_IV1               F_IV1 = MS_IV1 / MS_within
IV2        SS_IV2 = ∑(X̄_IV2 − X̄)²                            k_IV2 − 1                    MS_IV2 = SS_IV2 / df_IV2               F_IV2 = MS_IV2 / MS_within
IV1*IV2    SS_IV1*IV2 = SS_total − SS_IV1 − SS_IV2 − SS_within    (df_IV1)(df_IV2)        MS_IV1*IV2 = SS_IV1*IV2 / df_IV1*IV2   F_IV1*IV2 = MS_IV1*IV2 / MS_within
Within     SS_within = ∑(X − X̄_cell)²                         N_total − [(k_IV1)(k_IV2)]   MS_within = SS_within / df_within
Total      SS_total = ∑(X − X̄)²                               N_total − 1

There is some new notation in this table. Notice there are several different kinds of means.
We have group means (like X̄_IV1, which is the group mean based on the first independent
variable), we have cell means (X̄_cell, the mean per cell), and the grand mean
(X̄), which is the mean of all scores regardless of group membership.
To begin our calculations, we will calculate group means, means for each variable
(also known as marginal means because they are in the margins of the table), cell means,
and a grand mean. Here we are using the standard formula for a mean, but we are doing
it multiple times for different groups of participants:

                            Mood Disorder                Anxiety Disorder             Marginal (treatment) means
Treatment A                 (1 + 3 + 2)/3 = 6/3 = 2      (7 + 5 + 6)/3 = 18/3 = 6     (6 + 18)/6 = 24/6 = 4
Treatment B                 (6 + 4 + 5)/3 = 15/3 = 5     (3 + 5 + 4)/3 = 12/3 = 4     (15 + 12)/6 = 27/6 = 4.5
Marginal (disorder) means   (6 + 15)/6 = 21/6 = 3.5      (18 + 12)/6 = 30/6 = 5       Grand mean: 51/12 = 4.25

So, the mean for the mood disorder group is 3.5, for the anxiety disorder group is 5, for
Treatment A is 4, for Treatment B is 4.5, and the grand mean is 4.25.
Using this information, we can calculate all of our sums of squares. In the tables
below, we illustrate the calculations for each sum of squares, beginning with the total
sum of squares:

IV1                IV2            X      X − X̄             (X − X̄)²

Mood Disorder Treatment A 1 1 − 4.25 = −3.25 10.563


3 3 − 4.25 = −1.25 1.563
2 2 − 4.25 = −2.25 5.063
Treatment B 6 6 − 4.25 = 1.75 3.063
4 4 − 4.25 = −0.25 0.063
5 5 − 4.25 = 0.75 0.563
Anxiety Disorder Treatment A 7 7 − 4.25 = 2.75 7.563
5 5 − 4.25 = 0.75 0.563
6 6 − 4.25 = 1.75 3.063
Treatment B 3 3 − 4.25 = −1.25 1.563
5 5 − 4.25 = 0.75 0.563
4 4 − 4.25 = −0.25 0.063
∑ = 34.256

Notice that the grand mean is constant across all groups, because it is the mean of all
scores regardless of group membership. So, our total sum of squares (the sum of the
right-most column) is 34.256.
Next, we will calculate within sum of squares:

IV1                IV2            X      X − X̄_cell        (X − X̄_cell)²

Mood Disorder Treatment A 1 1 − 2 = −1 1


3 3−2=1 1
2 2−2=0 0
Treatment B 6 6−5=1 1
4 4 − 5 = −1 1
5 5−5=0 0
Anxiety Disorder Treatment A 7 7 − 6 = 1 1
5 5 − 6 = −1 1
6 6−6=0 0
Treatment B 3 3 − 4 = −1 1
5 5−4=1 1
4 4–4=0 0
∑=8

Notice here that we use the individual cell means, subtracting them from each score. The
within sum of squares (calculated by summing the right-most column) is 8.
Next, we will calculate the main effect of disorder type:

IV1                IV2            X      X̄_IV1 − X̄            (X̄_IV1 − X̄)²
Mood Disorder      Treatment A    1      3.5 − 4.25 = −0.75    0.563
Mood Disorder      Treatment A    3      3.5 − 4.25 = −0.75    0.563
Mood Disorder      Treatment A    2      3.5 − 4.25 = −0.75    0.563
Mood Disorder      Treatment B    6      3.5 − 4.25 = −0.75    0.563
Mood Disorder      Treatment B    4      3.5 − 4.25 = −0.75    0.563
Mood Disorder      Treatment B    5      3.5 − 4.25 = −0.75    0.563
Anxiety Disorder   Treatment A    7      5 − 4.25 = 0.75       0.563
Anxiety Disorder   Treatment A    5      5 − 4.25 = 0.75       0.563
Anxiety Disorder   Treatment A    6      5 − 4.25 = 0.75       0.563
Anxiety Disorder   Treatment B    3      5 − 4.25 = 0.75       0.563
Anxiety Disorder   Treatment B    5      5 − 4.25 = 0.75       0.563
Anxiety Disorder   Treatment B    4      5 − 4.25 = 0.75       0.563
                                                               ∑ = 6.756

In this case, our calculations look a little different. The formula calls for the difference
between the group mean and the grand mean. Everyone in a group has the same group
mean, so we get the same result for all members of that group. The fact that all the squared
deviation scores come out identical across both groups is a feature of the balanced 2×2
design, where the two group means sit the same distance from the grand mean. In other
designs, the squared deviations would still be constant within a group but would differ from
group to group. Taking the sum of all those squared deviation scores (adding up
everything in the right-most column), we calculate the sum of squares for disorder type
as 6.756.
Next, we will calculate the sum of squares for treatment type. This will follow a similar
pattern, except now we are concerned with the means per treatment type:

IV1                IV2            X    X̄_IV2 − X̄           (X̄_IV2 − X̄)²

Mood Disorder      Treatment A    1    4 − 4.25 = −0.25      0.063
                                  3    4 − 4.25 = −0.25      0.063
                                  2    4 − 4.25 = −0.25      0.063
                   Treatment B    6    4.5 − 4.25 = 0.25     0.063
                                  4    4.5 − 4.25 = 0.25     0.063
                                  5    4.5 − 4.25 = 0.25     0.063
Anxiety Disorder   Treatment A    7    4 − 4.25 = −0.25      0.063
                                  5    4 − 4.25 = −0.25      0.063
                                  6    4 − 4.25 = −0.25      0.063
                   Treatment B    3    4.5 − 4.25 = 0.25     0.063
                                  5    4.5 − 4.25 = 0.25     0.063
                                  4    4.5 − 4.25 = 0.25     0.063
                                                       ∑ = 0.756

Finally, we’re ready to calculate the sum of squares for the interaction:
SS_IV1*IV2 = SS_total − SS_IV1 − SS_IV2 − SS_within = 34.256 − 6.756 − 0.756 − 8 = 18.744
Now, we have all the information we need to complete the source table and calculate our
F ratios:

Source     SS        df                    MS                       F

IV1        6.756     2 − 1 = 1             6.756 / 1 = 6.756        6.756 / 1 = 6.756
IV2        0.756     2 − 1 = 1             0.756 / 1 = 0.756        0.756 / 1 = 0.756
IV1*IV2    18.744    (1)(1) = 1            18.744 / 1 = 18.744      18.744 / 1 = 18.744
Within     8         12 − [(2)(2)] = 8     8 / 8 = 1
Total      34.256    12 − 1 = 11
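For readers who prefer to verify the whole table programmatically, here is a short Python sketch of the same variance partitioning (again our own illustration, with arbitrary names; it is not jamovi output). Because it carries full precision rather than rounding each squared deviation to three decimals, it returns 6.75, 0.75, 18.75, and 34.25 rather than the 6.756, 0.756, 18.744, and 34.256 shown in the hand calculations.

```python
# Raw scores for the 2x2 example: (disorder, treatment) -> treatment outcomes
scores = {
    ("Mood", "A"): [1, 3, 2],
    ("Mood", "B"): [6, 4, 5],
    ("Anxiety", "A"): [7, 5, 6],
    ("Anxiety", "B"): [3, 5, 4],
}

all_scores = [x for cell in scores.values() for x in cell]
grand = sum(all_scores) / len(all_scores)

def marginal(level, pos):
    """Mean of every score whose cell key matches `level` at position `pos` (0 = IV1, 1 = IV2)."""
    vals = [x for key, cell in scores.items() if key[pos] == level for x in cell]
    return sum(vals) / len(vals)

# Sums of squares, following the source-table formulas
ss_total = sum((x - grand) ** 2 for x in all_scores)
ss_within = sum((x - sum(cell) / len(cell)) ** 2 for cell in scores.values() for x in cell)
ss_iv1 = sum((marginal(d, 0) - grand) ** 2 for (d, t), cell in scores.items() for _ in cell)
ss_iv2 = sum((marginal(t, 1) - grand) ** 2 for (d, t), cell in scores.items() for _ in cell)
ss_inter = ss_total - ss_iv1 - ss_iv2 - ss_within

df_iv1 = df_iv2 = 1                     # k - 1 for each two-level factor
df_inter = df_iv1 * df_iv2
df_within = len(all_scores) - 2 * 2     # N_total - (k_IV1)(k_IV2) = 8
ms_within = ss_within / df_within

print("F_disorder    =", (ss_iv1 / df_iv1) / ms_within)      # 6.75
print("F_treatment   =", (ss_iv2 / df_iv2) / ms_within)      # 0.75
print("F_interaction =", (ss_inter / df_inter) / ms_within)  # 18.75
```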

Using the F critical value table


Now that we have calculated the F value for each of our comparisons, we can deter-
mine if these effects are statistically significant. We’ll start with the interaction, where
F1, 8 = 18.744. Looking at the critical value table (in A3 of this book), we find the critical
value at 1 and 8 degrees of freedom is 5.32. Because our calculated value (18.744)
“beats” the tabled value (5.32), we can reject the null hypothesis. We conclude that there
was a significant difference in treatment outcomes based on the interaction of treatment
type and diagnosis type.
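If a printed table is not handy, the same critical value can be looked up in software. A minimal sketch, assuming SciPy is available:

```python
from scipy.stats import f

# Critical F for alpha = .05 with 1 numerator and 8 denominator degrees of freedom
critical = f.ppf(0.95, dfn=1, dfd=8)
print(round(critical, 2))   # 5.32
```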
We can also interpret the main effects of treatment and diagnosis using the same cri-
teria. In this case, the degrees of freedom for all three effects are the same, because each
effect has 1 numerator degree of freedom. In any design other than a 2×2 design, though,
the numerator degrees of freedom would be different. In our case, we see that there was
a significant difference based on diagnosis type but not treatment type. Again, though,
because there is a significant interaction, we would not interpret these main effects.

Interpreting the test statistics


In the factorial ANOVA, if there is a significant interaction, we will interpret the inter-
action and only the interaction. The fact that an interaction exists tells us that nei-
ther of the main effects is sufficient to explain the dependent variable on its own.
Interpreting the main effects when there is a significant interaction can be misleading
and should be avoided. There are some rare exceptions when it might be important to
report and interpret the main effect even when there is a significant interaction. For
example, if all prior research has shown that a certain independent variable is impor-
tant in explaining an outcome, and our factorial design finds that main effect is not
significant when we include an interaction, that might be important to report. But in
general, when there is an interaction, the interaction is where all of our analytic and
interpretive focus will go.
In our case, we determined that treatment outcomes significantly differed based on
the combination of treatment type and diagnosis. What kind of significant interaction do

we have? If we plot the means, as we have below, we see the lines cross. This is a disordi-
nal interaction, meaning that there is a pattern reversal in the data.

[Plot of cell means: Mood Disorder and Anxiety Disorder on the horizontal axis, with separate lines for Treatment A and Treatment B; the lines cross.]

We will need additional follow-up tests to determine exactly which differences within this
pattern are statistically significant. We explain the follow-up analysis below. However, in
general, we see a pattern where it looks like people with mood disorders do better in treat-
ment B, and people with anxiety disorders do better in treatment A. This is what we mean
by a pattern reversal—the treatment that was better for one group is worse for another.
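A plot like the one just described is easy to reproduce yourself. Here is a small sketch using matplotlib with the cell means from our example (our own illustration, not the jamovi plot itself):

```python
import matplotlib.pyplot as plt

disorders = ["Mood Disorder", "Anxiety Disorder"]
cell_means = {"Treatment A": [2, 6], "Treatment B": [5, 4]}  # cell means from the worked example

for treatment, means in cell_means.items():
    plt.plot(disorders, means, marker="o", label=treatment)

plt.ylabel("Mean treatment outcome")
plt.legend()
plt.show()   # the two lines cross, which is what makes the interaction disordinal
```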
What if our interaction was not significant? In that case, we would examine the two
main effects. If either of them were significant, we proceed with that main effect, inter-
preting it in much the same way as a one-way ANOVA. If the significant main effect had
more than two groups, we’d perform a post-hoc test to determine how the groups differ,
just like we would in the one-way ANOVA. In jamovi, all of these steps can be handled
in the same menu, as we’ll demonstrate later in this chapter.

EFFECT SIZE FOR THE FACTORIAL ANOVA


In the factorial ANOVA, like all other null hypothesis significance tests, we need an effect
size estimate in addition to the F test itself. Because statistical significance is a yes/no
question, we need indicators of magnitude. For the factorial ANOVA, we will use the
same effect size estimate as we did for the independent samples t-test and the one-way
ANOVA: omega squared.

Calculating omega squared


The formula for omega squared in the factorial ANOVA is essentially unchanged from
the one-way ANOVA. The only difference is that we’ll replace “between” in the formula

with “effect” because we have three effects (two main effects and the interaction) we can
test. So, the formula for omega squared is:

ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W)
The only difference from what we presented in Chapter 8 is the replacement of “B” for
“between” with “E” for “effect.” All of the needed information is in our source table, so
we simply plug in those values and calculate our effect size estimate for the interaction:

ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W) = (18.744 − (1)(1)) / (34.256 + 1) = 17.744 / 35.256 = .503
This gets interpreted in the same way as it did in the one-way ANOVA. So, for our exam-
ple, this means that about 50% of the variance in treatment outcomes was explained by
the combination (or interaction) of treatment type and diagnosis type (ω2 = .503). Note
that our example is contrived, so the effect size in this case is much larger than would be
typical for this kind of research design.
We can calculate effect size estimates for the main effects in the same way. If we were
going to report the main effects at all, we would also need to report effect size estimates
for those effects.
Effect size for diagnosis type:
ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W) = (6.756 − (1)(1)) / (34.256 + 1) = 5.756 / 35.256 = .163
That would indicate that about 16% of the variance in treatment outcomes was explained
by the diagnosis type.
Effect size for treatment type:
ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W) = (0.756 − (1)(1)) / (34.256 + 1) = −0.244 / 35.256 = −.007 → .000
As we discovered in prior chapters, in the case of extremely small effects, sometimes the
omega squared formula can return a negative value. However, omega squared cannot
actually be negative, and we report those instances as showing no effect. In other words,
if the formula for omega squared returns a negative value, we interpret this as showing
that none of the variance was explained based on that comparison, and report omega
squared as .000.
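Because the same formula is applied to every effect, it can be convenient to wrap it in a small helper. The sketch below is our own illustration (the function name is arbitrary); it plugs in the values from the source table and clamps negative results to zero, as described above:

```python
def omega_squared(ss_effect, df_effect, ms_within, ss_total):
    """Omega squared for one effect in a factorial ANOVA; negative results are reported as .000."""
    value = (ss_effect - df_effect * ms_within) / (ss_total + ms_within)
    return max(value, 0.0)

ms_within, ss_total = 1.0, 34.256
print(omega_squared(18.744, 1, ms_within, ss_total))  # interaction: about .503
print(omega_squared(6.756, 1, ms_within, ss_total))   # disorder type: about .163
print(omega_squared(0.756, 1, ms_within, ss_total))   # treatment type: 0.0 (reported as .000)
```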

COMPUTING THE TEST IN JAMOVI


To calculate the factorial ANOVA in jamovi, we will begin by setting up a data file. We
will need three variables, which we might enter as shown in the following image:

We can also label the groups by going to the Data tab, and then clicking Setup. We could
label the groups on Disorder.

Then, we could label the groups on Treatment.

Computing the factorial ANOVA in jamovi


Next, to calculate the factorial ANOVA, go to the Analyses tab, then Linear Models,
then General Linear model.1 In this model, we will make Outcome the “Dependent
Variable” by clicking on it on the left, and moving it to the appropriate block on the right
using the arrow button. Then, move both Disorder and Treatment to the “Factors”
box using the appropriate arrow button (jamovi labels independent variables as fac-
tors in this menu). Just below that, uncheck the button for β and partial η2, and check

the box for ω2 under Effect Sizes. We generally recommend using omega squared (ω2)
for the effect size estimate when possible. Eta squared (η2) is another effect size esti-
mate that’s commonly used. It is interpreted in the same way as omega squared, but
tends to produce an overestimate of the effect size in most cases (­Keppel & Wickens,
2004).

To produce Levene’s test, under Assumptions Checks, check the box for Homoge-
neity tests. We can also check the box next to Estimated Marginal Means under the
Estimated Marginal Means heading to produce cell means and standard errors.

We can also produce plots of group means under the Plots heading. This allows you to
specify a variable to put across the horizontal axis (X axis) and one to split into separate
lines. Sometimes, it might be a good idea to put one independent variable as horizontal
and the other as separate lines, and produce a second plot switching those two places.
In our case, though, we’ll put Disorder on the horizontal axis and Treatment as separate
lines.

The first piece of output shows a model summary. This information is not especially
useful for the design we are specifying, and we won’t interpret any of this block of output.

Next is the ANOVA summary table. This table produces some information that is not useful
for our purposes, but also produces all of the information we need to interpret and report
the omnibus test. Specifically, the “Model” line of output is not one we will interpret in a
factorial ANOVA design. It is produced because this program can run a wide array of anal-
yses, some of which would use that output. Next, we see the main effect of Disorder alone,
the main effect of Treatment alone, the interaction of Disorder and Treatment (Disorder *
Treatment), error or within (labelled in jamovi as Residuals), and Total.

Here, we see a significant interaction (F1, 8 = 18.750, p = .003). Because the interaction
is significant, we will not interpret the main effects of disorder or treatment type. We
demonstrated the calculation of effect size from the source table earlier in this chapter.
However, here jamovi has produced the effect size estimate for us, as well. The next piece
of the output is the “Fixed Effects Parameter Estimates,” which we can ignore for our
purposes. Then, it produces Estimated Marginal Means, showing group and cell means
and standard errors. We could also produce group descriptive statistics through the
Exploration → Descriptives menu. Next, jamovi produces the plot we requested.
Plots

[jamovi output: a plot of mean Outcome by Disorder (Mood Disorder, Anxiety Disorder), with separate lines for Treatment A and Treatment B.]

On this plot, we see a disordinal interaction (which we know was statistically significant
based on the ANOVA table as discussed above). It’s disordinal because the lines cross,
showing a pattern reversal. Finally, jamovi produces Levene’s test at the very bottom of
the output. Because the assumption checks are printed at the bottom of the output, we
have to remember to drop down and check this test before proceeding to interpret the
other tests.

In these data we have a case where we see some unusual-looking output. It
lists F3, 8 = 8.315e-30 and p = 1.000. This needs a bit of explaining. The format 8.315e-30
is the same as writing 8.315 × 10⁻³⁰. That means to get the plain number, we would
move the decimal point thirty places to the left (making it
0.000000000000000000000000000008315). To save on space, jamovi uses scientific notation.
For most purposes, we would probably write this as F3, 8 < .001, p > .999. Because F is so
incredibly small, the probability value rounds up to 1.000, but we know it's really not quite
1.000, so we would report it as > .999 instead. Regardless, clearly p > .05, meaning the
assumption of homogeneity of variance was met.
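If the notation is unfamiliar, a quick way to convince yourself of what it means is to let Python write the number out, as in this small sketch:

```python
x = 8.315e-30                # how jamovi prints very small values
print(f"{x:.33f}")           # the same number written out in full decimal form
print(x < 0.001)             # True: far smaller than any alpha level we would use
```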

Determining how cells differ from one another and


interpreting the pattern of cell differences
Much like in the one-way ANOVA, a significant F value only tells us that differences
exist. We need follow-up analyses to determine exactly how the cells differ from one
another. In this section, we will present one approach to follow-up analysis in the facto-
rial ANOVA. There are other approaches that might work better in some circumstances,
such as simply running all pairwise comparisons. Below, though, we present one way of
approaching follow-up analysis: simple effects analysis.

Simple effects analysis for significant interactions


Given a significant interaction, simple effects analysis will allow us to test for differences on
one independent variable across levels of the other independent variable. In the example
above, we can test for differences between Treatment A and Treatment B among patients with
anxiety disorders and those with mood disorders. It will produce two comparisons: one com-
paring Treatments A and B among those with anxiety disorders, and the second comparing
Treatments A and B among those with mood disorders. The test could be flipped, though.
Notice that this requires us to make a theoretically driven choice. Are we interested in
differences between treatments for each of the disorders? Or in differences between the
disorders for each of the treatments? The research design should drive this decision. In
our example, we want to know about the effectiveness of the two treatments, so compar-
ing the two treatment types makes sense, and we’ll be able to test the differences in the
treatment types for two disorder types. Note also that this follow-up analysis works best
when the variable we want to base the comparison on (in this case, treatment type) has
only two groups. If it has three or more groups, we would need an additional follow-up
analysis beyond the simple effects analysis.

Calculating the simple effects analysis in jamovi


In jamovi, there is a built-in menu for producing the simple effects analysis, though
it can initially be confusing. Under the Simple Effects heading in the General Linear
Model program, we have the option to specify a Simple effects variable, a Moderator,
and/or a Breaking variable. We’ll use the Simple effects variable and the Moderator
variable boxes. The variable we want to compare should be listed as the Simple effects
variable, while the Moderator variable will be the variable on which we want to sep-
arate the analysis. So here, we want to produce a comparison of the two treatment
types for each of the two disorder types. To do so, we’ll move Treatment to the Simple
effects variable, and Disorder to the Moderator box. In the vast majority of cases, we’ll
try to select a variable with only two groups/levels as the Moderator.

Interpreting the pattern of differences in simple effects analysis


Those options produce the following output:

The first set of output shows the test of the difference between Treatment A and Treatment
B among those in the Mood Disorder group. We see a significant difference between the
two treatments among those in the Mood Disorder group (F1, 8 = 13.500, p = .006). The
second line shows the same comparison for the Anxiety Disorder group, where we also

see a significant difference between the two treatments (F1, 8 = 6.000, p = .040). In this
case, because the variable we listed as Moderator had only two groups, we can stop here.
But if that variable had more than two groups, the next set of output would further break
down the comparisons into pairwise comparisons for every group on the Moderator
variable, split by the groups on the Simple effects variable.
Based on this, we know there was a significant difference among those with mood
disorders between the two treatment types. Looking at our means (or the profile plots),
we see that those getting Treatment B had better outcomes compared with Treatment A
among those with mood disorders. The simple effects analysis also confirmed there was
a significant difference between the two treatment types for those with anxiety disorders.
Looking at the means (or profile plots), we can see that those getting Treatment A had
better outcomes among those with anxiety disorders. So, the overall pattern of results is
that those with mood disorders had better outcomes with Treatment B, while those with
anxiety disorders had better outcomes with Treatment A.
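Conceptually, each simple effect is just a comparison computed within one level of the moderator and tested against the pooled error term from the factorial model. The following Python sketch illustrates that logic with the cell means and MS within from our example; it is our own illustration of the idea, not a reproduction of jamovi's GAMLj routine:

```python
from scipy.stats import f

n = 3                               # cases per cell
ms_within, df_within = 1.0, 8       # error term from the factorial source table
cell_means = {"Mood": {"A": 2.0, "B": 5.0}, "Anxiety": {"A": 6.0, "B": 4.0}}

for disorder, means in cell_means.items():
    marginal = sum(means.values()) / len(means)
    ss_simple = sum(n * (m - marginal) ** 2 for m in means.values())
    df_simple = len(means) - 1
    f_value = (ss_simple / df_simple) / ms_within
    p_value = f.sf(f_value, df_simple, df_within)
    print(disorder, round(f_value, 3), round(p_value, 3))
# Mood 13.5 0.006
# Anxiety 6.0 0.04
```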

Interpreting the main effects for nonsignificant interactions


As we have mentioned already in this chapter, when there is a significant interaction, all
of our interpretive attention will go to the interaction. However, what if the interaction
isn’t significant? In that case, we proceed to interpret the main effects. In our example, if
the interaction had not been significant, we would look at the same source table to see if
either of the main effects were significant. In this case, we find a significant difference in
outcomes based on the disorder type (F1, 8 = 6.750, p = .032), but no significant difference
based on treatment type (F1, 8 = 0.750, p = .412). Here, because disorder type has only
two groups, we do not need any additional follow-up analysis. We can see based on the
descriptive statistics that those with mood disorders (M = 3.500, SD = 1.871) had worse
outcomes than those with anxiety disorders (M = 5.000, SD = 1.414). If we had no sig-
nificant interaction, but a significant main effect on a variable that had more than two
groups, we could add post-hoc tests. The available tests are the same as with the one-way
ANOVA (after all, the main effects are essentially equivalent to a one-way ANOVA),
and can be requested by selecting the appropriate options under the Post-Hoc Tests heading. It’s
important to emphasize, though, that in the case of the example we have presented here,
we would not interpret the main effects at all. Any time there is a significant interaction,
we do not interpret the main effects and instead focus our attention on the interaction.

WRITING UP THE RESULTS


After all these steps to analyze the data in a factorial ANOVA, we are ready to write up
the results. We will follow a similar format as we did in prior chapters, with some slight
changes to reflect the nature of the factorial ANOVA. In general, the format for writing
up a factorial ANOVA would be:

1. What test did we use, and why?


2. If there are issues with the assumptions, report them and any appropriate
corrections.
3. What was the result of the factorial ANOVA?
4. If the result was statistically significant, interpret the effect size (if not, report effect
size in #3).

5. If the interaction was significant, report the results of the follow-up analysis such
as simple effects analysis. If the interaction was not significant, report and interpret
the two main effects.
6. What is the pattern of group differences?
7. What is the interpretation of that pattern?

This general format is very similar to what we presented in Chapter 8, but adds some
specificity in items 3 through 5 that may be helpful in thinking through the factorial
design. For our example we’ve followed through most of the chapter, we provide sample
responses to these items, followed by a sample results paragraph.

1. What test did we use, and why?


We used a factorial ANOVA to determine if treatment outcomes varied across the
interaction of treatment type and disorder type for psychotherapy patients.
2. If there are issues with the assumptions, report them and any appropriate corrections.
In our case, we met the assumptions we have tested so far. Our data passed Levene’s
test, so we can assume homogeneity of variance. We should also evaluate for normal-
ity, though. Using the process described in Chapter 3, we can calculate skewness,
kurtosis, and their standard errors. We find that the data are normal in terms of skew
(skewness = −.335, SE = .637), and mesokurtic (kurtosis = −.474, SE = 1.232). So, the
data seem to meet the statistical assumptions. The design assumptions, though, pres-
ent some challenges. The participants were randomly assigned to treatment type (A
vs. B), but it is not possible to randomly assign disorder type. We also do not have a
random sample—these are patients who presented for treatment, meaning we likely
have strong self-selection bias. Those factors will limit the inferences we can draw.
3. What was the result of the factorial ANOVA?
There was a significant difference in treatment outcomes based on the interaction
of treatment type and disorder type (F1, 8 = 18.750, p = .003). (Notice that because
there is a significant interaction, we will not report or interpret the main effects of
treatment type or of disorder type.)
4. If the result was statistically significant, interpret the effect size (if not, report effect
size in #3).
The interaction accounted for about 50% of the variance in treatment outcomes
(ω2 = .503).
5. If the interaction was significant, report the results of the follow-up analysis such
as simple effects analysis. If the interaction was not significant, report and interpret
the two main effects.
To determine how outcomes differed across the interaction, we used simple effects
analysis. Among participants diagnosed with mood disorders, there was a signifi-
cant difference in treatment outcomes between Treatment A and Treatment B
(F1, 8 = 13.500, p = .006). Similarly, among those diagnosed with anxiety disorders,
there was a significant difference in treatment outcomes between the two treat-
ment types (F1, 8 = 6.000, p = .040).
6. What is the pattern of group differences?
Among those diagnosed with mood disorders, treatment outcomes were bet-
ter under Treatment B (M = 5.000, SD = 1.000) as compared with Treatment A
(M = 2.000, SD = 1.000). On the other hand, among those with anxiety disorders,
treatment outcomes were better under Treatment A (M = 6.000, SD = 1.000) versus
Treatment B (M = 4.000, SD = 1.000).

7. What is the interpretation of that pattern?


Among this sample of patients, Treatment A appears to have been more effective
for those with anxiety disorders, while Treatment B appears to have been more ef-
fective for those with mood disorders.
Finally, we will assemble this information in a short results section.

Results

We used a factorial ANOVA to determine if treatment outcomes varied


across the interaction of treatment type and disorder type for psycho-
therapy patients. There was a significant difference in treatment out-
comes based on the interaction of treatment type and disorder type
(F1, 8 = 18.750, p = .003). The interaction accounted for about 50% of
the variance in treatment outcomes (ω2 = .503). To determine how out-
comes differed across the interaction, we used simple effects analysis.
Among participants diagnosed with mood disorders, there was a sig-
nificant difference in treatment outcomes between Treatment A and
Treatment B (F1, 8 = 13.500, p = .006). Similarly, among those diag-
nosed with anxiety disorders, there was a significant difference in treat-
ment outcomes between the two treatment types (F1, 8 = 6.000, p = .040).
Among those diagnosed with mood disorders, treatment outcomes were
better under Treatment B (M = 5.000, SD = 1.000) as compared with
Treatment A (M = 2.000, SD = 1.000). On the other hand, among those
with anxiety disorders, treatment outcomes were better under Treatment
A (M = 6.000, SD = 1.000) versus Treatment B (M = 4.000, SD = 1.000).
Among this sample of patients, Treatment A appears to have been more
effective for those with anxiety disorders, while Treatment B appears to
have been more effective for those with mood disorders.


Table 10.1
Descriptive Statistics for Treatment Outcomes

Disorder Type Treatment Type M SD N

Mood Disorder Treatment A 2.000 1.000 3


Treatment B 5.000 1.000 3
Total 3.500 1.871 6
Anxiety Disorder Treatment A 6.000 1.000 3
Treatment B 4.000 1.000 3
Total 5.000 1.414 6
Total Treatment A 4.000 2.366 6
Treatment B 4.500 1.049 6
Total 4.250 1.765 12

We might also choose to include a table of descriptive statistics. This is not absolutely
necessary as we have included cell means and standard deviations in the text (which we
can produce in the Exploration→Descriptives menu as explained in prior chapters, adding
both independent variables to the “Split By” box), but can still be helpful for readers as it
includes additional information. In the example Table 10.1, we have added some addi-
tional horizontal lines to make it clearer for readers where the change in disorder type falls.
In the next chapter, we’ll work through some examples of published research that used
factorial ANOVA designs, following these same steps.

Note
1 Note that for this option to appear, you must have the GAMLj module installed. To do so, if
you have not already, click the “Modules” button in the upper right corner of jamovi (which
has a plus sign above it). Then click “jamovi library.” Locate GAMLj—General Analyses for
Linear Models and click “Install.” While a factorial ANOVA can be produced in the ANOVA
menu, it does not have options that are quite as robust, particularly for the follow-up analysis
we will demonstrate in this chapter.
11
Factorial ANOVA case studies

Case Study 1: bullying and LGBTQ youth 179


Research questions 180
Hypotheses 180
Variables being measured 180
Conducting the analysis 180
Write-up 181
Case study 2: social participation and special educational needs 184
Research questions 184
Hypotheses 184
Variables being measured 184
Conducting the analysis 184
Write-up 186
Note 189

In the previous chapter, we explored the factorial ANOVA using a made-up example
and some fabricated data. In this chapter, we will present several examples of published
research that used the factorial ANOVA. For each sample, we encourage you to:

1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the factorial ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.

CASE STUDY 1: BULLYING AND LGBTQ YOUTH


Perez, E. R., Schanding, G. T., & Dao, T. K. (2013). Educators’ perceptions in address-
ing bullying of LGBTQ/gender nonconforming youth. Journal of School Violence, 12(1),
64–79. https://doi.org/10.1080/15388220.2012.731663.
In this study, the authors were interested in understanding how educators perceived bul-
lying of LGBTQ and gender nonconforming youth as compared to other youth. They sur-
veyed educators on the seriousness of bullying, their empathy for students who are bullied,
and their likelihood to intervene. In this case study, we will focus on the ratings of seriousness.


Research questions
The authors had several research questions, of which we focus here on one: Would educa-
tor perceptions of the seriousness of bullying vary based on the combination of the bully-
ing type (verbal, relational, or physical) and the scenario type (LGBTQ or non-LGBTQ)?

Hypotheses
The authors hypothesized the following related to perceptions of seriousness:

H0: There was no significant difference in educator perceptions of the seriousness of
bullying based on the combination of the bullying type (verbal, relational, or physical)
and the scenario type (LGBTQ or non-LGBTQ). (MLGBTQxVerbal = MLGBTQxRelational =
MLGBTQxPhysical = MNon-LGBTQxVerbal = MNon-LGBTQxRelational = MNon-LGBTQxPhysical)
H1: There was a significant difference in educator perceptions of the seriousness of
bullying based on the combination of the bullying type (verbal, relational, or physical)
and the scenario type (LGBTQ or non-LGBTQ). (MLGBTQxVerbal ≠ MLGBTQxRelational ≠
MLGBTQxPhysical ≠ MNon-LGBTQxVerbal ≠ MNon-LGBTQxRelational ≠ MNon-LGBTQxPhysical)

Variables being measured


To measure perceptions of bullying, the authors used the Bullying Attitude Question-
naire-Modified (BAQ-M). The scale consisted of 5-point Likert-type items, which were
averaged to create scale scores. The authors report prior work regarding validity evidence,
and they also report internal consistency measured by coefficient alpha ranging from .68 to
.92, which is in the acceptable range. The authors measured LGBTQ versus non-LGBTQ by
randomly assigning participants to read BAQ-M items that mentioned LGBTQ students
in the scenario or BAQ-M items without mention of LGBTQ students. Participants were
also randomly assigned to rate scenarios involving verbal, relational, or physical bullying.

Conducting the analysis


1. What test did they use, and why?
The authors used a factorial ANOVA to determine if educator perceptions of the
seriousness of bullying incidents varied based on the interaction of the type of
bullying (verbal, relational, or physical) and whether or not the scenario involved
an LGBTQ student.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variable is comprised of averaged Likert-type data, so will be
treated as interval.
b. Normality of the dependent variable
The authors did not report data on normality, which is typical in journal articles
if the assumption of normality was met. In practice, we would test for normality
as a preliminary step in the analysis, using skewness and kurtosis statistics, even
if we did not ultimately include that information in the published article.

c. Observations are independent


The authors note no factors that threaten independence.
d. Random sampling and assignment
The sample was not random and involved a sample of experienced educators
who were recruited via social media and email. The participants were, however,
randomly assigned to groups on both independent variables.
e. Homogeneity of variance
The assumption of homogeneity of variance was not met (F5, 180 = 5.182, p <
.001). All cell sizes were relatively equal, with the largest cell having n = 32 and
the smallest n = 28. In addition, the ratio of the largest standard deviation to the
smallest was less than three. So it is likely safe to proceed assuming homogene-
ity of variance.
3. What was the result of that test?
There was a significant difference in educator perceptions of seriousness based on
the interaction (F2, 180 = 12.377, p < .001).1
4. What was the effect size, and how is it interpreted?
ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W) = (8.528 − (2)(.344)) / (79.436 + .344) = (8.528 − .688) / 79.780 = 7.840 / 79.780 = .098
About 10% of the variance in educator perceptions of seriousness was explained
by the combination of the type of bullying and whether the scenario involved an
LGBTQ student (ω2 = .098).
5. What is the appropriate follow-up analysis?
To explore the disordinal interaction and determine how cells differed from one
another, we used simple effects analysis.
6. What is the result of the follow-up analysis?
There was a significant difference between those rating scenarios involving LGBTQ
students and those rating scenarios that did not involve LGBTQ students among those
rating verbal bullying (F1, 180 = 15.120, p < .001), those rating relational bullying
(F1, 180 = 19.604, p < .001), and those rating physical bullying (F1, 180 = 3.901, p = .050).
7. What is the pattern of group differences?
Among the present sample, participants rated verbal bullying and relational bully-
ing as more serious among LGBTQ scenarios, but rated physical bullying as more
serious among non-LGBTQ scenarios.

Write-up

Results

We used a factorial ANOVA to determine if educator perceptions of the


seriousness of bullying incidents varied based on the interaction of the
type of bullying (verbal, relational, or physical) and whether the scenario
involved an LGBTQ student or not. There was a significant difference in


educator perceptions of seriousness based on the interaction (F2, 180 =
12.377, p < .001). About 10% of the variance in educator perceptions of
seriousness was explained by the combination of the type of bullying and
whether the scenario involved an LGBTQ student (ω2 = .098). To explore
the disordinal interaction and determine how cells differed from one
another, we used simple effects analysis. There was a significant differ-
ence between those rating scenarios involving LGBTQ students and those
rating scenarios that did not involve LGBTQ students among those rating verbal
bullying (F1, 180 = 15.120, p < .001), those rating relational bullying (F1, 180
= 19.604, p < .001), and those rating physical bullying (F1, 180 = 3.901, p =
.050). See Table 11.1 for descriptive statistics, and Figure 11.1 for a plot
of cell means. Among the present sample, participants rated verbal bully-
ing and relational bullying as more serious among LGBTQ scenarios, but
rated physical bullying as more serious among non-LGBTQ scenarios.

Table 11.1
Descriptive Statistics for Seriousness Ratings

Scenario Type   Bullying Type   M       SD     N

LGBTQ           Verbal          4.800   .340   32
                Relational      4.660   .480   33
                Physical        4.570   .870   32
                Total           4.677   .606   97
Non-LGBTQ       Verbal          4.220   .680   30
                Relational      4.010   .630   31
                Physical        4.870   .290   28
                Total           4.351   .668   89
Total           Verbal          4.519   .603   62
                Relational      4.345   .643   64
                Physical        4.710   .677   60
                Total           4.521   .655   186

[Figure 11.1 is a line plot of mean seriousness ratings for Verbal, Relational, and Physical bullying, with separate lines for the LGBTQ and Non-LGBTQ scenario conditions.]

Figure 11.1  Plot of Cell Means.

Notice that, because the interaction was significant, all of our interpretive attention is
on the interaction. In fact, we have not interpreted the main effects at all. In this sce-
nario, it’s particularly clear that main effects would be misleading in the presence of
an interaction. For example, we would find (using main effects) that LGBTQ scenarios
were rated higher in seriousness, but that pattern is reversed for physical bullying. So
when there is an interaction, our attention will normally be entirely on that interaction.
For the table, it would be placed after the References page, on a new page, with one
table per page.
Next, the figure would go on a new page after the final table, with one figure per page
(if more than one figure is included).

CASE STUDY 2: SOCIAL PARTICIPATION AND


SPECIAL EDUCATIONAL NEEDS
Bossaert, G., de Boer, A. A., Frostad, P., Pijl, S. J., & Petry, K. (2015). Social participation
of students with special educational needs in different educational systems. Irish Educa-
tional Studies, 34(1), 43–54. https://doi.org/10.1080/03323315.2015.1010703.
In this article, the authors examine inclusive education for students with special
educational needs. The authors focus on social participation of students with special
educational needs and suggest that outcome might be a better measure of inclusive
education than other measures that have been used. They compared students across
three countries (Norway, the Netherlands, and the Flemish region of Belgium), and
compared students with special educational needs and those without special educa-
tional needs. They tested the interaction of special educational needs status and coun-
try on social participation.

Research questions
In this study, the authors examined a single primary research question: Would social
participation differ across the interaction of country and special educational needs?

Hypotheses
The authors hypothesized the following related to social participation:

H0: There was no difference in social participation based on the interaction of special
educational needs status and country. (MNorwayXSpecialNeeds = MNetherlandsXSpecialNeeds =
MBelgiumXSpecialNeeds = MNorwayXNoSpecialNeeds = MNetherlandsXNoSpecialNeeds = MBelgiumXNoSpecialNeeds)
H1: There was a difference in social participation based on the interaction of special
educational needs status and country. (MNorwayXSpecialNeeds ≠ MNetherlandsXSpecialNeeds ≠
MBelgiumXSpecialNeeds ≠ MNorwayXNoSpecialNeeds ≠ MNetherlandsXNoSpecialNeeds ≠ MBelgiumXNoSpecialNeeds)

Variables being measured


The authors measured social acceptance by gathering peer nomination data from participants'
classmates. Special educational needs were measured by school diagnostic categories,
including: typically developing students; students with special educational needs; students with
behavioral problems; and students with other special educational needs. Country was
measured by the location of participants’ schools. The authors did not offer psychomet-
ric evidence for any of these variables because all variables except social acceptance were
based on known categories. For social acceptance, the authors offer a theoretical ration-
ale for their method of measuring acceptance using peer nominations.

Conducting the analysis


1. What test did they use, and why?
The authors used the factorial ANOVA to determine if students’ social acceptance
would vary based on the interaction of country (Norway, the Netherlands, and
Belgium) and special educational needs status.

2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variable was social acceptance as measured by peer nominations.
This variable was measured at the ratio level because it had a true, absolute zero.
b. Normality of the dependent variable
The authors reported that there were problems with normality. They reported
that peer acceptance was not normally distributed, but did not specify how the
data were non-normal. In most instances, it would be appropriate to provide the
normality statistics. As a reminder, the data in the online resources are simu-
lated data and so will be normally distributed, although the distribution was not
normal in the published study.
c. Observations are independent
The authors did not discuss this assumption in the published article. There are
potential factors like teacher or school variance that might cause some depend-
ence. But, in this case, the authors are most interested in comparing the coun-
tries and have included that as an independent variable. They also acknowledge
that countries are internally heterogeneous in how schools might operate and
note this as a limitation of the study.
d. Random sampling and assignment
The sample was not random but involved multi-site data collection. The authors
also acknowledge the limitation of their sampling strategy, which was not broad
in each country, in making international comparisons. Neither independent
variable was randomly assigned as both independent variables involved intact
groups. For special educational needs, there were very uneven group sizes,
which presents some complications for an ANOVA design.
e. Homogeneity of variance
The assumption of homogeneity of variance was not met (F11, 1323 = 1.988,
p = .026). The cell sizes are very uneven, with the largest cell having n = 469,
and the smallest having n = 14. However, the largest standard deviation (3.120)
divided by the smallest (1.340) is less than three (3.120/1.340 = 2.328). It is likely
safe to proceed with the unadjusted factorial ANOVA based on the criteria we
provided in Chapter 8, but we will note the heterogeneity of variance (lack of
homogeneity of variance) in the Results section.
3. What was the result of that test?
There was no significant difference in social participation based on the interaction
of the country and SEN grouping (F6, 1323 = .592, p = .737).
4. What was the effect size, and how is it interpreted?
For the interaction:
ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W) = (16.328 − (6)(4.598)) / (6424.712 + 4.598) = (16.328 − 27.588) / 6429.310 = −11.260 / 6429.310 = −.002 → .000
Remember that omega squared cannot be negative. In the case of extremely small
effects, the formula might return a negative value, as it has done here. But we will
report and interpret this as .000, or that none of the variance in social acceptance
was explained by the interaction of country and SEN group.

5. What is the appropriate follow-up analysis?


Because there is no significant interaction, the appropriate follow-up analysis is to
examine the main effects of country and SEN group.
6. What is the result of the follow-up analysis?
There was a significant difference in social acceptance among the three countries
(F2, 1323 = 5.097, p = .006). There was also a significant difference in social accept-
ance between the four SEN groups (F3, 1323 = 17.549, p < .001).
We can also calculate omega squared for both of the main effects.
Country:
ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W) = (46.866 − (2)(4.598)) / (6424.712 + 4.598) = 37.670 / 6429.310 = .006
SEN groups:
ω² = (SS_E − (df_E)(MS_W)) / (SS_T + MS_W) = (242.046 − (3)(4.598)) / (6424.712 + 4.598) = 228.252 / 6429.310 = .036
Because there were significant differences on main effects that have more than
two groups, we also need a post-hoc test:
We used Scheffe post-hoc tests to determine how the three countries differed.
There was a significant difference between Belgium and Norway (p = .005), but no
significant differences between Belgium and the Netherlands (p = .841) or between
the Netherlands and Norway (p = .137).
We used Scheffe post-hoc tests to determine how the four SEN groups differed.
There was a significant difference between typically developing students and spe-
cial education needs students (p < .001), students with behavioral needs (p = .001),
and students with other special needs (p = .002). There was no significant differ-
ence between students with special educational needs and those with behavioral
needs (p > .999) or other special needs (p > .999). There was also no significant
difference between those with behavioral needs and those with other special needs
(p > .999).
7. What is the pattern of group differences?
Social acceptance was higher in Belgium when compared with Norway. Also, students
labeled as typically developing had higher social acceptance than those in any of the
three groups of special educational needs.

Write-up

Results

We used the factorial ANOVA to determine if students’ social acceptance


would vary based on the interaction of country (Norway, the Netherlands,


and Belgium) and special educational needs status (typically developing,


special educational needs, behavioral needs, or other educational needs).
The assumption of homogeneity of variance was not met (F11, 1323 = 1.988,
p = .026), likely due to the unbalanced sample sizes between groups.
There was no significant difference in social participation based on the
interaction of the country and SEN grouping (F6, 1323 = .592, p = .737,
ω2 = .000). Because there was no interaction, we examined the main
effects of country and SEN group.
There was a significant difference in social acceptance among the
three countries (F2, 1323 = 5.097, p = .006). Country accounted for less than
1% of the variance in social acceptance (ω2 = .006). We used Scheffe
post-hoc tests to determine how the three countries differed. There was
a significant difference between Belgium and Norway (p = .005), but no
significant differences between Belgium and the Netherlands (p = .841)
or between the Netherlands and Norway (p = .137).
There was also a significant difference in social acceptance between
the four SEN groups (F3, 1323 = 17.549, p < .001). SEN group accounted
for about 4% of the variance in social acceptance (ω2 = .036). We used
Scheffe post-hoc tests to determine how the four SEN groups differed.
There was a significant difference between typically developing students
and special education needs students (p < .001), students with behavioral
needs (p = .001), and students with other special needs (p = .002). There
was no significant difference between students with special educational
needs and those with behavioral needs (p > .999) or other special needs


(p > .999). There was also no significant difference between those with
behavioral needs and those with other special needs (p > .999).
Overall, while there was no significant interaction between country
and SEN groups, there were significant main effects for both variables.
Specifically, students in Belgium had higher social acceptance, on aver-
age, than those in Norway. Typically developing students had
higher social acceptance than students from the three other SEN groups,
suggesting social acceptance is lower, on average, for students with spe-
cial needs, regardless of the type of special educational need. See Table
11.2 for descriptive statistics.

Table 11.2
Descriptive Statistics for Social Acceptance

Country        SEN Group                   M       SD      N

Belgium        Typically developing        4.250   2.280   469
               Special educational needs   3.090   1.990   43
               Behavioral needs            2.970   2.240   29
               Other special needs         3.360   1.340   14
               Total                       4.071   2.273   555
Netherlands    Typically developing        4.180   2.230   187
               Special educational needs   3.310   2.290   29
               Behavioral needs            3.300   1.900   20
               Other special needs         3.330   3.120   9
               Total                       3.974   2.265   245
Norway         Typically developing        3.860   2.000   461
               Special educational needs   2.300   1.910   37
               Behavioral needs            1.910   1.700   11
               Other special needs         2.460   2.010   26
               Total                       3.644   2.057   535
Total          Typically developing        4.077   2.166   1117
               Special educational needs   2.880   2.073   109
               Behavioral needs            2.886   2.067   60
               Other special needs         2.877   2.101   49
               Total                       3.882   2.195   1335

The table would go after the references page, starting on a new page, and if there was
more than one table, they would be one per page.
As with the previous case studies in this text, we encourage you to compare our ver-
sion of the Results with what is in the published article. How do they differ from each
other? Why do they differ? How did this analysis fit into the overall research design
and article structure? Comparing can be helpful in seeing the many different styles and
approaches researchers use in writing about this design.
For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.

Note
1 Please note that the values from the simulated data provided in the online course resources
differ slightly from the authors’ calculations. This is an artifact of the simulation process and
the authors’ results are not incorrect or in doubt.
Part IV
Within-subjects designs

12
Comparing two within-subjects
scores using the paired samples t-test

Introducing the paired samples t-Test 194


Research design and the paired samples t-Test 194
Assumptions of the paired samples t-Test 194
Level of measurement for the dependent variable is interval or ratio 195
Normality of the dependent variable 195
Observations are independent 195
Random sampling and assignment 195
Calculating the test statistic t 196
Calculating the paired samples t-test 196
Partitioning variance 196
Using the t critical value table 198
One-tailed and two-tailed t-tests 198
Interpreting the test statistics 198
Effect size for the paired samples t-test 198
Determining and interpreting the pattern of difference 199
Computing the test in jamovi 199
Writing up the results 202

In this section of the book, we will explore within-subjects designs. The simplest of these
designs is the paired samples design. The difference between between-subjects designs
and within-subjects designs is the nature of the independent variable. In between-sub-
jects designs, the independent variable was always a grouping variable. For example,
an independent samples t-test might be used to determine the difference between an
experimental and a control group. In within-subjects designs, the independent variable
is based on repeated measures. For example, we might have a within-subjects independ-
ent variable with two levels, such as a pre-test post-test design. Rather than having two
different groups we wish to compare, we would have two different sets of scores from the
same participants we wish to compare. The designs are called within-subjects because we
are comparing data points from the same participants rather than comparing groups of
participants.


INTRODUCING THE PAIRED SAMPLES t-TEST


The paired samples t-test, then, works with independent variables that are within-sub-
jects and have only two levels (much like the independent samples t-test, which required
an independent variable with two levels, but that test was between-subjects).

RESEARCH DESIGN AND THE PAIRED SAMPLES t-TEST


Research design in the paired samples t-test can involve any situation in which we wish
to compare two data points within the same set of participants. The most obvious exam-
ple is the pre-test post-test design. For example, imagine we are interested in improving
students’ mathematics self-efficacy (their sense of their ability to succeed in mathemat-
ics). We begin the project by administering a measure of mathematics self-efficacy. Then
we ask participants to complete a journaling task each day for three weeks, with journ-
aling prompts that are designed to enhance student self-efficacy. At the end of the three
weeks, we administer the mathematics self-efficacy measure a second time. The paired
samples t-test will allow us to determine if participants’ mathematics self-efficacy sig-
nificantly increased over the three-week program. There are many designs like this that
use pre- and post-test measures. For example, test scores before and after a workshop,
depression scores before and after psychotherapy, body mass index before and after a
nutritional program, and many more. Notice that for this design to work, the measure
needs to be the same at pre- and post-test.
The fact that these designs involve giving the same measure more than once presents
a special challenge for research design. When people take the same test more than once,
especially if that test is a test of ability or achievement, their scores tend to improve across
multiple administrations. The reason for this is clear on ability or achievement tests—
people get better at the tasks when they practice. However, practice effects can occur on
attitudinal or social/behavioral measures as well. Simply taking the pre-test can sensitize
participants to the construct being measured, which in itself can cause changes in the
test scores. Because of this issue, researchers must be careful to use measures that have
demonstrated test-retest reliability and for which practice effects have been evaluated.
However, there are also many other ways to design within-subjects research. Rather
than pre- and post-tests, for example, participants might take the same measure in two
different situations. For example, a researcher might ask participants to rate the credibil-
ity of two different speakers or texts. In a relatively common design, participants might
be asked to rate two different products on some scale, like tasting two varieties of apples
and rating their flavor. It is possible to use this design any time we have two points of data
from the same participant, provided that those two data points are comparable. In all the
examples above, the participants completed the same measure for both data points, so the
data are comparable. This design can’t be used if the data are not comparable. For exam-
ple, we could not use this design if we measured self-efficacy at pre-test and achievement
at post-test, or if we asked participants to rate Granny Smith apples for flavor and Golden
Delicious apples for texture. The data must be comparable for the design to work.

ASSUMPTIONS OF THE PAIRED SAMPLES t-TEST


The assumptions for this test will largely be familiar from previous tests. In fact, we’ve
encountered all of these assumptions before. However, some of them apply in a slightly

different way in the within-subjects design than they did in between-subjects designs.
We’ll briefly review all of the assumptions and how they apply to this design.

Level of measurement for the dependent variable is interval or ratio


Like all of the analyses we’ve encountered so far, the paired samples t-test requires a
dependent variable that is continuous. That is, the dependent variable must be either
interval or ratio. This is something that would be built into the research design, like it
was in the previous designs.

Normality of the dependent variable


As with the prior analyses, this test requires the dependent variable to be normally dis-
tributed. We’ve discussed in prior chapters how to test for normality using skewness
and kurtosis statistics. Those same statistics will apply here. However, the assumption
of normality in the within-subjects design applies to all data for the dependent varia-
ble. Because this design involves two sets of data from every participant, our data will
be organized as two variables, as we will explore later in this chapter. To appropriately
evaluate this assumption, then, we have to combine all data, regardless of level on the
within-subjects variable. For example, in a pre-test post-test design, we would combine
all of the scores from the pre-test and the post-test, and evaluate them together for nor-
mality. Other than that quirk, it is the same test for normality as we’ve used in the past.

Observations are independent


As with prior designs, this test will assume that all cases are independent of all other
cases. However, this comes with a special exception in the case of the paired samples
t-test. Namely, there will be dependent observations within a subject. For example, we
assume some dependency between pre-test and post-test scores from the same par-
ticipant. So the real issue here remains the independence of the participants and their
scores. As with the previous designs, the biggest issue with independence is likely to be
related to issues of nesting (like students within teachers, teachers within schools, etc.).

Random sampling and assignment


The random sampling half of this assumption remains as it was in prior designs. The test
assumes that the sample has been randomly pulled from the population. We’ve discussed
in prior chapters why that assumption is almost never met, and that the question for us in
using this test is how biased the sample might be. The more biased the sample, the more
limited the inferences we can draw, particularly with regard to generalizability. However,
what will random assignment mean in this design? There are no groups, so it’s not possible
to randomly assign to groups. Instead, everyone has been tested or measured twice. So,
in this design, the issue of random assignment is about the order of administration. Were
people randomly assigned to an order of administration on the levels of the within-sub-
jects variable? For example, if we ask participants to rate the credibility of two speakers,
we could randomly assign some people to rate speaker 1 first and speaker 2 second, and
others to rate speaker 2 first and speaker 1 second. The advantage of doing so is that this

randomization of order, often referred to as counterbalancing, will help control for order
effects. Take the example of a taste test design, where participants will rate two new flavor
options for soda—grape and watermelon. It might be that the watermelon has a strong
aftertaste, so a person tasting it first might think the grape flavor tastes worse than they
would if they’d had it first. Because order effects are hard to anticipate in many cases, ran-
domly assigning order of administration or counterbalancing the order of administration
can help test and control for those order effects. We also mentioned earlier that on ability
tests and cognitive tests, there is often a practice effect, which counterbalanced order of
administration can help control for. The issue is that many within-subjects designs are
longitudinal, like the pre-test post-test design. In those cases, counterbalancing is not
possible, which means we cannot rule out order effects or practice effects.
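
When counterbalancing is possible, the assignment itself is easy to automate. Below is a minimal sketch in Python (the participant IDs and order labels are hypothetical, purely for illustration):

import random

# Hypothetical participant IDs and the two possible orders of administration
participants = ["P01", "P02", "P03", "P04", "P05", "P06"]
orders = ["speaker 1 first", "speaker 2 first"]

random.seed(1)  # fixed seed so the assignment can be reproduced

# Build a balanced list (half of the sample in each order), then shuffle it
balanced_orders = orders * (len(participants) // 2)
random.shuffle(balanced_orders)

for participant, order in zip(participants, balanced_orders):
    print(participant, order)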

CALCULATING THE TEST STATISTIC t


In other tests so far, the test statistic has been a ratio of between-subjects variation over
within-subjects variation or error. However, in this design, we have no between-subjects
factor (no grouping variable) and instead want to know about variation within subjects.
As a result, the paired samples t-test will be a ratio of variance between the levels of
the within-subjects variable over the variance within levels of the within-subjects vari-
able. The logic works very similarly to other tests we’ve learned so far. Like in the inde-
pendent samples t-test, the numerator of the formula will be a mean difference, and the
denominator will be the standard error of the difference. However, we will get to these
two values in a different manner in the paired samples t-test to account for the with-
in-subjects design.

Calculating the paired samples t-test


The formula for the paired samples t-test is:

t = D̄ / SED

where D is the difference between the two data points from each subject. The numer-
ator, then, is the mean difference between the two data points (the two levels of the
within-subjects independent variable). The denominator is the standard error of the
difference.

Partitioning variance
Calculating the mean difference is fairly straightforward. We simply calculate the dif-
ference between the two levels of the within-subjects variable for every participant and
then take the mean of those difference scores. To illustrate, we’ll return to an example
from earlier in the chapter. Imagine we’ve recruited participants to complete a workshop
designed to increase their mathematics self-efficacy. We give participants a measure of
their mathematics self-efficacy before and after the workshop. Were their mathematics
self-efficacy scores higher following the workshop? Based on those two scores, we can
calculate the difference scores and the mean difference as follows:

Pre-Test | Post-Test | D
3 | 6 | 6 − 3 = 3
4 | 4 | 4 − 4 = 0
2 | 4 | 4 − 2 = 2
3 | 5 | 5 − 3 = 2
4 | 3 | 3 − 4 = −1
1 | 3 | 3 − 1 = 2

D̄ = ΣD / N = (3 + 0 + 2 + 2 + (−1) + 2) / 6 = 8 / 6 = 1.333
So the mean difference is 1.333. That will be the numerator for the t formula. We next
need to calculate the standard error of the difference for the denominator. The standard
error is calculated as:
SED = √( SSD / (N(N − 1)) )
However, to use this formula, we’ll first need to calculate the sum of squares for the difference scores, which is calculated as:

SSD = ΣD² − (ΣD)² / N


This formula has some redundant parentheses to make it very clear when to square these
figures. The sum of squares will be calculated as the sum of the squared difference scores
(notice here the difference scores are squared, then summed) minus the sum of the dif-
ference scores squared (notice here the scores are summed, then squared) over sample
size. For our example, we could calculate this as follows:

Pre-Test | Post-Test | D | D²
3 | 6 | 6 − 3 = 3 | 3² = 9
4 | 4 | 4 − 4 = 0 | 0² = 0
2 | 4 | 4 − 2 = 2 | 2² = 4
3 | 5 | 5 − 3 = 2 | 2² = 4
4 | 3 | 3 − 4 = −1 | (−1)² = 1
1 | 3 | 3 − 1 = 2 | 2² = 4
 | | ΣD = 8 | ΣD² = 22

SSD = ΣD² − (ΣD)² / N = 22 − 8² / 6 = 22 − 64 / 6 = 22 − 10.667 = 11.333

We can then use the sum of squares to calculate the standard error of the difference:
SED = √( SSD / (N(N − 1)) ) = √( 11.333 / (6(6 − 1)) ) = √( 11.333 / 30 ) = √0.378 = 0.615
Finally, we put all of this into the t formula:
t = D̄ / SED = 1.333 / 0.615 = 2.167
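
If you would like to double-check this arithmetic before moving to jamovi, the same test can be run in a few lines of Python (a minimal sketch using scipy; the variable names are ours):

from scipy import stats

# The six pre-test and post-test scores from the example above
pre = [3, 4, 2, 3, 4, 1]
post = [6, 4, 4, 5, 3, 3]

t_stat, p_two_tailed = stats.ttest_rel(post, pre)
print(round(t_stat, 3))        # about 2.169 (the hand calculation gives 2.167 because of rounding)
print(round(p_two_tailed, 3))  # about .082 two-tailed; halve it for a one-tailed (directional) test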

Using the t critical value table


Using the critical value table is essentially the same in the paired samples test as it was
in the independent samples test. The degrees of freedom, however, are calculated dif-
ferently. In the paired samples t-test, there will be n − 1 degrees of freedom. So, in our
example, we will use Table A2, and find the row with n − 1 = 6 − 1 = 5 degrees of freedom.
The critical value for a one-tailed test would be 2.01 and for a two-tailed test would be
2.57, given that we had six participants.
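
If a critical value table is not handy, the same values can be pulled from the t distribution directly; a small Python sketch:

from scipy import stats

df = 6 - 1  # n - 1 degrees of freedom for the paired samples t-test
print(stats.t.ppf(0.95, df))   # one-tailed critical value at alpha = .05, about 2.015
print(stats.t.ppf(0.975, df))  # two-tailed critical value at alpha = .05, about 2.571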

One-tailed and two-tailed t-tests


Notice that in this example, it makes quite a bit of difference whether our test is one-
tailed or two-tailed, as our calculated t value exceeds the one-tailed critical value but
not the two-tailed critical value. We suggested in an earlier chapter that if there is no
evidence to the contrary we should default to a two-tailed test. However, in this case,
the research question was whether mathematics self-efficacy scores would be higher fol-
lowing the workshop. Because the question is directional (post-test scores will be higher
than pre-test scores), this is a directional or one-tailed test.

Interpreting the test statistics


Our calculated value of 2.167 is more than the critical value of 2.01, so we would reject the
null hypothesis. We conclude that because p < .05, there is a significant difference between
pre-test and post-test scores. Remember that for this comparison, the sign of the test sta-
tistic (whether it is negative or positive) does not matter for comparing to the critical value.
The sign only matters if the hypothesis is one-tailed (or directional), and then only to make
sure it is the direction we hypothesized. In the case of our example, we conclude that there
was a significant difference in mathematics self-efficacy between pre-test and post-test.

EFFECT SIZE FOR THE PAIRED SAMPLES t-TEST


For the paired samples t-test, we will use omega squared as the effect size estimate, using
the formula below:
ω² = (t² − 1) / (t² + n − 1)
This formula should look familiar as it is very similar to the formula for effect size in
the independent samples t-test. The major difference in the paired samples t-test is that,
because there are no groups, the denominator no longer involves summing the sample
sizes of the two groups. For our example, then, omega squared would be:
ω² = (t² − 1) / (t² + n − 1) = (2.167² − 1) / (2.167² + 6 − 1) = (4.696 − 1) / (4.696 + 5) = 3.696 / 9.696 = 0.381
We could interpret this as indicating that about 38% of the variance in mathematics
self-efficacy was explained by the change from pre-test to post-test. A common mistake
in interpreting this is to try to assign the difference to the intervention (e.g., that 38%
of the variance in mathematics self-efficacy was explained by the workshop). That’s not
true, especially in a pre-test post-test design. While we know that scores changed from
before to after the workshop, we don’t have evidence that demonstrates the workshop
caused that change. Time passed, and a number of other factors might have contributed.
We also cannot rule out practice or order effects. So, though it’s not a particularly satisfy-
ing interpretation, we can only suggest that 38% of the variance in mathematics self-ef-
ficacy was explained by the change from (or difference between) pre-test to post-test.
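
Because jamovi’s paired samples t-test menu reports Cohen’s d rather than omega squared, it can be convenient to script this hand calculation; a minimal sketch (the function name is ours):

def omega_squared_paired(t, n):
    # Omega squared for a paired samples t-test: (t^2 - 1) / (t^2 + n - 1)
    return (t ** 2 - 1) / (t ** 2 + n - 1)

print(round(omega_squared_paired(2.167, 6), 3))  # about 0.381, matching the hand calculation above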

DETERMINING AND INTERPRETING THE PATTERN OF DIFFERENCE


Interpreting the pattern of difference is perhaps the simplest step in the paired samples
t-test. The within-subjects independent variable has only two levels. The t-test result
showed us that the difference between those two levels was statistically significant. All
that’s left is to determine which set of scores was higher/lower. In our example, we can
clearly see that post-test scores were higher. We can determine that based on the direc-
tion of the difference scores, on the means for each time point, or by examining the
scores. So, in this example, participants had significantly higher mathematics self-effi-
cacy scores after the workshop.

COMPUTING THE TEST IN JAMOVI


To begin, in jamovi, we’ll need to set up two variables: one to capture the pre-test scores,
and another to capture the post-test scores. We have no grouping variable, so these are the
only two variables we will need. We can name the first column Pre and the second column
Post, using the Data tab and Setup menu in jamovi, and then enter our data.

We mentioned earlier that to test for normality we would need to first combine the pre-
and post-test scores into a single variable, because the assumption of normality is about
the entire distribution of dependent variable scores. (Thinking back to the between-sub-
jects designs, we didn’t have to do this because the dependent variable was already in a
single variable in jamovi—in the within-subjects designs, it’s split up into two or more.)
To do this, we’ll simply copy and paste the scores from both pre-test and post-test into
a new variable. It doesn’t matter what that variable is named because it’s only temporary
for the normality test.

Then we’ll analyze that new variable for normality in the same way as we have in the past.
In the Analyses tab, we’ll click Exploration and then Descriptives. In the menu that comes
up, we’ll select the new variable we created (which here is named C by default), and move it
to the Variable box using the arrow button. Then under Statistics, we’ll click Skewness and
Kurtosis. We could also uncheck the other options we don’t need at this time.

The resulting output will look like this.

This is evaluated just like in the previous chapters. The absolute value of skewness is less
than two times the standard error of skewness (.000 < 2(.637)) and the absolute value
of kurtosis is less than two times the standard error of kurtosis (.654 < 2(1.232)), so the
distribution is normal.
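
As a cross-check on this output, the same skewness and kurtosis statistics and their standard errors can be computed in Python; a minimal sketch assuming the combined pre- and post-test scores from our example:

import math
from scipy import stats

# All twelve dependent variable scores (six pre-test, six post-test) combined
combined = [3, 4, 2, 3, 4, 1, 6, 4, 4, 5, 3, 3]
n = len(combined)

skew = stats.skew(combined, bias=False)      # bias-corrected skewness, about 0.000
kurt = stats.kurtosis(combined, bias=False)  # bias-corrected excess kurtosis, about 0.654

# Standard errors from the usual formulas
se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))   # about 0.637
se_kurt = 2 * se_skew * math.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))  # about 1.232

print(abs(skew) < 2 * se_skew and abs(kurt) < 2 * se_kurt)  # True, so we treat the data as normal
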
Next, we can produce the paired samples t-test in the Analyses tab, then the “T-Tests”
menu, then “Paired Samples T-Test”. In the resulting menu, the box on the right shows
“Paired Variables.” Here we will “pair” our pre- and post-test scores, by clicking first on
Pre, then the arrow button, then Post, then the arrow button.
We might also want to check the boxes to produce the mean difference, confidence interval, and descriptives as well. The effect size option here produces
Cohen’s d, which is not ideal for our purposes, so we’ll instead hand calculate omega
squared. By default, under the “Hypothesis” options, it will specify a two-tailed hypothe-
sis. Our recommendation is to leave this setting alone, and if the test is one-tailed, simply
divide p by two. That tends to produce less confusion than using the hypothesis options
in the software.

The resulting output produces the paired samples t-test, and descriptives. First, let’s look
at the test itself.

We see that t at 5 degrees of freedom is −2.169, and p = .082. Remember that this is
the two-tailed probability. Because our hypothesis was one-tailed (that scores would
improve at post-test), we can divide that probability by half, so p = .041. It also tells us the
mean difference between pre- and post-test is −1.333 (negative because post-test is higher,
and the difference is pre-test minus post-test). 95% of the time in another sample of the
same size from the same population, the difference would be between -2.913 and 0.247,
based on the 95% confidence interval. So, we have a significant difference in students’
mathematics self-efficacy from pre-test to post-test. Next, we can look at the descriptives
to see how the scores changed.

WRITING UP THE RESULTS


Finally, we’re ready to write our results up in an APA-style Results section. We’ll follow
the same basic format as we did with the independent samples t-test, with some minor
changes:

1. What test did we use, and why?


2. If there were any issues with the statistical assumptions, report them.
3. What was the result of the test?
4. If the test was significant, what was the effect size? (If the test was not significant,
simply report effect size in #3.)
5. What is the pattern of differences?
6. What is your interpretation of that pattern?

For our example, we might answer these questions as follows:


1. What test did we use, and why?
We used a paired samples t-test to determine if students’ mathematics self-efficacy
tests were higher after the workshop than before it.
2. If there were any issues with the statistical assumptions, report them.
We did not find issues with the statistical assumptions. The issues with the design
limitations are more likely to be addressed in the Discussion section.

3. What was the result of the test?


There was a significant difference in mathematics self-efficacy scores in pre-test
versus post-test (t5 = −2.169, p = .041).
4. If the test was significant, what was the effect size? (If the test was not significant,
simply report effect size in #3.)
About 38% of the variance in self-efficacy scores was explained by the change from
pre- to post-test (ω2 = .381).
5. What is the pattern of differences?
Scores were significantly higher at post-test (M = 4.167, SD = 1.169) than at pre-test (M = 2.833, SD = 1.169).
6. What is your interpretation of that pattern?
Among this sample, students’ mathematics self-efficacy was higher following the
workshop.
Finally, we could pull all of this together into a short APA style Results section:

Results

We used a paired samples t-test to determine if students’ mathematics


self-efficacy tests were higher after the workshop than before it. There
was a significant difference in mathematics self-efficacy scores in pre-
test versus post-test (t5 = −2.169, p = .041). About 38% of the variance
in self-efficacy scores was explained by the change from pre- to post-
test (ω2 = .381). Scores were significantly higher at post-test (M = 4.167, SD = 1.169) than at pre-test (M = 2.833, SD = 1.169). Among this sample, students’
mathematics self-efficacy was higher following the workshop.

In this example, a table of descriptive statistics is unnecessary because we have already


included means and standard deviations in the Results section.
13
Paired samples t-test case studies

Case study 1: guided inquiry in chemistry education 205


Research questions 206
Hypotheses 206
Variables being measured 206
Conducting the analysis 206
Write-up 208
Case study 2: student learning in social statistics 208
Research questions 208
Hypotheses 209
Variables being measured 209
Conducting the analysis 209
Write-up 210
Notes 210

In the previous chapter, we explored the paired samples t-test using a made-up example
and some fabricated data. In this chapter, we will present several examples of published
research that used the paired samples t-test. We should note that, in these examples, the
simulated data provided in the online resources will not produce the exact result of the
published study. However, they will reproduce the essence of the finding—so don’t be
surprised to look up the published study and see somewhat different results.1 For each
sample, we encourage you to:

1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the t-test.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: the online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.

CASE STUDY 1: GUIDED INQUIRY IN CHEMISTRY EDUCATION


Vishnumolakala, V. R., Southam, D. C., Treagust, D. F., Mocerino, M., & Qureshi, S. (2017). Students’ attitudes, self-efficacy, and experiences in a modified process-oriented guided inquiry learning undergraduate chemistry classroom. Chemistry Education Research and Practice, 18(2), 340–352. https://doi.org/10.1039/C6RP00233A.
The researchers in this study followed first-year undergraduate chemistry students,
measuring their attitudes, self-efficacy, and self-reported experiences both before and
after a process-oriented guided inquiry learning intervention (POGIL). The purpose
of the intervention was to increase students’ attitudes and emotions about chemistry
coursework through the POGIL intervention. The authors identified two dependent var-
iables: emotional satisfaction and intellectual accessibility.

Research questions
The authors asked two research questions in the portion of the article we review in this
case study:
1. Were emotional satisfaction scores significantly higher after the POGIL interven-
tion than before the intervention?
2. Were intellectual accessibility scores significantly higher after the POGIL interven-
tion than before the intervention?

Hypotheses
The authors hypothesized the following related to emotional satisfaction:
H0: There was no significant difference in pre-test emotional satisfaction compared
to post-test. (Mpre = Mpost)
H1: There was a significant difference in pre-test emotional satisfaction compared
to post-test. (Mpre ≠ Mpost)
The authors hypothesized the following related to intellectual accessibility:
H0: There was no significant difference in pre-test intellectual accessibility compared
to post-test. (Mpre = Mpost)
H1: There was a significant difference in pre-test intellectual accessibility compared
to post-test. (Mpre ≠ Mpost)

Variables being measured


The authors measured both intellectual accessibility and emotional satisfaction using the
Attitudes toward the Study of Chemistry Inventory (ASCI). The ASCI is an eight-item scale
of seven-point Likert-type items. The authors reference prior research about reliability and
validity evidence related to the ASCI but do not provide information from the current study.

Conducting the analysis


1. What test did they use, and why?
The authors used two paired samples t-tests to determine if perceptions of intel-
lectual accessibility and emotional satisfaction toward chemistry would improve
significantly from before a POGIL intervention to after the intervention. Because


we used two paired samples t-tests, we adjusted the Type I error rate using the
Bonferroni inequality to set alpha at .025 in order to control for familywise error.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
Both intellectual accessibility and emotional satisfaction were measured using
averaged Likert-type data, so are interval.
b. Normality of the dependent variable
The authors do not discuss normality in the published paper. This is typical
when the data are normally distributed. Ideally, we would test for normality
prior to running the analysis by using skewness and kurtosis statistics, even if
those will not ultimately be reported in the manuscript. As a reminder, the data
on the course online resources will be normally distributed because of how they
were simulated.
c. Observations are independent
The authors note no potential nesting factors or other factors that might cause
dependence.
d. Random sampling and assignment
Participants were not randomly sampled and appear to be a convenience sam-
ple of university students. They were also not randomly assigned to order of
administration because the design was longitudinal (pre-test versus post-test)
so counterbalancing the order of administration was not possible.
3. What was the result of that test?
There was a significant difference in intellectual accessibility (t212 = −5.248, p < .001) and in emotional satisfaction (t212 = −3.406, p < .001).2
4. What was the effect size, and how is it interpreted?
For intellectual accessibility:
ω² = (t² − 1) / (t² + n − 1) = (5.248² − 1) / (5.248² + 213 − 1) = (27.542 − 1) / (27.542 + 212) = 26.542 / 239.542 = .111
For emotional satisfaction:
ω² = (t² − 1) / (t² + n − 1) = (3.406² − 1) / (3.406² + 213 − 1) = (11.601 − 1) / (11.601 + 212) = 10.601 / 223.601 = .047
About 11% of the variance in intellectual accessibility (ω2 = .111) and about 5%
of the variance in emotional satisfaction (ω2 = .047) was explained by the change
from before the intervention to after the intervention.
5. What is the pattern of group differences?
Intellectual accessibility was higher after the intervention (M = 4.190, SD = 1.060)
than before the intervention (M = 3.750, SD = .720). Similarly, emotional satisfac-
tion was higher after the intervention (M = 4.410, SD = .980) than before the inter-
vention (M = 4.100, SD = .880).

Write-up

Results

We used two paired samples t-tests to determine if perceptions of intel-


lectual accessibility and emotional satisfaction toward chemistry would
improve significantly from before a POGIL intervention to after the
intervention. Because we used two paired samples t-tests, we adjusted
the Type I error rate using the Bonferroni inequality to set alpha at .025
in order to control for familywise error. There was a significant differ-
ence in intellectual accessibility (t212 = −5.248, p < .001) and in emotional satisfaction (t212 = −3.406, p < .001). About 11% of the variance
in intellectual accessibility (ω2 = .111) and about 5% of the variance in
emotional satisfaction (ω2 = .047) was explained by the change from
before the intervention to after the intervention. Intellectual accessibil-
ity was higher after the intervention (M = 4.190, SD = 1.060) than
before the intervention (M = 3.750, SD = .720). Similarly, emotional
satisfaction was higher after the intervention (M = 4.410, SD = .980)
than before the intervention (M = 4.100, SD = .880).
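
As a cross-check, the two omega squared values above can be reproduced from the reported t values and the sample size of 213, using the same formula we hand calculated in Chapter 12; a minimal Python sketch (the function name is ours):

def omega_squared_paired(t, n):
    # Omega squared for a paired samples t-test: (t^2 - 1) / (t^2 + n - 1)
    return (t ** 2 - 1) / (t ** 2 + n - 1)

print(round(omega_squared_paired(-5.248, 213), 3))  # intellectual accessibility, about .111
print(round(omega_squared_paired(-3.406, 213), 3))  # emotional satisfaction, about .047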

CASE STUDY 2: STUDENT LEARNING IN SOCIAL STATISTICS


Delucchi, M. (2014). Measuring student learning in social statistics: A pre-
test-posttest study of knowledge gain. Teaching Sociology, 42(3), 231–239. https://
doi.org/10.1177/0092055X14527909.
In this article, the author was interested in assessing students’ gains in statistical
knowledge during an undergraduate sociology course. The design is relatively straight-
forward: he administered pre-test and post-test assessments of statistical knowledge to
students. In the full study, he assesses one class section per year for six years. In our case
study, we will focus on the overall comparison across all six sections.

Research questions
This study had one research question: Would students’ statistical knowledge test scores be
higher after an undergraduate sociology research course than they were before the course?

Hypotheses
The author hypothesized the following related to statistical knowledge:

H0: There was no difference in statistical knowledge from pre-test to post-test.


(Mpre = Mpost)
H1: Statistical knowledge scores were higher at post-test compared to pre-test.
(Mpre < Mpost)

Variables being measured


The author had one primary outcome measure: statistical knowledge. He measured
it using a multiple choice exam and offers no real reliability or validity information
about the test. It would be useful to include such information so that researchers could
assess the measurement of statistical knowledge, content/domain coverage, and score
reliability.

Conducting the analysis


1. What test did they use, and why?
The author used the paired samples t-test to determine if there was a significant
difference in statistical knowledge after an undergraduate sociology research
course as compared to before the course.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variable is measured as percent correct on a statistics quiz.
Percentage scores are ratio-level data.
b. Normality of the dependent variable
The author does not report normality information in the manuscript. That
is fairly typical in published work when the data were normally distributed.
However, it is good practice to test for normality using skewness and kurtosis
and their standard errors prior to using the t-test, even if it will not be included
in the manuscript. As a reminder, to do that test in a within-subjects design, we
would first need to combine pre- and post-test data into a single column.
c. Observations are independent
The author did not discuss any issues with independence. The scores are
exam scores, so administration is likely independent. It is possible there were
cohort effects, or that participants may have been studying in groups, but the
author was not able to measure or account for those sorts of nesting or pairing
factors.
d. Random sampling and assignment
The sample is a convenience sample of students who took a particular under-
graduate sociology research course. They are also not randomly assigned to
order of administration (there is no counterbalancing) because the design is
longitudinal (pre-test and post-test) so counterbalancing is impossible.
3. What was the result of that test?
There was a significant difference in statistical knowledge scores from pre-test to
post-test (t184 = −16.812, p < .001).3

4. What was the effect size, and how is it interpreted?


For this design:

ω² = (t² − 1) / (t² + n − 1) = (16.812² − 1) / (16.812² + 185 − 1) = (282.643 − 1) / (282.643 + 184) = 281.643 / 466.643 = .604

About 60% of the variance in statistical knowledge test scores was explained by
the change from before the sociology course to after the course (ω2 = .604).
5. What is the pattern of group differences?
Participants had significantly higher scores after the course (M = 64.800,
SD = 13.000) than they did before the course (M = 43.900, SD = 11.100).

Write-up

Results

The author used the paired samples t-test to determine if there was a
significant difference in statistical knowledge after an undergraduate
sociology research course as compared to before the course. There was
a significant difference in statistical knowledge scores from pre-test to
post-test (t184 = −16.812, p < .001). About 60% of the variance in statis-
tical knowledge test scores was explained by the change from before the
sociology course to after the course (ω2 = .604). Participants had signifi-
cantly higher scores after the course (M = 64.800, SD = 13.000) than
they did before the course (M = 43.900, SD = 11.100).

For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.

Notes
1 We simulate the data for the online resources by simulating data with a certain mean and
standard deviation. That works perfectly for the between-subjects designs, which really mea-
sure only mean differences. But for within-subjects designs like the paired-samples t-test, this
turns out fairly differently. The paired samples t-test uses the mean of differences per case,
rather than the mean difference overall. Because we do not know the mean difference per
case from the published work, we cannot simulate data that perfectly reproduce those results.
However, the overall mean difference and the direction of the result will be the same as the
published study. In most cases, this results in a smaller effect size for the simulated data in the
online resources than for the actual published study. The published results are not in doubt,
but we cannot perfectly reproduce them in our simulated data.
2 As a reminder, this value will not match the published study exactly. As a second note about
this value, the author reports a positive t-test value, so may have dropped the sign and
reported the absolute value, or set the comparison up as post- vs. pre-test instead of pre-test
vs. post-test. In addition, because this is a directional hypothesis (one-tailed test), we would
divide the probability values in half to get the one-tailed probabilities.
3 As a reminder, this value will not match the published study exactly. As a second note about
this value, the author reports a positive t-test value, so may have dropped the sign and
reported the absolute value, or set the comparison up as post- vs. pre-test instead of pre-test
vs. post-test. This would not affect the actual test values.
14
Comparing more than two points
from within the same sample

The within-subjects ANOVA

Introducing the within-subjects ANOVA 214


Research design and the within-subjects ANOVA 214
Assumptions of the within-subjects ANOVA 214
Level of measurement for the dependent variable is interval or ratio 215
Normality of the dependent variable 215
Observations are independent 215
Random sampling and assignment 215
Sphericity 216
Calculating the test statistic F 216
Partitioning variance 217
Between, within, and subject variance 217
Completing the source table 217
Using the F critical value table 220
Interpreting the test statistic 221
Effect size for the within-subjects ANOVA 221
Calculating omega squared 221
Eta squared 222
Determining how within-subjects levels differ from one
another and interpreting the pattern of differences 222
Comparison of available pairwise comparisons 223
Interpreting the pattern of pairwise differences 223
Computing the within-subjects ANOVA in jamovi 223
Writing Up the Results 229

In Chapter 12, we learned how to use the paired samples t-test, and in Chapter 13, we
saw case studies of published work using that analysis. However, the paired samples
t-test was only able to test differences in two within-subjects data points, much like the
independent samples t-test could only compare two groups. When there are more than
two groups to compare, the one-way ANOVA is the appropriate test. But when there are
more than two within-subjects data points, the within-subjects or repeated measures
ANOVA will be the correct analysis.

INTRODUCING THE WITHIN-SUBJECTS ANOVA


This analysis is much like the paired samples t-test, except that it will allow for more
than two within-subjects measures. For example, imagine we have planned a course
designed to help pre-service teachers to engage in culturally responsive teaching prac-
tices. We might give those teachers a measure of culturally responsive teaching before
the course (a pre-test), and another after the course (a post-test). We could test the dif-
ference between pre- and post-test scores using the paired samples t-test. However, in
many cases we are interested in evaluating whether improvements last after the end of a
course or intervention. In this example, we would want to know whether those teachers
keep using culturally responsive practices after the course is over. So, we might follow
up with the participants six months after the course and administer the same measure.
Will the teachers still show higher use of culturally responsive teaching practices at this
six-month follow-up? The within-subjects ANOVA will allow us to answer this question.
This analysis is also commonly referred to as a repeated measures ANOVA.

RESEARCH DESIGN AND THE WITHIN-SUBJECTS ANOVA


Research design in the within-subjects ANOVA is very much like the paired samples
t-test, except there can be more than two within-subjects data points. In our above
example, the within-subjects points were longitudinal (before, after, and six-month fol-
low-up). Other longitudinal designs would work too. For example, following students
across their first, sophomore, junior, and senior years, or tracking learning after multi-
ple workshops. The designs do not have to be longitudinal, though. To return to some
examples we offered in Chapter 12, we could measure taste ratings of multiple different
kinds of apples, or gather credibility perceptions of multiple speakers. So long as we
gather comparable data from all participants across multiple measurement points, the
within-subjects ANOVA will be appropriate.
As with all designs presented in this text, we also need large, balanced samples. The sam-
ples will be almost automatically balanced because of the within-subjects design. That is,
we should have the same number of observations in each within-subjects level because all
participants should complete all measurements. We also, though, need large samples for this
analysis. In general, the minimum sample size for this design will be 30. Notice that there
is no mention of “per group” here because there will be 30 in each level of the within-sub-
jects variable if we have 30 participants in total. In this way, within-subjects designs offer an
advantage over between-subjects designs because we can work with fewer total participants.

ASSUMPTIONS OF THE WITHIN-SUBJECTS ANOVA


The assumptions of the within-subjects ANOVA are the same as the other analyses we
have learned, with one new assumption we have not yet encountered that is specific to
within-subjects designs with more than two levels on the within-subjects variable. We
will briefly review the assumptions we’ve seen before, and give some more attention to
the new assumption, which is called sphericity.

Level of measurement for the dependent variable is interval or ratio


As with all of the analyses so far in this text, the dependent variable must be continuous
in nature. That is, the dependent variable must be measured at the interval or ratio level.
This is a design assumption, and we satisfy this assumption by ensuring we measure the
dependent variable in a way that produces continuous data. As with all other designs in
this text, this means the dependent variable cannot be nominal or ordinal—it cannot be
a categorical variable.

Normality of the dependent variable


Relatedly, as we have discovered in prior chapters, the dependent variable must also be
normally distributed. We discovered, in Chapter 12, a slight catch to this assumption in
within-subjects designs. In a within-subjects design, the assumption of normality applies
to all of the dependent variable data, regardless of the level of the within-subjects varia-
ble. In Chapter 12, we combined the dependent variable scores across levels of the with-
in-subjects variable. We will do the same in this design to test for normality; it’s just that
now there will be more data points to combine for that purpose. We will still evaluate the
combined dependent variable data using skewness and kurtosis statistics.

Observations are independent


This assumption has been present for all of the designs we’ve encountered in this text.
The observations must be independent, so we will need to evaluate the data for things
like nested structure (e.g., students are nested in teachers because each teacher will
have multiple students in class). We also learned with the paired samples t-test that this
assumption becomes especially tricky in the case of within-subjects designs because
the observations are necessarily dependent within an individual participant. That with-
in-person (or “within-subjects”) dependence is built into the statistical design, though.
Still, we will need to carefully evaluate the design for things like order or practice effects.

Random sampling and assignment


As we discussed in the paired samples t-test, one way to account or control for order or
practice effects is to counterbalance the order of administration. For example, if we are
gathering flavor ratings from participants about different kinds of apples, we can ran-
domly assign people to different orders of administration (tasting 1, 2, 3; 1, 3, 2; 2, 1, 3;
2, 3, 1; 3, 1, 2; and 3, 2, 1, randomly assigned). However, with longitudinal designs this is
not possible. In our earlier example about culturally responsive teaching, teachers were
assessed before the workshop, after the workshop, and six months later. In that design,
there is no possibility of counterbalancing the order of administration. Everyone will do
the pre-test first, post-test second, and six-month follow-up test third. So, in the case of
longitudinal design, it is not possible to counterbalance. Longitudinal designs offer many
benefits, as is clear in our example, but they do have a limitation that we cannot control
for order or practice effects.
The other part of this assumption is that the sample has been randomly selected
from the population (random sampling). We’ve discussed this in all prior analyses and
described why random sampling is not feasible in research with human participants.
Like with the previous analyses, in this design we will need to assess the adequacy of the
sampling strategy to determine how much generalization is reasonable from the data.
How far we can expect these results to translate beyond the sample is dependent on how
robust the sampling strategy was.

Sphericity
This design is the first time we are encountering the assumption of sphericity. It is a
related idea to homogeneity of variance, but that idea works differently in a within-sub-
jects design. Recall that, in between-subjects designs, the assumption of homogeneity of
variance was that the variance of each group is equal. That assumption cannot apply to
within-subjects designs, as there are no groups. In place of that assumption, we find the
assumption of sphericity. The assumption of sphericity is that all pairwise error variances
are equal. In other words, the error variance of each pair of levels on the within-subjects
variable is equal. For example, the error variance of pre-test vs. post-test is equal to pre-
test vs. six-month follow-up is equal to the error variance of post-test vs. six-month fol-
low-up. Because this assumption deals with pairwise error variance, it was not applicable
in the paired samples t-test (with only two levels of the within-subjects variable, there is
only one pair, so no comparison is possible).

Mauchly’s test for sphericity


The assumption of sphericity is evaluated in much the same way as homogeneity of
variance in that we will have a test for the assumption, and the null hypothesis for that
test is the same as the assumption. The test for this assumption is Mauchly’s test for
sphericity. The null hypothesis for Mauchly’s test is that the pairwise error variances are
equal. In other words, the null hypothesis for Mauchly’s test is sphericity. Notice this
is similar to Levene’s test, where the null hypothesis was homogeneity of variance. So
the decision is the same for Mauchly’s test as it was with Levene’s test: When we fail to
reject the null hypothesis on Mauchly’s test, the assumption of sphericity was met. In
other words, with p > .05 for Mauchly’s test, we have met the assumption of sphericity.

Corrections for lack of sphericity


Violations of the assumption of sphericity are somewhat more common than is hetero-
geneity of variance. We are more likely to fail Mauchly’s test than we were to fail Levene’s
test. Fortunately, there are several corrections available for lack of sphericity. The most
commonly used, and the most appropriate correction in the vast majority of cases, is the
Greenhouse-Geisser correction. This correction is available in jamovi and works simi-
larly to the correction for heterogeneity of variance we encountered earlier. When the
correction is selected, it is a separate line in the output and adjusts the degrees of freedom.

CALCULATING THE TEST STATISTIC F


We will take up the example of culturally responsive teaching following a workshop to
illustrate calculating the within-subjects ANOVA. One important note about this design
is that jamovi produces simplified output, which will leave out one source of variance
that we calculate (the Subjects source of variance). That omission means there will be
differences in how we illustrate the calculations and how they work out in jamovi. We
will highlight those differences in the section on using jamovi for the analysis.

Partitioning variance
The analysis works similarly, in terms of the calculations, to the one-way ANOVA. We will
have a source table with four sources of variance: Between, Subjects, Within, and Total.
Each source of variance will have a sum of squares, degrees of freedom, and mean square.
In our illustration of hand calculations, we will have two F ratios—Between and Subjects.
The jamovi software will produce the Between F ratio (marked by a “RM Factor” label).

Between, within, and subject variance


In the one-way ANOVA, “between” meant between groups. In the within-subjects
ANOVA, “between” means between levels of the within-subjects variable. The Subjects
source of variance is the variation within participants (across levels of the within-sub-
jects variable). Within variation will be the variation within a level of the within-subjects
variable. Total variation will still mean the total variation in the set of scores. These terms
will be calculated using the following formulas:

Source | SS | df | MS | F
Between | Σ(X̄k − X̄)² | k − 1 | SSbetween / dfbetween | MSbetween / MSwithin
Subjects | Σ(X̄subject − X̄)² | nsubjects − 1 | SSsubjects / dfsubjects | MSsubjects / MSwithin
Within | SStotal − SSbetween − SSsubjects | (dfbetween)(dfsubjects) | SSwithin / dfwithin |
Total | Σ(X − X̄)² | ntotal − 1 | |
Completing the source table


This source table has some calculations that are a bit different than our prior designs. In
the between sum of squares, we will calculate deviations between the mean for a level
of the within-subjects variables and the grand mean. In the subjects sum of squares, we
will calculate deviations between the mean for individual participants versus the grand
mean. For total sum of squares, we will calculate the deviations between each score and
the grand mean. Finally, the within sum of squares is calculated based on the other three
sources’ sums of squares. To illustrate, in our example we have teachers completing a
workshop on culturally responsive teaching, and they take an assessment for culturally
responsive teaching practices before the workshop, afterward, and six months later. Their
scores might be distributed as follows:

Participant | Pre-Test | Post-Test | Six Months Post-Test | Subject Means
1 | 8 | 17 | 14 | 39/3 = 13.00
2 | 6 | 18 | 16 | 40/3 = 13.33
3 | 9 | 20 | 15 | 44/3 = 14.67
4 | 4 | 17 | 14 | 35/3 = 11.67
5 | 7 | 19 | 18 | 44/3 = 14.67
Test means | 34/5 = 6.80 | 91/5 = 18.20 | 77/5 = 15.40 | Grand mean: 202/15 = 13.47

Notice that we have calculated the mean for each “test” (or level of the within-subjects
variable) across the bottom row, the mean for each subject across the right-most column,
and the grand mean in the bottom right cell. We will use those means to calculate the
sources of variance. We’ll begin by calculating total variation:

Participant | Test | X | X − X̄ | (X − X̄)²
1 | Pre | 8 | 8 − 13.47 = −5.47 | (−5.47)² = 29.92
1 | Post | 17 | 17 − 13.47 = 3.53 | (3.53)² = 12.46
1 | Six months | 14 | 14 − 13.47 = 0.53 | (0.53)² = 0.28
2 | Pre | 6 | 6 − 13.47 = −7.47 | (−7.47)² = 55.80
2 | Post | 18 | 18 − 13.47 = 4.53 | (4.53)² = 20.52
2 | Six months | 16 | 16 − 13.47 = 2.53 | (2.53)² = 6.40
3 | Pre | 9 | 9 − 13.47 = −4.47 | (−4.47)² = 19.98
3 | Post | 20 | 20 − 13.47 = 6.53 | (6.53)² = 42.64
3 | Six months | 15 | 15 − 13.47 = 1.53 | (1.53)² = 2.34
4 | Pre | 4 | 4 − 13.47 = −9.47 | (−9.47)² = 89.68
4 | Post | 17 | 17 − 13.47 = 3.53 | (3.53)² = 12.46
4 | Six months | 14 | 14 − 13.47 = 0.53 | (0.53)² = 0.28
5 | Pre | 7 | 7 − 13.47 = −6.47 | (−6.47)² = 41.86
5 | Post | 19 | 19 − 13.47 = 5.53 | (5.53)² = 30.58
5 | Six months | 18 | 18 − 13.47 = 4.53 | (4.53)² = 20.52

To get the sum of squares total, we add up the squared deviations for all scores, which
gives a sum of 385.72. So the total sum of squares is 385.72.

Next, we’ll calculate the between sum of squares, which will be the mean of each
observation minus the grand mean. For this calculation, we’ll be using the means for Pre
(6.80), Post (18.20), and Six months (15.40):

Participant | Test | X | X̄k − X̄ | (X̄k − X̄)²
1 | Pre | 8 | 6.80 − 13.47 = −6.67 | (−6.67)² = 44.49
1 | Post | 17 | 18.20 − 13.47 = 4.73 | (4.73)² = 22.37
1 | Six months | 14 | 15.40 − 13.47 = 1.93 | (1.93)² = 3.72
2 | Pre | 6 | 6.80 − 13.47 = −6.67 | (−6.67)² = 44.49
2 | Post | 18 | 18.20 − 13.47 = 4.73 | (4.73)² = 22.37
2 | Six months | 16 | 15.40 − 13.47 = 1.93 | (1.93)² = 3.72
3 | Pre | 9 | 6.80 − 13.47 = −6.67 | (−6.67)² = 44.49
3 | Post | 20 | 18.20 − 13.47 = 4.73 | (4.73)² = 22.37
3 | Six months | 15 | 15.40 − 13.47 = 1.93 | (1.93)² = 3.72
4 | Pre | 4 | 6.80 − 13.47 = −6.67 | (−6.67)² = 44.49
4 | Post | 17 | 18.20 − 13.47 = 4.73 | (4.73)² = 22.37
4 | Six months | 14 | 15.40 − 13.47 = 1.93 | (1.93)² = 3.72
5 | Pre | 7 | 6.80 − 13.47 = −6.67 | (−6.67)² = 44.49
5 | Post | 19 | 18.20 − 13.47 = 4.73 | (4.73)² = 22.37
5 | Six months | 18 | 15.40 − 13.47 = 1.93 | (1.93)² = 3.72

To get the sum of squares between, we add up all of these squared deviation scores,
which sum to 352.90. So the sum of squares between is 352.90.
Next, we will calculate the subjects sum of squares. For this calculation, we’ll use the
mean of each participant minus the grand mean. We’ll use the means for participant 1
(13.00), participant 2 (13.33), participant 3 (14.67), participant 4 (11.67), and participant
5 (14.67):

Participant | Test | X | X̄subject − X̄ | (X̄subject − X̄)²
1 | Pre | 8 | 13.00 − 13.47 = −0.47 | (−0.47)² = 0.22
1 | Post | 17 | 13.00 − 13.47 = −0.47 | (−0.47)² = 0.22
1 | Six months | 14 | 13.00 − 13.47 = −0.47 | (−0.47)² = 0.22
2 | Pre | 6 | 13.33 − 13.47 = −0.14 | (−0.14)² = 0.02
2 | Post | 18 | 13.33 − 13.47 = −0.14 | (−0.14)² = 0.02
2 | Six months | 16 | 13.33 − 13.47 = −0.14 | (−0.14)² = 0.02
3 | Pre | 9 | 14.67 − 13.47 = 1.20 | (1.20)² = 1.44
3 | Post | 20 | 14.67 − 13.47 = 1.20 | (1.20)² = 1.44
3 | Six months | 15 | 14.67 − 13.47 = 1.20 | (1.20)² = 1.44
4 | Pre | 4 | 11.67 − 13.47 = −1.80 | (−1.80)² = 3.24
4 | Post | 17 | 11.67 − 13.47 = −1.80 | (−1.80)² = 3.24
4 | Six months | 14 | 11.67 − 13.47 = −1.80 | (−1.80)² = 3.24
5 | Pre | 7 | 14.67 − 13.47 = 1.20 | (1.20)² = 1.44
5 | Post | 19 | 14.67 − 13.47 = 1.20 | (1.20)² = 1.44
5 | Six months | 18 | 14.67 − 13.47 = 1.20 | (1.20)² = 1.44

To get the subjects sum of squares, we add together all the squared deviation scores,
which sum to 19.08. So the subjects sum of squares is 19.08.
Finally, we can use the formula to determine the within sum of squares:

SSwithin = SStotal − SSbetween − SSsubjects = 385.72 − 352.90 − 19.08 = 13.74


Now we’re ready to complete the source table:

Source | SS | df | MS | F
Between | 352.90 | k − 1 = 3 − 1 = 2 | 352.90 / 2 = 176.45 | 176.45 / 1.72 = 102.59
Subjects | 19.08 | nsubjects − 1 = 5 − 1 = 4 | 19.08 / 4 = 4.77 | 4.77 / 1.72 = 2.77
Within | 13.74 | (dfbetween)(dfsubjects) = (2)(4) = 8 | 13.74 / 8 = 1.72 |
Total | 385.72 | ntotal − 1 = 15 − 1 = 14 | |
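
To verify this partition outside of jamovi, the sums of squares and both F ratios can be computed directly from the 5 × 3 data table; a minimal Python sketch (the array layout and variable names are ours):

import numpy as np

# Rows are participants; columns are pre-test, post-test, six-month follow-up
scores = np.array([[8.0, 17.0, 14.0],
                   [6.0, 18.0, 16.0],
                   [9.0, 20.0, 15.0],
                   [4.0, 17.0, 14.0],
                   [7.0, 19.0, 18.0]])

grand_mean = scores.mean()
ss_total = ((scores - grand_mean) ** 2).sum()
ss_between = (scores.shape[0] * (scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = (scores.shape[1] * (scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_within = ss_total - ss_between - ss_subjects

df_between, df_subjects = scores.shape[1] - 1, scores.shape[0] - 1
df_within = df_between * df_subjects
ms_between = ss_between / df_between
ms_subjects = ss_subjects / df_subjects
ms_within = ss_within / df_within

print(ss_between, ss_subjects, ss_within, ss_total)     # about 352.93, 19.07, 13.73, 385.73
print(ms_between / ms_within, ms_subjects / ms_within)  # about 102.80 and 2.78
# The hand calculation above rounds to two decimals at each step, which is why it
# arrives at slightly different F values (102.59 and 2.77).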

Using the F critical value table


Using the F critical value table, we can assess the statistical significance of both tests we’ve
calculated. For the between F ratio, we have 2 numerator and 8 denominator degrees of
freedom. So the F critical value is 4.46. Our calculated F value of 102.59 exceeds the crit-
ical value, so we reject the null hypothesis and conclude there was a significant difference
between the three time points in culturally responsive teaching. Our second test, sub-
jects, has 4 numerator and 8 denominator degrees of freedom. Using the critical value
table, we find that the critical value is 3.84. Our calculated value of 2.77 is less than the
critical value, so we fail to reject the null hypothesis and conclude there was no signifi-
cant difference between participants in culturally responsive teaching.
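
If the F table is not handy, the same critical values can be pulled from the F distribution in Python:

from scipy import stats

print(stats.f.ppf(0.95, 2, 8))  # critical value for the between test, about 4.46
print(stats.f.ppf(0.95, 4, 8))  # critical value for the subjects test, about 3.84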

Interpreting the test statistic


In practice, the only F ratio that researchers typically report on and interpret is the
between source of variance. Typically, we have no reason to suspect differences between
participants, and those differences would not be meaningful in our research design if
they did exist. If our question is whether the workshop was associated with improve-
ments in culturally responsive teaching practices, and whether those changes would be
sustained, the differences between participants are not of interest for the design.
However, there are cases where participants might be in groups of some kind, and
we might suspect that group membership interacts with the within-subjects variable in
some way. In cases like that, the within-subjects ANOVA will not be a sufficient analysis.
If we have a categorical independent variable (grouping variable) in addition to the with-
in-subjects variable, the design would need to account for both independent variables.
That is what the mixed design ANOVA will do, which we will learn in Chapter 16. For
this design, though, the subjects source of variance will rarely be of interpretive interest.
In fact, as we’ve mentioned earlier in this chapter, that source of variance won’t even be
calculated by jamovi. So our focus will be on the between source of variance.

EFFECT SIZE FOR THE WITHIN-SUBJECTS ANOVA


In prior chapters, we have introduced two effect size estimates: Cohen’s d and omega squared.
We’ve suggested that omega squared should normally be the preferred effect size estimate for
ANOVA and t-test designs. That remains the case. However, because of the way that jamovi
calculates the within-subjects ANOVA, it is not possible to calculate omega squared using
the jamovi output. Fortunately, jamovi will produce another effect size estimate for us in the
case of the within-subjects ANOVA, called partial eta squared (η2p, though it’s often reported
simply as η2). We will show how omega squared would be calculated in this design, and then
illustrate how eta squared is calculated and how that estimate differs from omega squared.

Calculating omega squared


For the within-subjects ANOVA, omega squared is calculated using the same formula
we’ve used for previous ANOVA designs. We’ll make the same substitutions in the for-
mula as we have in the past so that the variance names line up more closely to our source
table, but the formula is fundamentally the same:
ω² = (SSbetween − (dfbetween)(MSwithin)) / (SStotal + MSwithin)
From our source table, calculated earlier in this chapter, omega squared would be cal-
culated as:
ω² = (SSbetween − (dfbetween)(MSwithin)) / (SStotal + MSwithin) = (352.90 − (2)(1.72)) / (385.72 + 1.72) = (352.90 − 3.44) / 387.44 = 349.46 / 387.44 = .902

This would be interpreted in the same way as previous omega squared estimates.
About 90% of the variance in culturally responsive teaching practices was explained by
the difference between pre-test, post-test, and the six-month follow-up.
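
This calculation is easy to script from the source table values; a minimal sketch (the variable names are ours):

# Values from the source table computed above
ss_between, ss_total = 352.90, 385.72
df_between, ms_within = 2, 1.72

omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)
print(round(omega_sq, 3))  # about 0.902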

Eta squared
The problem with the formula for omega squared in this case is that jamovi will not pro-
duce the subjects or total sources of variance in the output. The denominator in omega
squared calls for the total sum of squares. However, jamovi does not print that sum of
squares in the output, and we cannot calculate it indirectly because it also does not pro-
duce the subjects sum of squares. As a result, we must use a different effect size estimate.
The estimate we will use in those cases is partial eta squared.
Eta squared has several advantages and disadvantages compared with omega squared.
Perhaps the biggest advantage is that jamovi will calculate partial eta squared for us in
this design. We will illustrate the process for calculating partial eta squared, but it is
not necessary to hand calculate this statistic. Another advantage of using eta squared in
this design is that it is interpreted essentially identically to omega squared. We will still
interpret this statistic as a proportion of variance explained. However, eta squared also
has disadvantages. These are very concisely summarized in Keppel and Wickens (2004).
One major disadvantage of eta squared is that it almost always overestimates the true
effect size. It does so because it does not account for sample size or subject variation in its
formula. As a result, eta squared is also an estimate for the sample, and makes no attempt
to estimate population effect size, unlike omega squared.
All of that being said, the formula for eta squared in this design is:
η²p = SSbetween / (SSbetween + SSwithin)
Notice this formula does not adjust the numerator for sample size or error, and the
denominator does not consider subjects or total variation. As a result, this will usually
produce a larger effect size estimate than would omega squared. In fact, the only cases
in which eta squared will not overestimate the effect size are those cases where it equals omega squared. It is important, when interpreting this statistic, to keep in mind that it is
usually an overestimate. In our example:
η²p = SSbetween / (SSbetween + SSwithin) = 352.90 / (352.90 + 13.74) = 352.90 / 366.64 = 0.963
So, this would be interpreted as indicating that about 96% of the variance in culturally
responsive teaching practices was explained by the difference between pre-test, post-
test, and six-month follow-up. This is a larger effect size estimate than we obtained with
omega squared. It has overestimated the proportion of variance explained by about 6%,
and partial eta squared will almost always overestimate in this way. Still, it is a usable
effect size estimate that most researchers will default to in within-subjects designs.
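
For comparison, the same source table values give the larger partial eta squared estimate; a small sketch:

# Values from the source table computed above
ss_between, ss_within = 352.90, 13.74

eta_sq_partial = ss_between / (ss_between + ss_within)
print(round(eta_sq_partial, 3))  # about 0.963, versus omega squared of about 0.902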

DETERMINING HOW WITHIN-SUBJECTS LEVELS DIFFER FROM ONE


ANOTHER AND INTERPRETING THE PATTERN OF DIFFERENCES
To determine which levels of the within-subjects variable differ from one another, we
will use tests called pairwise comparisons. These function very similarly to post-hoc
tests in a one-way ANOVA. They will compare each pair of levels on the within-subjects
variable. In our example, that would involve comparing pre-test to post-test, pre-test
to six-month follow-up, and post-test to six-month follow-up. Much like we did with
post-hoc tests, we will examine the pairwise comparisons to determine which pairs are
significantly different. We are not demonstrating the calculations for these comparisons,
and will rely on the jamovi software to produce them.

Comparison of available pairwise comparisons


There are several options available for the pairwise comparisons in jamovi, and other
options exist in other programs. By default, jamovi selects the Tukey correction, which
we discussed in prior chapters. It is one of the more liberal tests available in terms of error
rates, which means it is also one of the most powerful (or most likely to detect a signif-
icant difference). In jamovi, another option is “no correction” which is also sometimes
called Least Significant Differences or LSD. That is the most liberal test (as it has no error
correction at all), which also means it has the highest Type I error rate, and the highest
likelihood of detecting a difference. We discussed this tradeoff in prior chapters—tests
that are more powerful, meaning more likely to detect differences, also have higher Type
I error rates. The other popular options in jamovi are Scheffe and Bonferroni. The Scheffe
correction is the most conservative, with Bonferroni falling somewhere between Tukey
and Scheffe in error rate and power. Much like with post-hoc tests, the choice is up to the
individual researcher as to which test to choose, but there should be a rationale for that
selection. Perhaps more exploratory work or work with smaller samples might call for a
Bonferroni test, while more confirmatory work or work with a larger sample might call
for the Scheffe comparisons. Importantly, researchers make a decision on which test to
use and stick with that test—we cannot try out a few options and see which one we like
the best.

Interpreting the pattern of pairwise differences


We will interpret the pattern of pairwise differences much as we did with post-hoc tests
in the one-way ANOVA. We begin by determining which pairs are significantly different.
Then, using descriptive statistics or plots, determine the direction of those differences.
Finally, we will try to summarize and interpret the pattern of those differences. We will
illustrate this process later in this chapter with the jamovi output for our example.

COMPUTING THE WITHIN-SUBJECTS ANOVA IN JAMOVI


Starting with a blank file in jamovi, we will need to set up three variables: one for pre-
test, one for post-test, and one for the six-month follow-up. We can set up these variables
in the Data tab using the Setup menu, and then enter our data.

To assess for normality, we will use the same procedure as we did in the paired samples
t-test. We will copy all of the data across the three levels of the within-subjects variable
into a new variable, so that we can test the entire set of dependent variable data for nor-
mality together.

Then, under Analyses, we’ll click Exploration → Descriptives, and then select the new
variable and move it into the Variables column. We then select skewness and kurtosis
under Statistics, and uncheck the other options we won’t need.

We can then evaluate the output to determine if the data were normally distributed.

The data can be considered normally distributed if the absolute value of skewness is less
than two times the standard error of skewness, and if the absolute value of kurtosis is less
than two times the standard error of kurtosis. In this case, .595 is less than two times .580
(which would be 1.16), and 1.143 is less than two times 1.121 (which would be 2.242), so
the data were normally distributed.
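The same check can be sketched outside of jamovi. The snippet below is a rough illustration in Python with made-up scores and our own column names (pre, post, and followup); the standard-error formulas are the conventional ones, so the values may differ slightly from what jamovi prints.

    import numpy as np
    import pandas as pd
    from scipy.stats import skew, kurtosis

    df = pd.DataFrame({
        "pre": [7, 5, 8, 6, 8],            # made-up scores for illustration only
        "post": [18, 17, 20, 19, 17],
        "followup": [15, 14, 17, 16, 15],
    })

    # Stack all three levels of the within-subjects variable into one series.
    combined = pd.concat([df["pre"], df["post"], df["followup"]], ignore_index=True)
    n = len(combined)

    # Conventional standard errors of skewness and kurtosis.
    se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * np.sqrt((n**2 - 1) / ((n - 3) * (n + 5)))

    g1 = skew(combined, bias=False)
    g2 = kurtosis(combined, bias=False)    # excess kurtosis

    print("skewness ok:", abs(g1) < 2 * se_skew)
    print("kurtosis ok:", abs(g2) < 2 * se_kurt)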
Next, to produce the within-subjects ANOVA, we will go to Analyses → ANOVA →
Repeated Measures ANOVA.

This menu works a bit differently from those we’ve seen before. At the top, “RM Factor
1” is the default name jamovi gives to our independent variable. By clicking on that label,
we can type in a more descriptive name, perhaps in this case something like, “Time.” By
default, it has created two levels, but we can add as many as we need. We can leave them
as “Level 1,” Level 2,” and so on, or we can rename by clicking and typing a new label,
like, “Pre-test,” “Post-test,” any “6-month follow-up.” Notice that to label the third “level,”
simply click on the grey “Level 3.” If we had more than three levels to the independent
variable, we could continue adding them in this way. Next, click on the variables on the
left, and use the arrow key to move them to the appropriate “Repeated Measures Cells”
on the right (matching up the labels to the correct variables).

Moving down the menu, we can then select the partial eta squared (η2p) option to produce the effect size esti-
mate. Under Assumption Checks, we can click “Sphericity tests.” Note that if we were to
fail the assumption of sphericity, the checkbox to add the Greenhouse-Geisser correc-
tion is here as well. You may also notice a box that can be checked to produce Levene’s
test. This menu can be used for multiple different analyses, including those that have
a between-subjects variable as part of the design. Levene’s test only works when there
is a between-subjects independent variable, so that option is not useful for the current
design.

Next, under Post Hoc Tests, we can produce the pairwise comparisons. We’ll select our
within-subjects independent variable, here labelled “Time” and click the arrow button
to select it for analysis (moving it to the box on the right). We can then check the box
for which correction we’d like to use below. For now, we’ll leave Tukey selected.

Because this program is more general, meaning it can run many different repeated
measures analyses, it produces some output that we do not need or have not yet learned
how to interpret. We’ll mention these as we move through the output. We’ll start about
halfway down the output under Assumptions to check the assumption of sphericity.

We see that W = .525, p = .380. Because p > .050, we fail to reject the null. Remember that
with Mauchly’s test, the null hypothesis is that the pairwise error variances are equal. In
other words, the null is that sphericity was met. So here, we fail to reject the null, mean-
ing the data met the assumption of sphericity. As a result, there is no need for adding
the Greenhouse-Geisser correction. Next, we’ll look at the Repeated Measures ANOVA
output.

Here, the Between term is labelled “Time” and the Within or Error term is labelled
“Residual.” We see that F at 2 and 8 degrees of freedom is 102.796, and p is < .001. So,
there was a significant difference based on time (pre- versus post- versus 6-month fol-
low-up). We also see that partial eta squared is .963, which would mean that about 96%
of the variance in scores was explained by time. Note that this is an absurdly high effect
size estimate—they would typically be much smaller, but these are made up data to illus-
trate the analysis.
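As a cross-check outside of jamovi, the same style of analysis can be sketched in Python. This is only an illustration: the teacher, time, and score column names and the data values are our own inventions, AnovaRM comes from statsmodels, and the Mauchly-style check comes from pingouin's sphericity helper (whose exact output format may vary by version). If sphericity were violated, a corrected analysis could instead be obtained with, for example, pingouin's rm_anova and its correction option.

    import pandas as pd
    import pingouin as pg
    from statsmodels.stats.anova import AnovaRM

    wide = pd.DataFrame({
        "teacher": [1, 2, 3, 4, 5],
        "pre": [5, 7, 6, 8, 8],            # made-up values for illustration only
        "post": [17, 19, 18, 20, 17],
        "followup": [14, 16, 15, 17, 15],
    })
    long = wide.melt(id_vars="teacher", var_name="time", value_name="score")

    # Mauchly-style test of sphericity (null hypothesis: sphericity holds).
    print(pg.sphericity(long, dv="score", within="time", subject="teacher"))

    # Omnibus within-subjects (repeated measures) ANOVA.
    print(AnovaRM(data=long, depvar="score", subject="teacher", within=["time"]).fit())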
The next piece of output we’ll look at is “Post Hoc Tests.” Here are the results of the
Tukey pairwise comparisons.

We here see a significant difference between pre-test and post-test (p < .001), a signifi-
cant difference between pre-test and the six-month follow-up (p < .001), and a significant
difference between post-test and the six-month follow-up (p = .023). The jamovi output
also includes t and degrees of freedom for these comparisons. It would be appropriate to
include those in the results section, though it is less typical to see them there. One of the
reasons they may be less commonly reported for the follow-up analysis (and frequently,
only p is reported) is that other popular software packages like IBM SPSS produce only
the probability values for the pairwise comparisons. Looking at the descriptive statistics
(which we can produce as we’ve done in prior chapters in the Analyses → Descriptives
menu), we can see that scores were higher at post-test and at the six-month follow-up
than at pre-test. We also see that scores were higher at post-test than at the six-month
follow-up. So the pattern is that teachers scored higher in culturally responsive teaching
after the workshop and had a decline from post-test to the six-month follow-up. How-
ever, scores were still higher at six months than they were at pre-test.
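A rough version of the pairwise follow-ups can also be sketched in code, reusing the wide-format table from the previous sketch. Here a Bonferroni adjustment stands in for jamovi's Tukey correction, so the adjusted p-values would not match jamovi's exactly.

    from itertools import combinations
    from scipy.stats import ttest_rel
    from statsmodels.stats.multitest import multipletests

    pairs = list(combinations(["pre", "post", "followup"], 2))
    raw_p = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]

    # Adjust the three p-values for multiple comparisons (Bonferroni here).
    _, p_adj, _, _ = multipletests(raw_p, method="bonferroni")
    for (a, b), p in zip(pairs, p_adj):
        print(f"{a} vs. {b}: adjusted p = {p:.3f}")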

WRITING UP THE RESULTS


We will follow a similar format to prior chapters for writing up the results, with some
additions and modifications to suit the design:

1. What test did we use, and why?


2. If there were any issues with the statistical assumptions, report them.
3. What was the result of the omnibus test?
4. Report and interpret effect size (if the test was significant, otherwise report effect
size in step 3).
5. If the test was significant, what follow-up analysis is appropriate?
6. What are the results of the follow-up analysis?
7. What is the interpretation of the pattern of results?

For our example:

1. What test did we use, and why?


We used a within-subjects ANOVA to determine if teachers’ culturally responsive
teaching differed across pre-test, post-test, and a six-month follow-up.
2. If there were any issues with the statistical assumptions, report them.
We met all statistical assumptions (normality, sphericity) in this case. There may
be issues with sampling adequacy, which we normally discuss in the Limitations
portion of the Discussion section.
3. What was the result of the omnibus test?
There was a statistically significant difference in culturally responsive teaching
scores between the three measurements (F2, 8 = 102.796, p < .001).
4. Report and interpret effect size (if the test was significant, otherwise report effect
size in step 3).
About 96% of the variance in culturally responsive teaching practices was explained
by the difference between pre-test, post-test, and six-month follow-up (η2 = .963).
[Note: This would be an extremely large portion of variance. In applied research,
effect sizes are usually much smaller, but this is an invented example, so the num-
bers are very inflated.]
5. If the test was significant, what follow-up analysis is appropriate?
We used Bonferroni pairwise comparisons to determine how scores differed across
the three measurements.

6. What are the results of the follow-up analysis?


Pre-test scores were significantly different from post-test (p < .001) and six-month
follow-up scores (p = .004). Post-test scores were also significantly different from
the six-month follow-up scores (p = .040).
7. What is the interpretation of the pattern of results?
Teachers' use of culturally responsive teaching practices was higher after the work-
shop (M = 18.200, SD = 1.304) compared to pre-test scores (M = 6.800, SD =
1.924). At the six-month follow-up (M = 15.400, SD = 1.673), scores were lower
than immediately following the workshop but were still higher than before the
workshop.
Finally, we can pull all of this together for a Results section paragraph:

Results

We used a within-subjects ANOVA to determine if teachers’ culturally


responsive teaching differed across pre-test, post-test, and a six-month
follow-up. There was a statistically significant difference in culturally
responsive teaching scores between the three measurements (F2, 8
= 102.796, p < .001). About 96% of the variance in culturally respon-
sive teaching practices was explained by the difference between pre-
test, post-test, and six-month follow-up (η2 = .963). We used Bonferroni
pairwise comparisons to determine how scores differed across the three
measurements. Pre-test scores were significantly different from post-
test (p < .001) and six-month follow-up scores (p = .004). Post-test
scores were also significantly different from the six-month follow-up
scores (p = .040). Teachers’ use of culturally responsive teaching prac-
tices was higher after the workshop (M = 18.200, SD = 1.304) compared
to pre-test scores (M = 6.800, SD = 1.924). At the six-month follow-up
(M = 15.400, SD = 1.673), scores were lower than immediately follow-
ing the workshop but were still higher than before the workshop.

In the next chapter, we will explore some examples from published research using the
within-subjects ANOVA.
15
Within-subjects ANOVA case studies

Case study 1: mindfulness and psychological distress 231


Research questions 232
Hypotheses 232
Variables being measured 232
Conducting the analysis 232
Write-up 233
Case study 2: peer mentors in introductory courses 234
Research questions 234
Hypotheses 234
Variables being measured 235
Conducting the analysis 235
Write-up 236
Notes 237

In the previous chapter, we explored the within-subjects ANOVA using a made-up exam-
ple and some fabricated data. In this chapter, we will present several examples of published
research that used the within-subjects ANOVA. For each sample, we encourage you to:

1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the within-subjects ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.

CASE STUDY 1: MINDFULNESS AND PSYCHOLOGICAL DISTRESS


Felver, J. C., Morton, M. L., & Clawson, A. J. (2018). Mindfulness-based stress reduction
reduces psychological distress in college students. College Student Journal, 52(3), 291–298.
In this article, the researchers report on their work around mindfulness training
for college students. Specifically, they were interested in testing whether psychological
outcomes would improve after a mindfulness-based stress reduction program. They
followed students across a pre-test, a post-test after eight weeks of the program, and a


follow-up eight weeks after the post-test. The follow-up test is important to this design
because it allowed the researchers to assess whether any differences found at post-test
were maintained after the program ended.

Research questions
The authors asked several research questions; in this case study, however, we will focus on
one: Was psychological distress, as measured by the global severity index, significantly
different at post-test and the follow-up than it was before the mindfulness program?

Hypotheses
The authors hypothesized the following related to the global severity index:

H0: There was no significant difference in global severity index scores between the
pre-test, post-test, and follow-up. (Mpre = Mpost = Mfollow-up)
H1: There was a significant difference in global severity index scores between the pre-
test, post-test, and follow-up. (Mpre ≠ Mpost ≠ Mfollow-up)

Notice that, although the authors theorized that scores would improve at post-test and
at the follow-up, the formal hypotheses do not specify a direction. The ANOVA design
doesn’t allow any specification of directionality in the omnibus test.

Variables being measured


The authors measured the global severity index as an indicator of psychological dis-
tress. This score came from the Brief Symptom Inventory, which is an 18-item scale that
yields several subscale scores, including the global severity index. The authors suggest
that prior researchers have reported validity evidence. The authors also did not report
score reliability, arguing that it is a poor measure in small samples. However, it would be
best practice to offer additional information on existing validity evidence and to report
reliability coefficients from the present sample.

Conducting the analysis


1. What test did they use, and why?
The authors used the within-subjects ANOVA to determine if global severity index
scores would significantly differ across pre-test, post-test, and follow-up tests.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variable is measured using averaged Likert-type data, so is
interval.
b. Normality of the dependent variable. The authors do not discuss normality in
their manuscript. That is fairly typical for many journals if the assumption of
normality was met. In practice, we would test for normality prior to using the
analysis, though it will often go unreported if the data were normally distributed.
c. Observations are independent
The authors did not note any issues of dependence in the data or any nesting
factors.

d. Random sampling and assignment


This is a convenience sample—all participants were from a single private univer-
sity. It also involved students who voluntarily signed up for mindfulness training,
which may create further sampling bias. The participants were not randomly as-
signed to order of administration (no counterbalancing) because the design was
longitudinal.
e. Sphericity
The assumption of sphericity was not met (W2 = .251, p < .001), so we applied
the Greenhouse-Geisser correction.
3. What was the result of that test?
There was a significant difference in the global severity index scores across the
three tests (F1.143, 14.863 = 8.295, p =.010).1
4. What was the effect size, and how is it interpreted?
About 39% of the variance in global severity index scores was explained by the
change between pre-test, post-test, and follow-up test (η2 = .390).2
5. What is the appropriate follow-up analysis?
To determine how scores varied across the pre-test, post-test, and follow-up test,
we used Bonferroni pairwise comparisons.
6. What is the result of the follow-up analysis?
There was no significant difference between pre-test and post-test scores
(p = .368), but there was a significant difference between pre-test and follow-up
scores (p = .011) and between post-test and follow-up scores (p < .001).3
7. What is the pattern of group differences?
Global severity index scores were significantly lower at the follow-up (M = 8.500,
SD = 5.910) than they were at either pre-test (M = 11.930, SD = 7.790) or post-test
(M = 11.930, SD = 7.790). This may suggest that mindfulness is associated with
longer-term reductions in psychological distress.

Write-up

Results

We used the within-subjects ANOVA to determine if global severity index


scores would significantly differ across pre-test, post-test, and follow-up
tests. The assumption of sphericity was not met (W2 = .251, p < .001), so we
applied the Greenhouse-Geisser correction. There was a significant differ-
ence in global severity index scores across the three tests (F1.143, 14.863 =
8.295, p = .010). About 39% of the variance in global severity index scores
was explained by the change between pre-test, post-test, and follow-up test
(η2 = .390). There was no significant difference between pre-test and post-test
scores (p = .368), but there was a significant difference between pre-test and
follow-up scores (p = .011) and between post-test and follow-up scores
(p < .001). Global severity index scores were significantly lower at the follow-up
(M = 8.500, SD = 5.910) than they were at either pre-test (M = 11.930, SD =
7.790) or post-test (M = 11.930, SD = 7.790). This may suggest that mindful-
ness is associated with longer-term reductions in psychological distress.

CASE STUDY 2: PEER MENTORS IN INTRODUCTORY COURSES


Asgari, S., & Carter, F. (2016). Peer mentors can improve academic performance: A qua-
si-experimental study of peer mentorship in introductory courses. Teaching of Psychol-
ogy, 43(2), 131–135. https://doi.org/10.1177/0098628316636288.
The authors of this article were interested in how peer mentoring could improve aca-
demic performance in introductory psychology courses. They measured students who
were in introductory courses across four exams, using the exam scores as dependent var-
iables. In the larger study, they evaluate whether there were differences in performance
between students in peer mentoring programs and students not in such programs. How-
ever, this case study will focus on their preliminary analysis, in which they tested for dif-
ferences in performance across the four exams, which were given at four different points
in the course.

Research questions
In the larger study, the researchers have several research questions. In this case study,
we focus on one: Were there differences in performance across the four exams in these
introductory psychology courses?

Hypotheses
The authors hypothesized the following related to exam scores:
H0: There was no significant difference in exam scores across the four exams.
(M1 = M2 = M3 = M4)
H1: There was a significant difference in exam scores across the four exams.
(M1 ≠ M2 ≠ M3 ≠ M4)
In the full article, they conduct additional analyses of exam scores, but here we focus on
their use of the within-subjects ANOVA.

Variables being measured


The researchers measured exam scores as a way of assessing academic performance. The
exam scores were administered via multiple-choice tests comprised of 25 items, making
25 the highest possible score. The authors did not offer any evidence related to reliability
or validity for the exam scores. It would be ideal to include that information. In this case,
the authors may not have included it because the scores were classroom tests, for which
obtaining psychometric data may have been difficult.

Conducting the analysis


1. What test did they use, and why?
The authors used a repeated measures ANOVA (another name for the within-
subjects ANOVA) to determine if exam scores significantly varied across the four
exams.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variable was exam score, which is ratio. It has a possible score of
zero, which would indicate missing all test items.
b. Normality of the dependent variable
The authors did not report information on normality. Most exam scores tend
to be negatively skewed and leptokurtic, but the authors either did not test nor-
mality or they found the data were normally distributed, so did not report nor-
mality in the manuscript.
c. Observations are independent
The authors noted no factors that might create dependence. However, there
are potential nesting factors, including things like college classification, college
major, or time of day for each class section, or even the class sections.
d. Random sampling and assignment
Participants were not randomly sampled and appear to be a convenience sam-
ple of psychology students from a single university. There was also no random
assignment to order of administration. Counterbalancing was not possible
because the design was longitudinal.
e. Sphericity
The assumption of sphericity was not met (W5 = .001, p < .001), so we applied
the Greenhouse-Geisser correction. Note that in the published article, the
authors do not note any violation of sphericity. It could be that either: (1) they
did not test for sphericity or did not apply a correction for lack of sphericity; or
(2) their data passed Mauchly’s test, as our data are simulated to approximate
their results but may be different in this regard. It is more likely that the authors
did not assess this assumption, particularly given that they are treating the with-
in-subjects ANOVA as a preliminary analysis.
3. What was the result of that test?
There was a significant difference in scores across the four exams (F1.028, 37.003 =
29.452, p < .001).4

4. What was the effect size, and how is it interpreted?


The differences among the four exams accounted for about 45% of the variance in
exam scores (η2 = .450).
5. What is the appropriate follow-up analysis?
To determine how scores differed across the four exams, we used Bonferroni pair-
wise comparisons.
6. What is the result of the follow-up analysis?
Compared to exam one, there was no significant difference in exam two (p = .051), but
there was a significant difference in exam three (p < .001) and exam four (p < .001).
Compared to exam two, there was no significant difference in exam three (p = .750),
but there was a significant difference in exam four (p < .001). Finally, compared to
exam three, there was a significant difference in exam four (p < .001).
7. What is the pattern of group differences?
Students scored higher on the fourth exam compared to any of the other three ex-
ams. Scores were also lower on the third exam compared to the first exam.

Write-up

Results

We used a repeated measures ANOVA (another name for the within-subjects


ANOVA) to determine if exam scores significantly varied across the four
exams. The assumption of sphericity was not met (W5 = .001, p < .001), so
we applied the Greenhouse-Geisser correction. There was a significant dif-
ference in scores across the four exams (F1.028, 37.003 = 29.452, p < .001). The
differences among the four exams accounted for about 45% of the variance
in exam scores (η2 = .450). To determine how scores differed across the four
exams, we used Bonferroni pairwise comparisons. Compared to exam one,
there was no significant difference in exam two (p = .051), but there was a
significant difference in exam three (p < .001) and exam four (p < .001).
Compared to exam two, there was no significant difference in exam three (p
= .750), but there was a significant difference in exam four (p < .001). Finally,
compared to exam three, there was a significant difference in exam four (p <
.001). See Table 15.1 for descriptive statistics. Students scored higher on the
fourth exam compared to any of the other three exams. Scores were also
lower on the third exam compared to the first exam.

Table 15.1
Descriptive Statistics for Exam Scores

Exam M SD

One 21.700 2.222
Two 19.378 2.909
Three 20.838 2.764
Four 25.514 2.244

To add the table, we would start on a new page after the references page, and insert one
table per page.
Finally, we encourage you to compare these example results sections to the published
papers. The statistics will not match exactly due to the way we simulate data for the online
course practice data sets. But how did the authors present the results? Why might they
have presented results differently than our standard layout would have presented them?
In the next chapter, we will introduce the mixed ANOVA. This design will add another
layer to our analysis, allowing us to include one between-subjects variable and one with-
in-subjects variable.
For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.

Notes
1 This value will not match the published results exactly. Because of the process used to simu-
late data for the online course page, it will not exactly match for within-subjects designs. The
authors’ published results are not in question here, but our simulated outcomes are slightly
different.
2 Again note that this value will not match published values exactly. That is an artifact of the
way that we have simulated data for the online course resources to allow students to practice
the analysis, not a commentary on the published results.
3 This, again, will not match the published results exactly due to the simulation process for the
practice data in the online resources. The authors report a “marginally significant” differ-
ence from pre-test to post-test. That is a way that some researchers will describe differences
between p = .05 and .10. However, we discourage the use of this criterion, as significance is
already a fairly low bar for determining differences, and raising the threshold to allow discussion of nonsig-
nificant differences will create unacceptable inflation of Type I error.
4 Remember that our calculated values will differ from the published values as an artifact of the
way we simulated data for the online course practice datasets.
16
Mixed between- and within-subjects
designs using the mixed ANOVA

Introducing the mixed ANOVA 239


Research design and the mixed ANOVA 240
Interactions in the mixed ANOVA 240
Assumptions of the mixed ANOVA 241
Level of measurement for the dependent variable is interval or ratio 241
Normality of the dependent variable 241
Observations are independent 242
Random sampling and assignment 242
Homogeneity of variance 243
Sphericity 243
Calculating the test statistic F 243
Effect size in the mixed ANOVA using ETA squared 244
Computing the mixed ANOVA in jamovi 244
Determining how cells differ from one another and interpreting
the pattern of cell difference 249
Post-hoc tests for significant interactions 249
Writing up the results 251
Notes 253

In this book so far, we have explored between-subjects designs, including the independ-
ent samples t-test, the one-way ANOVA, and the factorial ANOVA. We also discussed
within-subjects designs, including the paired samples t-test and the within-subjects
ANOVA. Now we will put these together in a design that has one between-subjects and
one within-subjects variable.

INTRODUCING THE MIXED ANOVA


The mixed ANOVA is called “mixed” because it involves a mixture of between-subjects
and within-subjects variables (specifically, one of each). That is, participants will be in
groups on a between-subjects independent variable (for example, an experimental con-
dition and a control condition). They will also have repeated measures data (for example,


a pre-test and a post-test). The mixed ANOVA will allow us to test for an interaction of
the between-subjects and the within-subjects variable. In simple terms, we might want
to know if the change from pre-test to post-test is different between the two groups. The
mixed ANOVA will provide a means to test such a question.

RESEARCH DESIGN AND THE MIXED ANOVA


The mixed ANOVA can be a very powerful and useful analysis in certain research design
situations. For example, it will let us test if an experimental condition results in big-
ger gains from pre- to post-test than a control condition. That scenario is one of the
most common for the mixed ANOVA and is one we will return to throughout this
chapter. However, the mixed ANOVA can be used in any situation where there is one
between-subjects independent variable and one within-subjects independent variable.
Those independent variables could have two or more levels. So, although the most com-
mon mixed design is a 2x2 design (two groups measured two times), other designs are
certainly possible.
Taking an example for this most common design, imagine we are interested in the
efficacy of motivational interviewing (a technique for improving motivation toward
particular tasks) for improving college students’ health outcomes in the first year
of college. The first year of college would be selected because there is evidence that
health markers, especially body fat percentages, might be worse after the first year
of college. We might randomly assign students to receive either motivational inter-
viewing or a more traditional health education curriculum. We could then test their
body fat percentage at pre-test and again at post-test to determine if there are any
changes. In this design, we have one between-subjects variable (motivational inter-
viewing vs. traditional curriculum) and one within-subjects variable (pre-test vs.
post-test).

Interactions in the mixed ANOVA


Much like the factorial ANOVA, in the mixed ANOVA, our primary analytic interest
and focus are on interactions. In the mixed ANOVA, the potential interaction would be
between the between-subjects and within-subjects variable. The interaction, like with
the factorial ANOVA, can be ordinal (no pattern reversal, so if we graph the means, the
lines will not cross) or disordinal (a pattern reversal, so the lines will cross).
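As a rough companion to the ordinal and disordinal figures in this chapter, the following matplotlib sketch draws both shapes using made-up group means, purely to show lines that do not cross (ordinal) versus lines that cross (disordinal).

    import matplotlib.pyplot as plt

    times = ["Time 1", "Time 2"]
    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)

    # Ordinal: the group lines do not cross.
    axes[0].plot(times, [6, 5], marker="o", label="Group 1")
    axes[0].plot(times, [3, 3], marker="o", label="Group 2")
    axes[0].set_title("Ordinal")

    # Disordinal: the group lines cross (a pattern reversal).
    axes[1].plot(times, [6, 2], marker="o", label="Group 1")
    axes[1].plot(times, [3, 5], marker="o", label="Group 2")
    axes[1].set_title("Disordinal")

    for ax in axes:
        ax.legend()
    plt.tight_layout()
    plt.show()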
However, if the between-subjects variable is randomly assigned (is an experimental
variable) as in our example with motivational interviewing, we generally hope to find
no difference at pre-test, but a significant difference at post-test. The reason it would
be important, in an experimental design, to see no difference at pre-test is that a pre-
test difference would suggest the random assignment failed. In other words, if we truly
randomly assign people to groups, the mean difference between the groups before the
experimental intervention should be approximately zero. The groups should not be different before the study begins. In non-ex-
perimental designs, this is less of an issue. For example, if we were studying experiences
of bullying in schools, we might expect to find a pre-test difference in experiences of
being bullied between LGBTQ and cisgender and heterosexual students, as LGBTQ stu-
dents report higher levels of bullying throughout the school. We would then hope to see
a reduction in those levels, perhaps for both groups, but likely more strongly for LGBTQ
students, after an anti-bullying education program.

[Figure: an ordinal interaction. Group 1 and Group 2 means are plotted across Time 1 and Time 2 (y-axis from 1 to 7); the lines do not cross.]

ASSUMPTIONS OF THE MIXED ANOVA


There are no new assumptions in the mixed ANOVA that we have not already encoun-
tered in previous designs. However, there are some new combinations of assumptions
that apply specifically because we have both a between-subjects and within-subjects var-
iable. However, we will briefly review each assumption here with emphasis on how it
applies to this design.

Level of measurement for the dependent variable is interval or ratio


Like all of the previous designs, this design requires a dependent variable that is meas-
ured at the interval or ratio level. The dependent variable, then, must be continuous in
nature (not categorical). This assumption is no different than in any previous design.

Normality of the dependent variable


The dependent variable must also be normally distributed. Because this design involves
a within-subjects variable, it has the same special requirement as other within-subjects
designs. This assumption deals with the entire distribution of the dependent variable, so
we must first combine the dependent variable across levels of the within-subjects varia-
ble before testing for normality. We did this in the paired samples t-test and within-sub-
jects ANOVA. The process is the same in this design.

[Figure: a disordinal interaction. Group 1 and Group 2 means are plotted across Time 1 and Time 2 (y-axis from 1 to 7); the lines cross.]

Observations are independent


We have previously discussed the assumption of independence of observations as it
applies to within-subjects designs and between-subjects designs. In this case, we have a
mixed design, so everything from both kinds of designs applies. That means we are first
concerned with independence of observations among the groups, as we were in other
between-subjects designs. As with previous designs, the most common issue affecting
independence is nesting (students nested within teachers, professors nested within a col-
lege). But we are also concerned with independence among the levels of the within-sub-
jects variables. A common issue here is order effects or practice effects. It might make
a difference in which order we administered the within-subjects tests. In longitudinal
designs (like a pre-test post-test design), it is impossible to control for order or practice
effects. But in other designs, it may be possible to counterbalance the order of adminis-
tration to control for those effects.

Random sampling and assignment


This assumption is no different in the mixed design than in previous designs. The design
assumes random sampling from the population. As we have discussed in prior chapters,
this assumption is essentially never met, and our question will be about how adequate
the sampling was. If generalizability is a goal, then the sampling strategy is important
because the more biased the sample was, the less generalizable the results will be. The
design also assumes random assignment to groups on the between-subjects variable.
That assumption matters if the goal is causal inference. Experimental design (random
assignment to groups) is not sufficient for causal inference, but it does satisfy one of the
requirements of causal inference. Namely, experimental design helps to isolate the effect
of the independent variable from other potential causal factors. To establish a causal
claim, we would also need to establish a rationale for the causal claim (why would this
independent variable cause this dependent variable?), demonstrate a reliable relation-
ship between the variables (which usually means multiple samples and studies showing
a consistent result), and the causal factor would need to precede the outcome in time
(e.g., the cause has to come before the effect, which usually means longitudinal design).
If the goal is not causal inference, the assumption of random assignment to groups is less
important. As we have discussed in prior chapters, many variables we are interested in
researching cannot practically, legally, or ethically be randomly assigned.

Homogeneity of variance
This design involves a between-subjects variable, so the assumption of homogeneity of
variance applies. It works a bit differently, though, because of the presence of the with-
in-subjects variable. We will produce a Levene’s test for each level of the within-subjects
variable. So, if there are a pre-test and a post-test, we will test for equality of variance
between the groups among pre-test data, and then a second test for equality of variance
between the groups among post-test data. Because of this, jamovi will print multiple Lev-
ene’s test results in this design. For the assumption to be considered met, all of the Levene’s
tests should be passed. Remember that, for Levene’s test, the null hypothesis is that the
variances are equal (homogeneity of variance), so when p > .05, the assumption is met.
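A minimal sketch of these per-level tests in Python follows, using scipy's levene function with made-up data and our own column names (group, pre, post). We use mean centering here; jamovi's implementation may differ in such details, so treat this only as an approximation.

    import pandas as pd
    from scipy.stats import levene

    df = pd.DataFrame({
        "group": ["A"] * 5 + ["B"] * 5,    # made-up data for illustration only
        "pre":  [6, 4, 7, 5, 6, 3, 4, 2, 3, 2],
        "post": [5, 3, 6, 4, 4, 3, 3, 4, 2, 1],
    })

    # One Levene's test per level of the within-subjects variable.
    for col in ("pre", "post"):
        a = df.loc[df["group"] == "A", col]
        b = df.loc[df["group"] == "B", col]
        stat, p = levene(a, b, center="mean")
        print(f"Levene's test for {col}: F = {stat:.3f}, p = {p:.3f}")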

Sphericity
Because there is a within-subjects variable, the assumption of sphericity may also apply.
Remember that the assumption did not apply1 in paired samples t-test because that
design involves only two levels on the within-subjects variable. Sphericity is the assump-
tion that the pairwise error variances are equal, so when there are only two levels to the
within-subjects variable, there is only one pair, so the pairwise error variance cannot be
compared. This is an important thing to remember in this design, because jamovi pro-
duces Mauchly’s test only if we ask it to (by checking the box under Assumption Checks).
If there are only two levels of the within-subjects independent variable, the assumption
does not apply, so there is no need for Mauchly’s test. But if there are more than two lev-
els of the within-subjects variable, then we must produce Mauchly’s test, and the result
of Mauchly’s test needs to be nonsignificant (because the null hypothesis for Mauchly’s
test is sphericity or the equality of the pairwise error variances).

CALCULATING THE TEST STATISTIC F


For the mixed ANOVA, it is our opinion that demonstrating the hand calculations might
be overly complex and benefit students very little in conceptually understanding the
design. As a result, we will focus for this final design of the textbook on calculating and
understanding the test in the software only. As with the within-subjects ANOVA, part
of the issue is that jamovi does not have all of the sources of variance we would hand
calculate. However, there will be three effects of interest:

1. The interaction effect. The mixed ANOVA will produce an interaction term,
which, in this case, is the interaction of the between-subjects and within-subjects
independent variables. If there is a significant interaction, the entire focus of our
interpretation will be on the interaction. In other words, if there is a significant
interaction, we would ignore the other two effects in most cases.
2. The between-subjects effect. This test will also produce a test for between-subjects
effects. That is the main effect of the between-subjects independent variable. In other
words, this tests whether there was a difference between groups, disregarding the
within-subjects variable. If the interaction is not significant, but the between-sub-
jects effect is significant, then we would interpret any group differences. If there
are only two groups, then the interpretation will be based on the means of each
group. When there are only two groups, this effect becomes conceptually the same
as an independent samples t-test—no follow-up analysis is needed. But if there is no
interaction, a significant between-subjects difference, and more than two groups,
then it is appropriate to use a post-hoc test. In that situation, the between-subjects
effect is conceptually the same as a one-way ANOVA, and any post-hoc tests will be
interpreted just like they would be in a one-way ANOVA design.
3. The within-subjects effect. The mixed ANOVA also produces a test of within-sub-
jects differences, disregarding the between-subjects variable. If the interaction is not
significant, we would interpret this effect, in addition to the between-subjects effect.
If there is a significant within-subjects difference, and there are only two groups,
this is interpreted like a paired samples t-test. No follow-up analysis would be nec-
essary—simply interpret the mean difference. However, if it is significant and there
are more than two groups, then the test is conceptually the same as a within-sub-
jects ANOVA. So, the appropriate follow-up would be to use pairwise comparisons.

In the remainder of this chapter, we focus on how to interpret a significant interaction.


However, note that above we have briefly explained how to conduct follow-up analysis
if the interaction is not significant. First, we would interpret the within-subjects and
between-subjects effects. Then, for any significant effects, we would engage the proper
follow-up analysis. Note that the jamovi menus for the mixed ANOVA do include options
for post-hoc tests. Those are only needed if the interaction is not significant.

EFFECT SIZE IN THE MIXED ANOVA USING ETA SQUARED


As with the within-subjects ANOVA, jamovi does not produce enough information for us
to calculate omega squared, so we will again default to partial eta squared. It is still inter-
preted as a percent of variance explained by the effect, and jamovi will produce partial eta
squared for all three of the effects we described above. As we noted in Chapter 14, partial eta
squared tends to overestimate effect size, so it should be interpreted somewhat cautiously.
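For reference, partial eta squared is the effect's sum of squares divided by that sum of squares plus its error sum of squares. A tiny sketch with placeholder numbers:

    def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
        # partial eta squared = SS_effect / (SS_effect + SS_error for that effect)
        return ss_effect / (ss_effect + ss_error)

    print(round(partial_eta_squared(ss_effect=12.5, ss_error=40.0), 3))  # 0.238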

COMPUTING THE MIXED ANOVA IN JAMOVI


Returning to our example earlier about anti-bullying programs in a school, we will illus-
trate how to conduct the mixed ANOVA in jamovi. First, imagine we have the following
set of scores for our example:

Group    Pre-Intervention Experiences of Bullying    Post-Intervention Experiences of Bullying

LGBTQ 6 5
LGBTQ 5 4
LGBTQ 7 6
LGBTQ 5 3
LGBTQ 6 4
Cisgender/Heterosexual 3 3
Cisgender/Heterosexual 4 3
Cisgender/Heterosexual 3 4
Cisgender/Heterosexual 2 3
Cisgender/Heterosexual 2 1

In reality, for this design, like the others covered in this text, the ideal sample size is at
least 30 per group. This design is a 2x2 mixed ANOVA because there are two within-sub-
jects levels (pre- and post-) and two groups (LGBTQ and cisgender/heterosexual), and
we would want at least 60 participants.
To begin, we’ll set up the data file. In the Data tab, using the Setup button, we can
specify variable names. We’ll need three variables—one for pre-test data, one for post-
test data, and one for group membership. We can enter our data, and then using the
Setup button for our grouping variable, label the two groups.
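For readers who want to reproduce this analysis outside of jamovi, the following is a minimal Python sketch using pingouin's mixed_anova helper and the scores from the table above (the student identifier and the column names are our own additions). Its interaction line should closely match the jamovi results reported later in this chapter.

    import pandas as pd
    import pingouin as pg

    wide = pd.DataFrame({
        "student": range(1, 11),
        "group": ["LGBTQ"] * 5 + ["Cisgender/Heterosexual"] * 5,
        "pre":  [6, 5, 7, 5, 6, 3, 4, 3, 2, 2],
        "post": [5, 4, 6, 3, 4, 3, 3, 4, 3, 1],
    })
    long = wide.melt(id_vars=["student", "group"], var_name="time", value_name="bullying")

    # One between-subjects factor (group) and one within-subjects factor (time).
    aov = pg.mixed_anova(data=long, dv="bullying", within="time",
                         between="group", subject="student")
    print(aov.round(3))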

Before running the primary analysis, we would need to evaluate the assumption of nor-
mality, using the same process we described in Chapter 14. Next, to run the analysis, we
will go to the Analyses tab, then ANOVA, then Repeated Measures ANOVA. Notice this
is the same menu we used to produce the within-subjects ANOVA in Chapter 14. The
resulting menu is the same as it was in Chapter 14. We can name the repeated measures
factor, which by default is labelled "RM Factor 1." Here we have changed that to Time.
Next, we can name the levels of the within-subjects variable, which here are Pre-test and
Post-test. Then we can specify in the “Repeated Measures Cells” which variables corre-
spond to those levels (here, Pre- and Post-). Finally, in the only step that differs from what
we did in Chapter 14, we move Group to the “Between Subject Factors” box.

Next, under Effect Size, we will select partial eta squared. Under “Assumption Checks”
we will select the “Equality of variances test (Levene’s)”. In this case, we do not need to
select the sphericity test (which would produce Mauchly’s test) because there are only
two levels of the within-subjects variable. But the option to produce it is here, as are the
corrections if we failed that assumption.

At this point, we can pause to look at the output before deciding on follow-up procedures.
If the interaction is not significant, we would use the follow-up procedures described in
Chapter 14 if the main effect of the within-subjects variable was significant, or those
from Chapter 6 or 8 if the main effect of the between-subjects variable was significant
(depending on how many groups that variable had). First, though, we should evaluate
the assumption of homogeneity of variance.

Here, we see that the assumption is met for both pre-test and post-test data because p
> .050 in both cases. As we mentioned in previous chapters, the odd notation for the
F ratio associated with Levene’s test for the pre-test data is scientific notation, which is
equivalent to 1.285 × 10⁻²⁹ or .00000000000000000000000000001285. As we previously
noted, we would likely report this as F1, 8 < .001, p > .999. However, in most cases, because
the assumption is met, we would not need to report this value in the manuscript. For
post-test, p = .713 which is also > .050, so the assumption was met. So, we next turn to
the ANOVA results.

We first interpret the interaction, here shown as Time * Group. There was a significant
interaction (F1, 8 = 7.538, p = .025). About 49% of the variance in experiences of bullying
was explained by the combination of time (pre-test vs. post-test) and group (LGBTQ vs.
cisgender/heterosexual; η2 = .485). Because the interaction was significant, we would not
interpret either of the main effects, except for in exceptional circumstances where the
research questions require us to do so. Just like with the factorial ANOVA, in the mixed
ANOVA if there is a significant interaction, we focus all of our interpretation on the
interaction. The presence of an interaction means that neither independent variable can
be adequately understood in isolation.
However, to make sure it is clear how to find and interpret the two main effects, we
will briefly describe them here. Again, this would not really be done in this case because
the interaction was significant. However, the main effect of the within-subjects varia-
ble is, in this case, on the line marked Time. So, there was also a significant difference
between pre- and post-test scores (F1, 8 = 7.538, p = .025, η2 = .485).2 The main effect
for the between-subjects variable is in the next table down, marked Between Subjects
Effects, on the row labelled Group. From that, we can determine that there was a sig-
nificant difference between LGBTQ and cisgender/heterosexual students’ scores (F1, 8 =
16.277, p = .004, η2 = .485). Again, because the interaction was significant, we would
not interpret the main effects in this example but wanted to briefly demonstrate where
to find them. The follow-up procedures for a significant main effect when there was no
significant interaction are discussed earlier in this chapter.
Going back to the analyses options (which you can return to simply by clicking any-
where in the output for the analysis you want to change), there is an additional option
we will use to produce a plot of the group means, which helps us determine the type of
interaction. To do so, under Estimated Marginal Means, we will drag Time and Group
into the space for Term 1. By default, the box for Marginal means plots will already
be checked. Also, by default, under Plot, the option for Error bars is set to Confidence
Interval. That may work well, or we might want to change it to None for a somewhat
cleaner-looking plot.

The plot it produces will look like the figure below.



[Figure: jamovi estimated marginal means plot for Time * Group. The x-axis shows Pre-test and Post-test, the y-axis shows the dependent variable, with separate lines for the LGBTQ and Cisgender/Heterosexual groups.]

From this plot we can determine that the interaction was ordinal, as discussed in Chap-
ter 10, because the lines do not cross. Next, we will need a follow-up analysis to deter-
mine how the cells differ from one another.

DETERMINING HOW CELLS DIFFER FROM ONE ANOTHER AND INTERPRETING THE PATTERN OF CELL DIFFERENCE

To determine how the cells differ, we will need a follow-up analysis. That analysis will
help us test differences in the data for significance, rather than simply visually inspect-
ing the profile plot to determine what appears to be going on in the data. The ideal fol-
low-up analysis would be the simple effects analysis, which we also used in the factorial
ANOVA. One issue is that jamovi does not produce that analysis in the mixed ANOVA
design. It can be produced in other software packages, such as SPSS (Strunk & Mwavita,
2020). It can also be approximated by using jamovi’s post-hoc tests (which compare all
cells to all other cells) selectively.

Post-hoc tests for significant interactions


In jamovi, the best available option for following up on the significant interaction is to do
post-hoc tests. These are available under the Post Hoc Tests heading. Select the interac-
tion (here Time * Group) and move it to the box on the right to select it for analysis. By
default, jamovi uses the Tukey correction, but has others (including LSD/No correction,
Scheffe, and Bonferroni) available. We have discussed their relative error rates in previ-
ous chapters. For now, we will leave Tukey selected (Figure 16.10).

The resulting output looks like the table below.

This output includes a test of each cell versus every other cell. This differs from the simple
effects analysis, which would have produced a comparison of LGBTQ vs. cisgender/het-
erosexual participants at pre-test and again at post-test (see Strunk & Mwavita, 2020 for
further discussion of that analysis and how it is produced in other software packages).
But, if we want to do the same thing, all the information is here. Pre-test LGBTQ versus
pre-test cisgender/heterosexual is the first line of the output, and we see a significant
difference (p = .003). Post-test LGBTQ versus post-test cisgender/heterosexual is the last
line, and that comparison was not significant (p = .104). From those two test statistics
combined with the plot, we can conclude that LGBTQ students experienced significantly
more bullying than cisgender/heterosexual students at pre-test; after the intervention,
however, there was no significant difference between the two groups. That would tend to
suggest the intervention was effective.

Using jamovi’s output lets us make other comparisons as well. For example, we could
also test whether bullying experiences changed from pre- to post-test for LGBTQ students,
which is found on the second line of output. We see a significant difference (p = .019).
We could also ask if there were changes in bullying experiences from pre- to post-test for
cisgender/heterosexual students. That is found on the fifth line of output (second to last)
and we see no significant difference (p > .999). This has produced a total of six compari-
sons, and so far we have interpreted four of them. The two we have not interpreted are the
comparisons that cross both group and time, such as pre-test scores for LGBTQ students
versus post-test scores for cisgender/heterosexual students. Those comparisons do not make much sense to interpret in this
case. In fact, we would encourage only interpreting the comparisons necessary to answer
the research question. As we have discussed in previous chapters, conducting and inter-
preting many different comparisons has the potential to increase Type I error rates.
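The four comparisons interpreted above can be approximated in code as well, reusing the wide table from the earlier mixed ANOVA sketch. Independent t-tests compare the two groups at each time point and paired t-tests compare the two time points within each group; these are uncorrected, unlike jamovi's Tukey-adjusted output, so the p-values will not match exactly.

    from scipy.stats import ttest_ind, ttest_rel

    lgbtq = wide[wide["group"] == "LGBTQ"]
    cishet = wide[wide["group"] == "Cisgender/Heterosexual"]

    print("groups at pre-test:", ttest_ind(lgbtq["pre"], cishet["pre"]).pvalue)
    print("groups at post-test:", ttest_ind(lgbtq["post"], cishet["post"]).pvalue)
    print("LGBTQ pre vs. post:", ttest_rel(lgbtq["pre"], lgbtq["post"]).pvalue)
    print("cis/het pre vs. post:", ttest_rel(cishet["pre"], cishet["post"]).pvalue)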

WRITING UP THE RESULTS


In writing up the results, we suggest the following basic format:

1. What test did you use, and why?


2. Note any issues with the statistical assumptions, including any corrections that
were needed.
3. What is the result of the omnibus test?
4. If significant, report and interpret the effect size. (If nonsignificant, report effect
size alongside F and p in #3.)
5. What is the appropriate follow-up analysis? (Note: If the interaction is not signifi-
cant, the follow-up is to examine the two main effects.)
6. What is the result of the follow-up analysis?
7. How do you interpret the pattern of results?

For our above example, we’ll provide a sample Results section using the comparison by
groups.
1. What test did you use, and why?
We used a mixed ANOVA to determine if student reports of bullying signifi-
cantly differed across the interaction of LGBTQ students versus cisgender/het-
erosexual students, and from before (pre-test) to after (post-test) an anti-bullying
intervention.
2. Note any issues with the statistical assumptions, including any corrections that
were needed.
No issues with the statistical assumptions found. Homogeneity of variance was
met, and sphericity did not apply. Note that, in this example, to test normality, we
would need to combine the scores from pre- and post-test data, just like we did
with the within-subjects ANOVA in Chapter 14.
3. What is the result of the omnibus test?
There was a significant difference based on the interaction (F1, 8 = 7.538, p = .025).
4. If significant, report and interpret the effect size. (If nonsignificant, report effect
size alongside F and p in #3.)
About 49% of the variance in bullying was explained by the combination of student
groups (LGBTQ versus cisgender/heterosexual) and change from pre- to post-test
(η2 = .485).
5. What is the appropriate follow-up analysis? (Note: If the interaction is not signifi-
cant, the follow-up is to examine the two main effects.)
To determine how LGBTQ and cisgender/heterosexual students differed at pre-test
and at post-test, we used Tukey post-hoc tests.
6. What is the result of the follow-up analysis?
There was a significant difference between the two groups at pre-test (p = .003), but
there was no significant difference between the two groups at post-test (p = .104).
Experiences of bullying significantly decreased from pre- to post-test for LGBTQ
students (p = .019), but there was no significant change for cisgender/heterosexual
students (p > .999).
7. How do you interpret the pattern of results?
Prior to the anti-bullying intervention, LGBTQ students reported significantly
higher rates of bullying than cisgender/heterosexual peers; after the intervention,
however, there was no significant difference in reports of bullying.

Results

We used a mixed ANOVA to determine if student reports of bullying


significantly differed across the interaction of LGBTQ students versus
cisgender/heterosexual students, and from before (pre-test) to after
(post-test) an anti-bullying intervention. There was a significant differ-
ence based on the interaction (F1, 8 = 7.538, p = .025), and the interaction
was ordinal. About 49% of the variance in bullying was explained by
the combination of student groups (LGBTQ versus cisgender/hetero-
sexual) and change from pre- to post-test (η2 = .485).
To determine how LGBTQ and cisgender/heterosexual students dif-
fered at pre-test and at post-test, we used Tukey post-hoc tests. There
was a significant difference between the two groups at pre-test (p =
.003), but there was no significant difference between the two groups at
post-test (p = .104). Experiences of bullying significantly decreased
from pre- to post-test for LGBTQ students (p = .019), but there was no
significant change for cisgender/heterosexual students (p > .999). See
Table 16.1 for descriptive statistics. Prior to the anti-bullying interven-
tion, LGBTQ students reported significantly higher rates of bullying
than cisgender/heterosexual peers; after the intervention, however, there
was no significant difference in reports of bullying.

Table 16.1
Descriptive Statistics by Group

Pre-Test Post-Test

Group M SD M SD

LGBTQ 5.800 .837 4.400 1.140
Cisgender/Heterosexual 2.800 .837 2.800 1.095
Total 4.300 1.767 3.600 1.350

As a reminder, it would be acceptable to include descriptive statistics within the text,
but here we have opted to create a table instead. We described the method for getting
descriptive statistics in previous chapters. But here, we would go to the Analyses tab,
then Exploration, then Descriptives. We then would select Pre and Post and move them
to the Variables box, and move Group to the Split by box, then select Mean and
Standard Deviation under Statistics.
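The same descriptive statistics can also be produced in code, again reusing the wide table from the mixed ANOVA sketch earlier in this chapter; this reproduces the structure of Table 16.1 (means and standard deviations by group, plus totals).

    # Means and standard deviations for pre- and post-test, split by group.
    print(wide.groupby("group")[["pre", "post"]].agg(["mean", "std"]).round(3))

    # Totals across both groups.
    print(wide[["pre", "post"]].agg(["mean", "std"]).round(3))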
In the next chapter, we will explore two examples from the published research litera-
ture that used the mixed ANOVA to further illustrate its applications.

Notes
1 Technically, the assumption applied but was automatically met. We often describe the
assumption of sphericity as only applying when there are more than two levels of the within-­
subjects independent variable, but actually it is simply always met when there are only two
levels. Because sphericity is the assumption that the pairwise error variances are equal, and
with only two levels on the within-­subjects independent variable there is only one possible
‘pair’, the assumption is automatically met (there is no other pair to which to compare). So, it’s
not technically true that the assumption does not apply, but it is an easy way to think about
why we don’t need Mauchly’s test if there are only two levels to the within-­subjects indepen-
dent variable.
2 In this case, the test statistics for the within-­subjects variable and interaction are identical –
this is unusual and is caused by the way that we ‘made up’ data for use in the example. We
note it here in case it might initially seem like an error or appear curious—it’s just an artifact
of how the data were created for this example.
17
Mixed ANOVA case studies

Case study 1: implicit prejudice about transgender individuals 255


Research questions 256
Hypotheses 256
Variables being measured 256
Conducting the analysis 257
Write-up 258
Case study 2: suicide prevention evaluation 259
Research questions 260
Hypotheses 260
Variables being measured 260
Conducting the analysis 260
Write-up 261
Note 263

In the previous chapter, we explored the mixed ANOVA using a made-up example and
some fabricated data. In this chapter, we will present two examples of published research that used the mixed ANOVA. For each example, we encourage you to:

1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the mixed ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.

CASE STUDY 1: IMPLICIT PREJUDICE ABOUT TRANSGENDER INDIVIDUALS
Kanamori, Y., Harrell-Williams, L. M., Xu, Y. J., & Ovrebo, E. (2019). Transgender affect
misattribution procedure (transgender AMP): Development and initial evaluation of
performance of a measure of implicit prejudice. Psychology of Sexual Orientation and
Gender Diversity, Online first publication. https://doi.org/10.1037/sgd0000343.


In this article, the researchers report on a method of measuring implicit bias against
transgender people. To do so, they use a technique called the affect misattribution pro-
cedure. Participants received scores for neutral primes (where they reacted to neutral
words like “relevant,” “green,” and “cable”), and also reacted to a set of transgender
primes (where they reacted to words like “transgender” and “transman”). The procedure
is relatively involved, but participants were presented with one of the set of prime words,
then ambiguous stimuli (in this case, Chinese language symbols, which no participants
were familiar with), after which they rated the “pleasantness” of the ambiguous stimuli.
Their question was whether there would be a difference in ratings based on the priming
words (neutral versus transgender), with all participants completing both sets of ratings,
and on whether the participant had regular contact with transgender people (yes or no)
as a between-subjects variable.

Research questions
The authors present several questions in the full paper, but here we focus on one research
question: Was there a significant difference in ratings based on the interaction of prime
type (transgender versus neutral primes) and whether the participant had regular con-
tact with transgender people?

Hypotheses
The authors hypothesized the following related to ratings:

H0: There was no significant difference in ratings based on the interaction of prime


type (transgender versus neutral primes) and whether the participant had regular
contact with transgender people. (MTransgenderPrimeXNoContact = MTransgenderPrimeXPriorContact
= MNeutralPrimeXNoContact = MNeutralPrimeXPriorContact)
H1: There was a significant difference in ratings based on the interaction of prime
type (transgender versus neutral primes) and whether the participant had regular
contact with transgender people. (MTransgenderPrimeXNoContact ≠ MTransgenderPrimeXPriorContact
≠ MNeutralPrimeXNoContact ≠ MNeutralPrimeXPriorContact)

Variables being measured


The between-subjects independent variable was whether participants had prior contact
with transgender people, which was measured by self-report. The ratings, which serve as
the dependent variable, were measured using the transgender AMP procedure, which the
authors describe in great detail in this article. They provide reliability and validity evi-
dence from multiple sources, which is particularly important given that they are among
the first to use this procedure. In the end, the ratings serve as a proxy for implicit bias,
where higher scores indicate higher levels of implicit bias. The scoring and procedure are
rather involved, and we encourage you to read the original article to understand more
about their use of this novel measurement method.

Conducting the analysis


1. What test did they use, and why?
The authors used a mixed ANOVA to determine if ratings would vary based on the
interaction of prime type (neutral versus transgender primes) and whether or not
participants had prior contact with transgender people. Note that all participants
completed ratings following both sets of primes, making prime type a within-sub-
jects variable, while prior contact was a between-subjects variable.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The authors used averaged ratings, where the possible range was 1-5, as the
dependent variable. The level of measurement was interval.
b. Normality of the dependent variable
The authors do not report on normality. As a preliminary step before the analysis, we would normally check skewness and kurtosis as compared with their standard errors, but it is fairly common for those checks not to be reported in the final manuscript when the data are normally distributed.
c. Observations are independent
The authors note no issues with dependence, and there are no obvious nesting
factors to consider.
d. Random sampling and assignment
The authors do not offer information on the sampling strategy, but it appears it
was probably a convenience sample. Participants are not randomly assigned to
groups—whether they had prior contact with transgender people could not be
randomly assigned. Participants were randomly assigned to order of adminis-
tration, so the order of administration was randomly counterbalanced.
e. Homogeneity of variance
This assumption was met for the transgender prime ratings (F1, 78 = .771,
p = .383), and was also met for the neutral prime ratings (F1, 78 = .244, p = .622).
f. Sphericity
The assumption of sphericity is not applicable to this design because there are
only two levels to the within-subjects variable. Because sphericity deals with
pairwise error variances, it cannot be assessed when there are only two levels of
the within-subjects variable.
3. What was the result of that test?
There was no significant difference in ratings based on the interaction (F1, 78 = 1.107,
p = .296).1
4. What was the effect size, and how is it interpreted?
About 1% of the variance in ratings was explained by the interaction (η2 = .014).
However, we will not interpret this effect size in the manuscript because there is no
significant difference, so interpreting the size of that difference does not make sense.
5. What is the appropriate follow-up analysis?
Because there was no significant interaction, we then examined the main effects of
prime type and prior contact.

6. What is the result of the follow-up analysis?


There was a significant difference between ratings after the transgender prime ver-
sus ratings after the neutral prime (F1, 78 = 4.138, p = .045, η2 = .050). There was
also a significant difference in ratings between those with prior contact with trans-
gender people and those without prior contact (F1, 78 = 11.070, p = .001, η2 = .124).
(There is no need for any additional follow-up like post-hoc tests or pairwise com-
parisons because this is a 2x2 design.)
7. What is the pattern of group differences?
Participants provided higher ratings, indicating more implicit bias, following the
transgender word primes than the neutral word primes. Also, those without prior
contact with transgender people provided higher ratings, indicating more implicit
bias, than those with prior contact.
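The authors did not run their analysis in Python, but if you want to try the steps above outside of jamovi, a minimal sketch using the pingouin library might look like this (the file name amp_long.csv and the column names id, prime, contact, and rating are illustrative assumptions about a long-format version of the dataset, not names from the published article):

    import pandas as pd
    import pingouin as pg

    # Long format: one row per participant per prime type (illustrative names)
    data = pd.read_csv("amp_long.csv")

    # Mixed ANOVA: prime type (within-subjects) by prior contact (between-subjects)
    aov = pg.mixed_anova(data=data, dv="rating", within="prime",
                         subject="id", between="contact")
    print(aov.round(3))

The resulting table should report F, degrees of freedom, p, and an effect size for the between-subjects effect, the within-subjects effect, and their interaction, which can be compared against the values reported above.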

Write-up

Results

We used a mixed ANOVA to determine if ratings would vary based on the


interaction of prime type (neutral versus transgender primes) and whether
or not participants had prior contact with transgender people. There was
no significant difference in ratings based on the interaction (F1, 78 = 1.107,
p = .296, η2 = .014). Because there was no significant interaction, we then
examined the main effects of prime type and prior contact. There was a
significant difference between ratings after the transgender prime versus
ratings after the neutral prime (F1, 78 = 4.138, p = .045, η2 = .050). There
was also a significant difference in ratings between those with prior con-
tact with transgender people and those without prior contact (F1, 78 =
11.070, p = .001, η2 = .124). See Table 17.1 for descriptive statistics by
cell. Participants provided higher ratings, indicating more implicit bias,
following the transgender word primes than the neutral word primes.
Also, those without prior contact with transgender people provided higher
ratings, indicating more implicit bias, than those with prior contact.


Table 17.1
Descriptive Statistics for Ratings

Prime Type           Prior Contact with          M        SD       N
                     Transgender People
Transgender words    Prior contact               2.300    .490     35
                     No contact                  2.590    .430     45
                     Total                       2.463    .477     80
Neutral words        Prior contact               2.230    .400     35
                     No contact                  2.370    .390     45
                     Total                       2.309    .398     80

The table would go on a new page after the references page, with one table per page. In
this case, a figure is probably unnecessary because there is no interaction to visualize.

CASE STUDY 2: SUICIDE PREVENTION EVALUATION


Shannonhouse, L., Lin, Y. D., Shaw, K., Wanna, R., & Porter, M. (2017). Suicide interven-
tion training for college staff: Program evaluation and intervention skill measurement.
Journal of American College Health, 65(7), 450–456. https://doi.org/10.1080/07448481.2
017.1341893.
In this article, the authors report on an experimental project to improve college and
university staff attitudes and competencies about suicide and suicide prevention. The
authors randomly assigned participants to receive suicide intervention training or to
be on a waitlist for training (with waitlist participants serving as the control group).
All participants completed a pre-test set of surveys and a post-test set of surveys. The
authors tested several outcomes, but in this case study, we focus on one: attitudes about
suicide.

Research questions
Related to attitudes about suicide, the research question was: Was there a difference in
attitudes about suicide based on the interaction of pre- versus post-test and placement in
the experimental versus control groups?

Hypotheses
The authors hypothesized the following related to attitudes about suicide:

H0: There was no difference in attitudes about suicide based on the interaction of pre-
versus post-test and placement in the experimental versus control groups.
(MPreXControl = MPreXIntervention = MPostXControl = MPostXIntervention)
H1: There was a difference in attitudes about suicide based on the interaction of pre-
versus post-test and placement in the experimental versus control groups.
(MPreXControl ≠ MPreXIntervention ≠ MPostXControl ≠ MPostXIntervention)

Variables being measured


The between-subjects independent variable was experimental group versus control
group, which was randomly assigned. The dependent variable was attitudes about sui-
cide. It was measured using items adapted from the Washington Youth Suicide Prevention
Program. The authors report a low coefficient alpha for internal consistency reliability
at pre-test (.51) but a good reliability coefficient at post-test (.84). They do not report
validity evidence and focus their discussion on reliability, including test-retest reliability.

Conducting the analysis


1. What test did they use, and why?
The authors used a mixed ANOVA to determine if there was a significant differ-
ence in attitudes toward suicide based on the interaction of pre-test versus post-
test and whether participants were in the experimental or control group.
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variable is based on averaged Likert-type data, so is interval.
b. Normality of the dependent variable
The authors do not discuss normality in the article. This is fairly typical when the
data are normally distributed. Normally, we would test normality using skew-
ness and kurtosis statistics as compared to their standard errors, but it may not
be included in the final manuscript if the data were normal. The online course
dataset will produce a normal distribution due to how the data were simulated.
c. Observations are independent
The authors note no factors that might create nested structure or otherwise cre-
ate dependence in the observations.
d. Random sampling and assignment
The sample is not a random sample and appears to be a convenience sample.
Participants were randomly assigned to the experimental or control group.
However, because the design is longitudinal (pre-test to post-test), random assignment of order of administration, or counterbalancing, was not possible.
e. Homogeneity of variance
The assumption of homogeneity of variance was met for pre-test data (F1, 70 = .631,
p = .430) and for post-test data (F1, 70 = 1.427, p = .236).
f. Sphericity
The assumption of sphericity did not apply because there were only two levels of
the within-subjects variable (pre-test and post-test).
3. What was the result of that test?
There was a significant difference in attitudes about suicide based on the interac-
tion (F1, 70 = 21.514, p < .001).
4. What was the effect size, and how is it interpreted?
The interaction explained about 24% of the variance in attitudes about suicide
(η2 = .235).
5. What is the appropriate follow-up analysis?
To follow up on the significant ordinal interaction, we used Tukey post-hoc tests.
6. What is the result of the follow-up analysis?
There was no significant difference between the experimental and control group at
pre-test (p = .522). At post-test, however, the experimental and control groups were
significantly different in attitudes about suicide (p < .001). There was no significant
difference between pre-test and post-test scores for the control group (p = .999),
but there was a significant difference between pre-test and post-test among the
experimental group (p < .001).
7. What is the pattern of group differences?
The experimental and control groups had similar scores before the intervention,
but after the intervention the experimental group had significantly higher scores
on attitudes about suicide. (Note: Here we want to re-emphasize this design as be-
ing particularly well-suited for experimental research. In this case, we can see that
the randomly assigned groups truly did not differ at pre-test, but at post-test there
is a large difference between the groups. This offers some strong evidence for the
efficacy of this intervention on attitudes about suicide.)
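As with the first case study, the published analysis was not conducted in Python, but a brief sketch of the descriptive follow-up (cell means like those in Table 17.2 and a plot in the spirit of Figure 17.1) might look like the following, assuming a long-format file suicide_attitudes.csv with illustrative column names id, group, time, and attitude:

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv("suicide_attitudes.csv")

    # Cell means for each combination of test occasion and group
    cell_means = (data.groupby(["time", "group"])["attitude"]
                      .mean()
                      .unstack("group")
                      .reindex(["Pre-test", "Post-test"]))  # assumes these labels
    print(cell_means.round(3))

    # One line per group across pre- and post-test, similar to Figure 17.1
    cell_means.plot(marker="o")
    plt.ylabel("Attitudes about suicide")
    plt.show()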

Write-up

Results

We used a mixed ANOVA to determine if there was a significant difference in attitudes toward suicide based on the interaction of pre-test versus post-test and whether participants were in the experimental or control group. There was a significant difference in attitudes about suicide based on the interaction (F1, 70 = 21.514, p < .001). The interaction explained
about 24% of the variance in attitudes about suicide (η2 = .235). To follow
up on the significant ordinal interaction, we used Tukey post-hoc tests.
There was no significant difference between the experimental and control
group at pre-test (p = .522). At post-test, however, the experimental and
control groups were significantly different in attitudes about suicide (p <
.001). There was no significant difference between pre-test and post-test
scores for the control group (p = .999), but there was a significant differ-
ence between pre-test and post-test among the experimental group (p <
.001). See Table 17.2 for descriptive statistics, and Figure 17.1 for a plot
of cell means. The experimental and control groups had similar scores
before the intervention, but after the intervention the experimental group
had significantly higher scores on attitudes about suicide.

Table 17.2
Descriptive Statistics for Attitudes about Suicide

Test         Group           M         SD       N
Pre-test     Control         12.800    2.150    25
             Experimental    20.020    2.480    47
             Total           19.763    2.382    72
Post-test    Control         19.200    2.290    25
             Experimental    23.400    1.800    47
             Total           21.942    2.815    72

[Figure 17.1 appears here: a plot of the cell means for attitudes about suicide, with pre-test and post-test on the horizontal axis, attitude scores on the vertical axis, and separate lines for the Control and Experimental groups.]

Figure 17.1  Plot of Cell Means for Attitudes about Suicide.

The table would be placed after the references page, on a new page, with one table per
page.
In this case, it is probably also appropriate to include a figure to allow readers to visualize
the interaction. The figure would go on a new page after any tables.
In concluding this final chapter of case studies, we want to restate our advice to read
the studies on which these cases are based. Pay attention to how different authors in
different fields and different publications put an emphasis on varying aspects of the anal-
ysis, use different terminology, or write about the analyses differently. Also notice that
many of the authors have written about their work in a more aesthetically pleasing or
creative manner than we have. Our Results sections follow closely the outlines we’ve sug-
gested in the analysis chapters, but it is clear from reading the published work that there
are many ways to write, some of which may be easier to read.

Note
1 As a reminder, these values from the online course dataset will not match the published work
exactly because our method of simulating the data cannot precisely replicate the published
study for within-subjects designs. However, the pattern of differences, means, and standard
deviations will be the same, and this should not be read as casting doubt on the published
results.
Part V
Considering equity in quantitative
research

18
Quantitative methods for
social justice and equity

Theoretical and practical considerations1

Quantitative methods are neither neutral nor objective 268


Quantitative methods and the cultural hegemony of positivism 269
Dehumanization and reimagination in quantitative methods 270
Practical considerations for quantitative methods 270
Measurement issues and demographic data 270
Other practical considerations 271
Possibilities for equitable quantitative research 272
Choosing demographic items for gender and sexual identity 273
Note 274

Even a superficial review of the research cited in policy briefs, produced by and for U.S.
federal agencies, and referred to in public discourse would reveal that the vast major-
ity of that research is quantitative. In fact, some federal agencies have gone so far as to
specify that quantitative methods, and especially experimental methods, are the gold
standard in social and educational research (Institute for Education Sciences, 2003). In
other words—those with power in policy, funding, and large-scale education initiatives
have made explicit their belief that quantitative methods are better, more objective, more
trustworthy, and more meritorious than other methodologies.
Visible in the national and public discourses around educational research is the nat-
uralization of quantitative methods, with other methods rendered as exotic or unusual.
In this system, quantitative methods take on the tone of objectivity, as if the statistical
tests and theories are some sort of natural law or absolute truth. This is in spite of the
fact that quantitative methods have at least as much subjectivity and rocky history as
other methodologies. But because they are treated as if they were objective and without
history, quantitative methods have a normalizing power, especially in policy discourse.
In part because of that normalization, quantitative methods are also promising for use
in research for social justice and equity. The assumption that these methods are superior,
more objective, or more trustworthy than qualitative and other methodologies can be
a leverage point for those working to move educational systems toward equity. Several


authors (e.g., Strunk & Locke, 2019) have written about specific approaches to and appli-
cations of quantitative methods for social justice and equity, but our purpose in this
chapter is to more broadly review the practical and theoretical considerations in using
quantitative methods for equitable purposes. We begin by exploring the ways in which
quantitative methods are not, in fact, neutral given their history and contemporary uses.
We then describe the ways that quantitative methods operate in hegemonic ways in
schools and broader research contexts. Next, we examine the potential for dehumani-
zation in quantitative methods and how researchers can avoid those patterns. We then
offer practical considerations for doing equitable quantitative research and highlight the
promise of quantitative work in social justice and equity research.

QUANTITATIVE METHODS ARE NEITHER NEUTRAL NOR OBJECTIVE


Although in contemporary discourse, quantitative methods are often presented as if
they are neutral, objective, and dispassionate, their history reveals they are anything but.
One of the earliest and most prominent uses of quantitative methods was as a means of
social stratification, classification, and tracking. In one such example, early advances in
psychometric testing were in the area of intelligence testing. Those efforts were explic-
itly to determine “ability” levels among candidates for military service and officer corps
(Bonilla-Silva & Zuberi, 2008). In other words, the earliest psychometric tests helped to
determine who was fit to fight and die, and who was fit to lead and decide.
Those same tests would come to be used to legitimate systems of white supremacy
and racial stratification. Efforts such as The Bell Curve (Herrnstein & Murray, 1994) used
intelligence tests as evidence of the inferiority of people of color, and thus to justify their
marginalized place in society. That book would become highly influential, though con-
tested, in psychological and educational research. Of course, since its publication, The
Bell Curve has been criticized and debunked by numerous scholars (Richardson, 1995),
as have intelligence tests in general (Steele & Aronson, 1995). In addition to demon-
strating the flawed logic and problematic methods in The Bell Curve, others have also
demonstrated that intelligence tests as a whole are racially biased and culturally embedded, and that their scores are affected by a wide range of outside factors (Valencia & Suzuki, 2001). Still,
the work in intelligence testing, a key early use of quantitative methods, continues to
animate white supremacist discourses and oppressive practices (Kincheloe, Steinberg, &
Gresson, 1997). Meanwhile, as a whole, quantitative methodologists have not engaged
in critical reflection on the history of our field, and have instead argued for incremental
changes, ethical standards, or methodological tweaks to mitigate documented biases in
our tests and methods (DeCuir & Dixson, 2004).
We here use intelligence testing as one example of the ways quantitative methods have
served oppressive ends. However, there are many more examples. Statistical comparisons
have been used to “track” children into various educational pathways (e.g., college prep,
vocational education, homemaking) in ways that are gendered, racialized, and classed
(Leonardo & Grubb, 2018). At one point, quantitative methods were used to justify the
“super-predator” rhetoric that vastly accelerated the mass incarceration epidemic in the
United States (Nolan, 2014). Randomized controlled trials (the Institute of Education
Sciences “gold standard” method) have contributed to the continued de-professional-
ization of teachers and a disregard for context and societal factors in education (IES,
2003). It would be nearly impossible to engage in any review of the ways quantitative
methods have been used in the U.S. context that would not lead to the conclusion they
have exacerbated, enabled, and accelerated white supremacist cisheteropatriarchy as the dominant ideology.
Beyond these specific examples is the larger ideological nature of quantitative meth-
ods. These methods come embedded with hidden assumptions about epistemology,
knowledges, and action. As we will describe below, though, quantitative methods are
often cleansed of ideological contestation in ways that render those assumptions and
beliefs invisible, with researchers regarding their quantitative work as objective truth
(Davidson, 2018). Yet even in areas often treated as generic or universal, like motivation
theory, quantitative work often embeds assumptions of whiteness in the theoretical and
empirical models (Usher, 2018). Quantitative methods, then, are caught up in ideologi-
cal hegemony in ways that are both hidden and powerful.

QUANTITATIVE METHODS AND THE CULTURAL HEGEMONY OF POSITIVISM
Giroux (2011) describes a culture of positivism that pervades U.S. education and is linked
with quantitative methods. Positivism is a default position—the objective and absolute
nature of reality are treated as taken for granted, leaving any other position as exotic,
abnormal, and othered. Simultaneously, the culture of positivism acts to remove a sense of
historicity from teaching and learning (Giroux, 2011). Students learn via the hidden cur-
riculum that the current mode of thinking and validating knowledge is universal and has
not shifted meaningfully. Of course, that is simply not true, and vast changes in knowledge
generation and legitimization have occurred rapidly. But in part through this lost historic-
ity, positivistic thought is stripped of any sense of contestation. There might be alternative
views presented, but the notions of positivism are presented as without any true contention.
Within that dynamic, and embedded in a culture of positivism, quantitative methods
are also stripped of any sense of controversy. Other methods exist, and we can learn
about them, but as an alternative to quantitative methods. Statistical truths are presented
as the best truths and as truths that can only be countered with some other, superior,
quantitative finding. In fact, quantitative methods have become so enmeshed with the
culture of positivism that quantitative methods instructors routinely suggest that their
work is without any epistemological tone at all—it is simply normal work.
That claim does not hold up to any amount of scrutiny, though. The statistical models
themselves are infused with positivism throughout. Take the most popular statistical
model—the General Linear Model (GLM). GLM tests have assumptions that must be
met for the test to be properly applied, and those assumptions belie the positivist and
post-positivist nature of the model. Assuming random assignment not only assumes
experimental methods but also elevates those methods as ideal or better than other
methods. Assuming predictors are measured without error implies that anything such
as error-free observations exists and centers the concern over error and measurement (a
central feature of post-positivist thinking). The assumption of independent observation
directly stems from a positivist approach and suggests that more interpretivist or con-
structivist approaches lack adequate rigor. Also, all of these models position explanation,
prediction, and control as the goals of research, goals that critical scholars often critique.
While much more can be said about deconstructing the GLM assumptions (Strunk, in
press) and the assumptions of other approaches, it is clear that those models are invested
in the culture of positivism. That investment represents a substantial challenge for the
use of quantitative methods in critical research for social justice and equity.

DEHUMANIZATION AND REIMAGINATION IN QUANTITATIVE METHODS
Relatedly, much of the work on topics of equity and justice, like research on race, sexual-
ity, gender identity, income, indigeneity, ability, and other factors, proceeds in quantita-
tive work from a deficit perspective. By comparing marginalized group outcomes (as is
often done) to privileged group outcomes, the analysis often serves to frame marginal-
ized groups as deficient in one way or another. While such comparisons can be useful in
documenting inequitable outcomes, the results also highlight disparities that are already
well documented and that can serve oppressive purposes. In fact, the ethical standards
for tests and measurement include mention of the fact that tests that put marginalized
groups in an unfavorable light should be reconsidered (American Educational Research
Association et al., 2014).
Another trend in quantitative work that studies inequity and inequality is to focus on
resiliency or strengths (Ungar & Liebenberg, 2011). The motive in those approaches is
admirable. Such researchers seek to shift the focus from deficits to assets, highlighting
the ways in which marginalized communities create opportunities and generate thriving
(Reed & Miller, 2016). However, those approaches have pitfalls too. The danger is that by
suggesting ways in which marginalized groups can build resiliency or capitalize on their
strengths, researchers are again recentering the “problem” as residing with marginalized
groups. To put it another way—one might ask who is required to have resiliency and who
can succeed without it? Members of marginalized groups and individuals in oppressive
systems require much more resiliency than individuals whose systems were created to
benefit. Because of that, the push for resiliency and assets research actually has the poten-
tial to further oppress by placing the burden of success on people for whom our society
was designed to create failure. Instead, researchers can focus their attention on the systems,
discourses, and practices that create marginality and how those systems can be re-created.
Researchers, though, can reimagine the purposes and possibilities of quantitative
methods research. Quantitative methods can serve equitable aims and can move toward
social justice. Doing so is difficult work: the very process of turning human beings into
numbers is inherently dehumanizing. However, approaching quantitative methods from critical theoretical perspectives, and being thoughtful, reflexive, and critical about how the methods are used, about the methodological literature, and about the researchers' own positionality, can generate more humanizing possibilities.

PRACTICAL CONSIDERATIONS FOR QUANTITATIVE METHODS


How, then, can quantitative researchers better position their work to achieve social justice
and equity aims? We highlight several practical issues for researchers to consider
in their use of quantitative methods. We do not suggest ideal or right answers but hope
that reflecting carefully on some of these questions can lead to more equitable quantita-
tive work. These considerations have to do with the meaning of GLM statistics, issues of
measurement, issues of research design, and questions about inferences and conclusions.

Measurement issues and demographic data


Measurement issues are one area that presents challenges for equitable quantitative
work. The mere act of quantification can be dehumanizing. Reducing human lives and
the richness of experiences to numbers, quantities, and scales distances researchers from
participants and the inferences from their experiences. Moreover, researchers must make
difficult decisions about the creation of categorical variables. While many students and
established scholars alike default to federally defined categories (like the five federally
defined racial categories of White, non-Hispanic; Black, non-Hispanic; Hispanic; Asian;
or Native American), those categories are rarely sufficient or appropriate. Researchers,
such as Teranishi (2007), have pointed out the problems created by these overly simplis-
tic categories and by the practice of collapsing small categories together. When catego-
ries are not expansive enough, or when they are combined into more generic categories
for data analysis, much of the variation is lost. Moreover, asking participants to select
identity categories with which they do not identify can, in and of itself, be oppressive.
Thinking carefully about the identities of research participants and how to present sur-
vey options is an important step in humanizing quantitative research.
Many times, researchers simply throw demographic items on the end of a survey with-
out much consideration for how those items might be perceived or even how they might
use the data. We suggest that researchers only ask for demographic data when those
data are central to their analysis. In other words, if the research questions and planned
analyses will not make use of demographic items, consider leaving them out completely.
If those items are necessary, researchers should carefully consider the wording of those
items. One promising practice is to simply leave response options open, allowing partic-
ipants to type in the identity category of their choice. For example, rather than providing
stock options for gender, researchers can simply ask participants their gender and allow
them to type in a freeform response. One issue with that approach is that it requires more
labor from researchers to code those responses into categories. However, that labor is
worthwhile in an effort to present more humanizing work. Researchers might also find
categories they did not consider are important to participants, enriching the analysis.
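As a small illustration of that hand-coding step, a rough sketch in Python with pandas might look like the following (the typed responses and the mapping are invented for illustration; in practice, the categories should be built from what participants actually wrote):

    import pandas as pd

    # Invented open-ended gender responses (illustration only)
    responses = pd.Series(["woman", "Woman", "genderqueer ", "man", "nonbinary"])

    # Hypothetical mapping from typed responses to broader analysis categories
    mapping = {"woman": "Woman", "man": "Man",
               "genderqueer": "Nonbinary/Genderqueer",
               "nonbinary": "Nonbinary/Genderqueer"}

    coded = responses.str.strip().str.lower().map(mapping).fillna("Another identity")
    print(coded.value_counts())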
In some cases, it is impractical to hand-code responses. This is particularly true in
large-scale data collection, where there might be thousands of participants. It might also
be difficult when the study is required to align with institutional, sponsor, or governmen-
tal data. For example, it is common for commissioned studies to be asked to determine
“representativeness” by comparing sample demographics to institutional or regional sta-
tistics. In such cases, a strategy that might be useful is to allow the open-response demo-
graphic item, followed by a forced choice item with the narrower options. In our work,
we have used the phrasing, “If you had to choose one of the following options, which one
most closely matches your [identity]?” Doing so allows for meeting the requirements of
the study, while also allowing more expansive options for use in subsequent analyses.
As one example, we provide below a sample of decisions researchers might make
around collecting data on gender and sexual identities. Similar thinking could inform
data collection on a number of demographic factors, as we illustrate in the Appendix
found at the end of this chapter.

Other practical considerations


One of the primary issues, as we have noted above, with using quantitative methods
for critical purposes is that those methods were not designed for such work. They were
imagined within a post-positivist framework and often fall a bit flat outside of that epis-
temological perspective. Part of that, as we discussed above, is related to the assumptions
of statistical models like the GLM, which make a number of post-positivist assumptions
about the nature of research and the data. A practical struggle for researchers using those
methods, then, is to work against those post-positivist impulses. One way that research-
ers can do this is by openly writing about their assumptions, their epistemology, their
theoretical framework, and how they approach the tests. That type of writing is atypical
in quantitative methods but useful.
One important step in using quantitative methods for social justice and equity is to reject
the notion that these tests are somehow objective. All research is informed by researcher
and participant subjectivities. As others have suggested, the very selection of research
questions, hypotheses, measurement approaches, and statistical tests are all ideological
and subjective choices. While quantitative work is often presented as if it was devoid of
values, political action, and subjectivity, such work is inherently political, unquestionably
ideological, and always subjective. A small but important step is acknowledging that sub-
jectivity, the researcher’s positionality, and the theoretical and ideological stakes. It is also
important for researchers to acknowledge when their subjectivities diverge from the com-
munities they study. As Bonilla-Silva and Zuberi (2008) convincingly argue, these meth-
ods were created through the logics of whiteness and, unless researchers work against that
tendency, will center whiteness at the expense of all other perspectives and knowledges.
Another practical strategy is to approach the data and the statistical tests more reflex-
ively. One of the problems with quantitative work is that by quantifying individuals
researchers inherently dehumanize their participants. Researchers using quantitative
methods must actively work to be more reflexive and to engage with the communities
from which their data are drawn in more continuous and purposeful ways. There are
statistics that are more person-centered than variable-centered (like cluster analysis,
multidimensional scaling, etc.), but even in those approaches, people are still reduced to
numbers. As a result, writing up those results requires work to rehumanize the partici-
pants and their experiences.
One way in which this plays out is in how researchers conceptualize error. Most quan-
titative models evidence an obsession with error. In fact, advances in quantitative meth-
ods over the past half-century have almost entirely centered around the reduction of and
accounting for error, sometimes to the point of ridiculousness. Lost in the quest to reduce
error is the fact that what we calculate as error or noise is often deeply meaningful. For
example, when statistical tests of curriculum models treat variation among teachers as
error or even noncompliance, they obscure the real work that teachers do of modifying
curricula to be more culturally responsive and appropriate for their individual students.
When randomly assigning students to receive different kinds of treatments or instruc-
tion, researchers treat within-group variation as error when it might actually be attrib-
utable to differences in subject positioning and intersubjectivity. Quantitative methods
might not ever be capable of fully accounting for the richness of human experiences that
get categorized as error, but researchers can work to conceptualize of error differently,
and write about it in ways that open possibilities rather than dismiss that variation.

POSSIBILITIES FOR EQUITABLE QUANTITATIVE RESEARCH


Various researchers have already imagined new uses for quantitative methods that accom-
plish social justice and equity aims. Researchers have used large-scale quantitative data to
document the impact of policies and policy changes on expanding or closing gaps. Such
evidence is often particularly useful in convincing stakeholders (such as policymakers or
legislators) that the injustices marginalized communities voice are real and demand their
attention. While it is a sad commentary that the voices of marginalized communities are
not sufficient to move policymakers to action, the naturalized sense of quantitative meth-
ods as objective or neutral can be useful in shifting those policy conversations.
Others have attempted to integrate various critical theoretical frameworks with quan-
titative methods. One such approach is QuantCrit, which attempts to merge critical race
theory (CRT) and quantitative methods. Much has been written elsewhere about this
approach, but it has been used in research on higher education to challenge whiteness
in college environments (Teranishi, 2007). Similarly, experimental methods have been
used to document the presence of things like implicit bias, the collective toll of microag-
gressions, and to attempt to map the psychological processes of bias and discrimination
(Koonce, 2018; Strunk & Bailey, 2015).
Quantitative methods, such as those described in this text, can be used in equitable
and socially just ways. However, researchers must carefully think about the implications
of their work, how that work is intertwined with issues of inequity and oppression, and
how they can reimagine their approaches to work toward equity. Throughout the case
studies and examples in this text, we have been intentional to include examples that
speak to a commitment to social justice and equity. We have also included more “tradi-
tional” quantitative research examples to illustrate the broad array of approaches availa-
ble. But our hope is that researchers and students using this book will opt to move their
approaches toward more equitable, inclusive, and just methodologies.

CHOOSING DEMOGRAPHIC ITEMS FOR GENDER AND SEXUAL IDENTITY
First, to decide what demographic information you might collect, answer these questions:

• Is participant sex/gender central to the research questions and planned analyses?


Will you analyze or report based on gender? Is there a gender reporting require-
ment for your study or the outlets you plan to publish in?
• Are you writing about gender or sex?
• Sex is a biological factor, having to do with genital and genetic markers. In
most cases, collecting data on gender is the more appropriate and sufficient
option. If you need to collect this information, consider:
• An open response box in which participants can type their sex as assigned at birth.
• Sex as assigned at birth:
• Male
• Female
• Intersex
• Prefer not to respond
• Gender is a social construct, having to do with identity, gender presentation,
physical and emotional characteristics, and the internal sense of self partici-
pants hold. If you need to collect this information, consider:
• An open response box in which participants can type their gender iden-
tity. An example might look like:
• What is your gender identity? (e.g., man, woman, genderqueer, etc.)
• Gender identity (for adults):


• Agender
• Man
• Woman
• Nonbinary/Genderqueer/Genderfluid
• Two-spirit
• Another identity not listed here
• Gender identity (for children):
• Boy
• Girl
• Nonbinary/Genderqueer
• Two-spirit
• Gender expansive
• Another identity not listed here
• Do you need to collect information about whether participants are transgender?
• The term “transgender” typically refers to individuals for whom their gender
identity and sex as assigned at birth are not aligned. If you need to collect this
information, consider:
• Which do you most closely identify as?
• Cisgender (your gender identity and sex as assigned at birth are the same)
• Transgender (your gender identity and sex as assigned at birth are
different)
• Is participant sexual identity (sometimes called sexual orientation) central to the
research questions and planned analyses? Will you analyze based on sexual iden-
tity, or is there a reporting requirement for sexual identity in your intended publi-
cation outlet?
• If so, consider:
• An open response box in which participants can type their sexual orienta-
tion. An example might look like:
• What is your sexual identity? (e.g., straight, gay, lesbian, bisexual, pansex-
ual, asexual, etc.)
• Sexual identity:
• Straight/heterosexual
• Gay or lesbian
• Bisexual
• Pansexual
• Queer
• Asexual
• Another identity not listed here

Note
1 This chapter originally appeared in Strunk and Locke (2019) as a chapter in an edited vol-
ume. It has been modified and reproduced here by permission from Palgrave Macmillan, a
division of Springer. The original chapter appeared as: Strunk, K. K., & Hoover, P. D. (2019).
Quantitative methods for social justice and equity: Theoretical and practical considerations.
In K. K. Strunk & L. A. Locke (Eds.), Research methods for social justice and equity in educa-
tion (pp. 191–201). New York, NY: Palgrave.
Appendices

Table A1  Percentiles and one-tailed probabilities for z values

z %ile p z %ile p z %ile p z %ile p

−3.00 0.13 .001 −2.72 0.33 .003 −2.44 0.73 .007 −2.16 1.54 .015
−2.99 0.14 .001 −2.71 0.34 .003 −2.43 0.75 .007 −2.15 1.58 .016
−2.98 0.14 .001 −2.70 0.35 .003 −2.42 0.78 .008 −2.14 1.62 .016
−2.97 0.15 .002 −2.69 0.36 .004 −2.41 0.80 .008 −2.13 1.66 .017
−2.96 0.15 .002 −2.68 0.37 .004 −2.40 0.82 .008 −2.12 1.70 .017
−2.95 0.16 .002 −2.67 0.38 .004 −2.39 0.84 .008 −2.11 1.74 .017
−2.94 0.16 .002 −2.66 0.39 .004 −2.38 0.87 .009 −2.10 1.79 .018
−2.93 0.17 .002 −2.65 0.40 .004 −2.37 0.89 .009 −2.09 1.83 .018
−2.92 0.18 .002 −2.64 0.41 .004 −2.36 0.91 .009 −2.08 1.88 .019
−2.91 0.18 .002 −2.63 0.43 .004 −2.35 0.94 .009 −2.07 1.92 .019
−2.90 0.19 .002 −2.62 0.44 .004 −2.34 0.96 .010 −2.06 1.97 .020
−2.89 0.19 .002 −2.61 0.45 .005 −2.33 0.99 .010 −2.05 2.02 .020
−2.88 0.20 .002 −2.60 0.47 .005 −2.32 1.02 .010 −2.04 2.07 .021
−2.87 0.21 .002 −2.59 0.48 .005 −2.31 1.04 .010 −2.03 2.12 .021
−2.86 0.21 .002 −2.58 0.49 .005 −2.30 1.07 .011 −2.02 2.17 .022
−2.85 0.22 .002 −2.57 0.51 .005 −2.29 1.10 .011 −2.01 2.22 .022
−2.84 0.23 .002 −2.56 0.52 .005 −2.28 1.13 .011 −2.00 2.28 .023
−2.83 0.23 .002 −2.55 0.54 .005 −2.27 1.16 .012 −1.99 2.33 .023
−2.82 0.24 .002 −2.54 0.55 .005 −2.26 1.19 .012 −1.98 2.39 .024
−2.81 0.25 .002 −2.53 0.57 .006 −2.25 1.22 .012 −1.97 2.44 .024
−2.80 0.26 .003 −2.52 0.59 .006 −2.24 1.25 .013 −1.96 2.50 .025
−2.79 0.26 .003 −2.51 0.60 .006 −2.23 1.29 .013 −1.95 2.56 .026
−2.78 0.27 .003 −2.50 0.62 .006 −2.22 1.32 .013 −1.94 2.62 .026
−2.77 0.28 .003 −2.49 0.64 .006 −2.21 1.36 .014 −1.93 2.68 .027
−2.76 0.29 .003 −2.48 0.66 .007 −2.20 1.39 .014 −1.92 2.74 .027
−2.75 0.30 .003 −2.47 0.68 .007 −2.19 1.43 .014 −1.91 2.81 .028
−2.74 0.31 .003 −2.46 0.69 .007 −2.18 1.46 .015 −1.90 2.87 .029
−2.73 0.32 .003 −2.45 0.71 .007 −2.17 1.50 .015 −1.89 2.94 .029


z %ile p z %ile p z %ile p z %ile p

−1.88 3.01 .030 −1.47 7.08 .071 −1.06 14.46 .145 −0.65 25.78 .258
−1.87 3.07 .031 −1.46 7.21 .072 −1.05 14.69 .147 −0.64 26.11 .261
−1.86 3.14 .031 −1.45 7.35 .073 −1.04 14.92 .149 −0.63 26.44 .264
−1.85 3.22 .032 −1.44 7.49 .075 −1.03 15.15 .152 −0.62 26.76 .268
−1.84 3.29 .033 −1.43 7.64 .076 −1.02 15.39 .154 −0.61 27.09 .271
−1.83 3.36 .034 −1.42 7.78 .078 −1.01 15.62 .156 −0.60 27.43 .274
−1.82 3.44 .034 −1.41 7.93 .079 −1.00 15.87 .159 −0.59 27.76 .278
−1.81 3.52 .035 −1.40 8.08 .081 −0.99 16.11 .161 −0.58 28.10 .281
−1.80 3.59 .036 −1.39 8.23 .082 −0.98 16.35 .164 −0.57 28.43 .284
−1.79 3.67 .037 −1.38 8.38 .084 −0.97 16.60 .166 −0.56 28.77 .288
−1.78 3.75 .038 −1.37 8.53 .085 −0.96 16.85 .169 −0.55 29.12 .291
−1.77 3.84 .038 −1.36 8.69 .087 −0.95 17.11 .171 −0.54 29.46 .295
−1.76 3.92 .039 −1.35 8.85 .088 −0.94 17.36 .174 −0.53 29.81 .298
−1.75 4.01 .040 −1.34 9.01 .090 −0.93 17.62 .176 −0.52 30.15 .302
−1.74 4.09 .041 −1.33 9.18 .092 −0.92 17.88 .179 −0.51 30.50 .305
−1.73 4.18 .042 −1.32 9.34 .093 −0.91 18.14 .181 −0.50 30.86 .309
−1.72 4.27 .043 −1.31 9.51 .095 −0.90 18.41 .184 −0.49 31.21 .312
−1.71 4.36 .044 −1.30 9.68 .097 −0.89 18.67 .187 −0.48 31.56 .316
−1.70 4.46 .045 −1.29 9.85 .098 −0.88 18.94 .189 −0.47 31.92 .319
−1.69 4.55 .046 −1.28 10.03 .100 −0.87 19.22 .192 −0.46 32.28 .323
−1.68 4.65 .047 −1.27 10.20 .102 −0.86 19.49 .195 −0.45 32.64 .326
−1.67 4.75 .048 −1.26 10.38 .104 −0.85 19.77 .198 −0.44 33.00 .330
−1.66 4.85 .049 −1.25 10.56 .106 −0.84 20.05 .201 −0.43 33.36 .334
−1.65 4.95 .050 −1.24 10.75 .108 −0.83 20.33 .203 −0.42 33.72 .337
−1.64 5.05 .051 −1.23 10.93 .109 −0.82 20.61 .206 −0.41 34.09 .341
−1.63 5.16 .052 −1.22 11.12 .111 −0.81 20.90 .209 −0.40 34.46 .345
−1.62 5.26 .053 −1.21 11.31 .113 −0.80 21.19 .212 −0.39 34.83 .348
−1.61 5.37 .054 −1.20 11.51 .115 −0.79 21.48 .215 −0.38 35.20 .352
−1.60 5.48 .055 −1.19 11.70 .117 −0.78 21.77 .218 −0.37 35.57 .356
−1.59 5.59 .056 −1.18 11.90 .119 −0.77 22.06 .221 −0.36 35.94 .359
−1.58 5.71 .057 −1.17 12.10 .121 −0.76 22.36 .224 −0.35 36.32 .363
−1.57 5.82 .058 −1.16 12.30 .123 −0.75 22.66 .227 −0.34 36.69 .367
−1.56 5.94 .059 −1.15 12.51 .125 −0.74 22.96 .230 −0.33 37.07 .371
−1.55 6.06 .061 −1.14 12.71 .127 −0.73 23.27 .233 −0.32 37.45 .375
−1.54 6.18 .062 −1.13 12.92 .129 −0.72 23.58 .236 −0.31 37.83 .378
−1.53 6.30 .063 −1.12 13.14 .131 −0.71 23.89 .239 −0.30 38.21 .382
−1.52 6.43 .064 −1.11 13.35 .134 −0.70 24.20 .242 −0.29 38.59 .386
−1.51 6.55 .066 −1.10 13.56 .136 −0.69 24.51 .245 −0.28 38.97 .390
−1.50 6.68 .067 −1.09 13.79 .138 −0.68 24.83 .248 −0.27 39.36 .394
−1.49 6.81 .068 −1.08 14.01 .140 −0.67 25.14 .251 −0.26 39.74 .397
−1.48 6.94 .069 −1.07 14.23 .142 −0.66 25.46 .255 −0.25 40.13 .401

z %ile p z %ile p z %ile p z %ile p

−0.24 40.52 .405 0.18 57.14 .429 0.60 72.57 .274 1.02 84.61 .154
−0.23 40.90 .409 0.19 57.53 .425 0.61 72.91 .271 1.03 84.85 .152
−0.22 41.29 .413 0.20 57.93 .421 0.62 73.24 .268 1.04 85.08 .149
−0.21 41.68 .417 0.21 58.32 .417 0.63 73.57 .264 1.05 85.31 .147
−0.20 42.07 .421 0.22 58.71 .413 0.64 73.89 .261 1.06 85.54 .145
−0.19 42.47 .425 0.23 59.10 .409 0.65 74.22 .258 1.07 85.77 .142
−0.18 42.86 .429 0.24 59.48 .405 0.66 74.54 .255 1.08 85.99 .140
−0.17 43.25 .433 0.25 59.87 .401 0.67 74.86 .251 1.09 86.21 .138
−0.16 43.64 .436 0.26 60.26 .397 0.68 75.17 .248 1.10 86.64 .134
−0.15 44.04 .440 0.27 60.64 .394 0.69 75.49 .245 1.11 86.65 .134
−0.14 44.43 .444 0.28 61.03 .390 0.70 75.80 .242 1.12 86.86 .131
−0.13 44.83 .448 0.29 61.41 .386 0.71 76.11 .239 1.13 87.08 .129
−0.12 45.22 .452 0.30 61.79 .382 0.72 76.42 .236 1.14 87.29 .127
−0.11 45.62 .456 0.31 62.17 .378 0.73 76.73 .233 1.15 87.49 .125
−0.10 46.02 .460 0.32 62.55 .375 0.74 77.04 .230 1.16 87.70 .123
−0.09 46.41 .464 0.33 62.93 .371 0.75 77.34 .227 1.17 87.90 .121
−0.08 46.81 .468 0.34 63.31 .367 0.76 77.64 .224 1.18 88.10 .119
−0.07 47.21 .472 0.35 63.68 .363 0.77 77.94 .221 1.19 88.30 .117
−0.06 47.60 .476 0.36 64.06 .359 0.78 78.23 .218 1.20 88.49 .115
−0.05 48.01 .480 0.37 64.43 .356 0.79 78.52 .215 1.21 88.69 .113
−0.04 48.40 .484 0.38 64.80 .352 0.80 78.81 .212 1.22 88.88 .111
−0.03 48.80 .488 0.39 65.17 .348 0.81 79.10 .209 1.23 89.07 .109
−0.02 49.20 .492 0.40 65.54 .345 0.82 79.39 .206 1.24 89.25 .108
−0.01 49.60 .496 0.41 65.91 .341 0.83 79.67 .203 1.25 89.44 .106
0.00 50.00 .500 0.42 66.28 .337 0.84 79.95 .201 1.26 89.62 .104
0.01 50.40 .496 0.43 66.64 .334 0.85 80.23 .198 1.27 89.80 .102
0.02 50.80 .492 0.44 67.00 .330 0.86 80.51 .195 1.28 89.97 .100
0.03 51.20 .488 0.45 67.36 .326 0.87 80.78 .192 1.29 90.15 .098
0.04 51.60 .484 0.46 67.72 .323 0.88 81.06 .189 1.30 90.32 .097
0.05 51.99 .480 0.47 68.08 .319 0.89 81.33 .187 1.31 90.49 .095
0.06 52.40 .476 0.48 68.44 .316 0.90 81.59 .184 1.32 90.66 .093
0.07 52.79 .472 0.49 68.79 .312 0.91 81.86 .181 1.33 90.82 .092
0.08 53.19 .468 0.50 69.15 .309 0.92 82.12 .179 1.34 90.99 .090
0.09 53.59 .464 0.51 69.50 .305 0.93 82.38 .176 1.35 91.15 .088
0.10 53.98 .460 0.52 69.85 .302 0.94 82.64 .174 1.36 91.31 .087
0.11 54.38 .456 0.53 70.19 .298 0.95 82.89 .171 1.37 91.47 .085
0.12 54.78 .452 0.54 70.54 .295 0.96 83.15 .169 1.38 91.62 .084
0.13 55.17 .448 0.55 70.88 .291 0.97 83.40 .166 1.39 91.77 .082
0.14 55.57 .444 0.56 71.23 .288 0.98 83.65 .164 1.40 91.92 .081
0.15 55.96 .440 0.57 71.57 .284 0.99 83.89 .161 1.41 92.07 .079
0.16 56.36 .436 0.58 71.90 .281 1.00 84.13 .159 1.42 92.22 .078
0.17 56.75 .433 0.59 72.24 .278 1.01 84.38 .156 1.43 92.36 .076


z %ile p z %ile p z %ile p z %ile p

1.44 92.51 .075 1.85 96.78 .032 2.26 98.81 .012 2.67 99.62 .004
1.45 92.65 .073 1.86 96.86 .031 2.27 98.84 .012 2.68 99.63 .004
1.46 92.79 .072 1.87 96.93 .031 2.28 98.87 .011 2.69 99.64 .004
1.47 92.92 .071 1.88 96.99 .030 2.29 98.90 .011 2.70 99.65 .003
1.48 93.06 .069 1.89 97.06 .029 2.30 98.93 .011 2.71 99.66 .003
1.49 93.19 .068 1.90 97.13 .029 2.31 98.96 .010 2.72 99.67 .003
1.50 93.32 .067 1.91 97.19 .028 2.32 98.98 .010 2.73 99.68 .003
1.51 93.45 .066 1.92 97.26 .027 2.33 99.01 .010 2.74 99.69 .003
1.52 93.57 .064 1.93 97.32 .027 2.34 99.04 .010 2.75 99.70 .003
1.53 93.70 .063 1.94 97.38 .026 2.35 99.06 .009 2.76 99.71 .003
1.54 93.82 .062 1.95 97.44 .026 2.36 99.09 .009 2.77 99.72 .003
1.55 93.94 .061 1.96 97.50 .025 2.37 99.11 .009 2.78 99.73 .003
1.56 94.06 .059 1.97 97.56 .024 2.38 99.13 .009 2.79 99.74 .003
1.57 94.18 .058 1.98 97.61 .024 2.39 99.16 .008 2.80 99.74 .003
1.58 94.29 .057 1.99 97.67 .023 2.40 99.18 .008 2.81 99.75 .002
1.59 94.41 .056 2.00 97.72 .023 2.41 99.20 .008 2.82 99.76 .002
1.60 94.52 .055 2.01 97.78 .022 2.42 99.22 .008 2.83 99.77 .002
1.61 94.63 .054 2.02 97.83 .022 2.43 99.25 .007 2.84 99.77 .002
1.62 94.74 .053 2.03 97.88 .021 2.44 99.27 .007 2.85 99.78 .002
1.63 94.84 .052 2.04 97.93 .021 2.45 99.29 .007 2.86 99.79 .002
1.64 94.95 .051 2.05 97.98 .020 2.46 99.31 .007 2.87 99.79 .002
1.65 95.05 .050 2.06 98.03 .020 2.47 99.32 .007 2.88 99.80 .002
1.66 95.15 .049 2.07 98.08 .019 2.48 99.34 .007 2.89 99.81 .002
1.67 95.25 .048 2.08 98.12 .019 2.49 99.36 .006 2.90 99.81 .002
1.68 95.35 .047 2.09 98.17 .018 2.50 99.38 .006 2.91 99.82 .002
1.69 95.45 .046 2.10 98.21 .018 2.51 99.40 .006 2.92 99.82 .002
1.70 95.54 .045 2.11 98.26 .017 2.52 99.41 .006 2.93 99.83 .002
1.71 95.64 .044 2.12 98.30 .017 2.53 99.43 .006 2.94 99.84 .002
1.72 95.73 .043 2.13 98.34 .017 2.54 99.45 .005 2.95 99.84 .002
1.73 95.82 .042 2.14 98.38 .016 2.55 99.46 .005 2.96 99.85 .002
1.74 95.91 .041 2.15 98.42 .016 2.56 99.48 .005 2.97 99.85 .002
1.75 95.99 .040 2.16 98.46 .015 2.57 99.49 .005 2.98 99.86 .001
1.76 96.08 .039 2.17 98.50 .015 2.58 99.51 .005 2.99 99.86 .001
1.77 96.16 .038 2.18 98.54 .015 2.59 99.52 .005 3.00 99.87 .001
1.78 96.25 .038 2.19 98.57 .014 2.60 99.53 .005
1.79 96.33 .037 2.20 98.61 .014 2.61 99.55 .005
1.80 96.41 .036 2.21 98.64 .014 2.62 99.56 .004
1.81 96.48 .035 2.22 98.68 .013 2.63 99.57 .004
1.82 96.56 .034 2.23 98.71 .013 2.64 99.59 .004
1.83 96.64 .034 2.24 98.75 .013 2.65 99.60 .004
1.84 96.71 .033 2.25 98.78 .012 2.66 99.61 .004

Table A2  Critical value table for t at α = .05

df        One-tailed    Two-tailed

1 6.31 12.71
2 2.92 4.30
3 2.35 3.18
4 2.13 2.78
5 2.01 2.57
6 1.94 2.45
7 1.89 2.36
8 1.86 2.31
9 1.83 2.26
10 1.81 2.23
11 1.80 2.20
12 1.78 2.18
13 1.77 2.16
14 1.76 2.14
15 1.75 2.13
16 1.75 2.12
17 1.74 2.11
18 1.73 2.10
19 1.73 2.09
20 1.72 2.09
21 1.72 2.08
22 1.72 2.07
23 1.71 2.07
24 1.71 2.06
25 1.71 2.06
26 1.71 2.06
27 1.70 2.05
28 1.70 2.05
29 1.70 2.04
30 1.70 2.04

Table A3  Critical values for F at α = .05

                        Numerator degrees of freedom (dfB or dfE)
dfW (denominator)       1         2         3         4         5
1 161.45 199.50 215.71 224.58 230.16
2 18.51 19.00 19.16 19.25 19.30
3 10.13 9.55 9.28 9.12 9.01
4 7.71 6.94 6.59 6.39 6.26
5 6.61 5.79 5.41 5.19 5.05
6 5.99 5.14 4.76 4.54 4.39
7 5.59 4.74 4.35 4.12 3.97
8 5.32 4.46 4.07 3.84 3.69
9 5.12 4.26 3.86 3.63 3.48
10 4.96 4.10 3.71 3.48 3.33
11 4.84 3.98 3.59 3.36 3.20
12 4.75 3.89 3.49 3.26 3.11


13 4.67 3.81 3.41 3.18 3.03
14 4.60 3.74 3.34 3.11 2.96
15 4.54 3.68 3.29 3.05 2.90
16 4.49 3.63 3.24 3.01 2.85
17 4.45 3.59 3.20 2.96 2.81
18 4.41 3.55 3.16 2.93 2.77
19 4.38 3.52 3.13 2.90 2.74
20 4.35 3.49 3.09 2.87 2.71
21 4.32 3.47 3.07 2.84 2.68
22 4.30 3.44 3.05 2.82 2.66
23 4.28 3.42 3.03 2.80 2.64
24 4.26 3.40 3.01 2.78 2.62
25 4.24 3.39 2.99 2.76 2.60
26 4.23 3.37 2.98 2.74 2.59
27 4.21 3.35 2.96 2.73 2.57
28 4.20 3.34 2.95 2.71 2.56
29 4.18 3.33 2.93 2.70 2.55
30 4.17 3.32 2.92 2.69 2.53

Table A4  Tukey HSD critical values for α = .05

                        Number of Groups (k)
dfW (error)        2       3       4       5       6       7       8       9       10

2 6.08 8.33 9.80 10.88 11.73 12.44 13.03 13.54 13.99


3 4.50 5.91 6.82 7.50 8.04 8.48 8.85 9.18 9.46
4 3.93 5.04 5.76 6.29 6.71 7.05 7.35 7.60 7.83
5 3.64 4.60 5.22 5.67 6.03 6.33 6.58 6.80 6.99
6 3.46 4.34 4.90 5.30 5.63 5.90 6.12 6.32 6.49
7 3.34 4.16 4.68 5.06 5.36 5.61 5.82 6.00 6.16
8 3.26 4.04 4.53 4.89 5.17 5.40 5.60 5.77 5.92
9 3.20 3.95 4.41 4.76 5.02 5.24 5.43 5.59 5.74
10 3.15 3.88 4.33 4.65 4.91 5.12 5.30 5.46 5.60
11 3.11 3.82 4.26 4.57 4.82 5.03 5.20 5.35 5.49
12 3.08 3.77 4.20 4.51 4.75 4.95 5.12 5.27 5.39


13 3.05 3.73 4.15 4.45 4.69 4.88 5.05 5.19 5.32
14 3.03 3.70 4.11 4.41 4.64 4.83 4.99 5.13 5.25
15 3.01 3.67 4.08 4.37 4.59 4.78 4.94 5.08 5.20
16 3.00 3.65 4.05 4.33 4.56 4.74 4.90 5.03 5.15
17 2.98 3.63 4.02 4.30 4.52 4.70 4.86 4.99 5.11
18 2.97 3.61 4.00 4.23 4.49 4.67 4.82 4.96 5.07
19 2.96 3.59 3.98 4.25 4.47 4.65 4.79 4.92 5.04
20 2.95 3.58 3.96 4.23 4.45 4.62 4.77 4.89 5.01
21 2.94 3.56 3.94 4.21 4.42 4.60 4.74 4.87 4.98
22 2.93 3.55 3.93 4.20 4.41 4.58 4.72 4.85 4.96
23 2.93 3.54 3.91 4.18 4.39 4.56 4.70 4.83 4.94
24 2.92 3.53 3.90 4.17 4.37 4.54 4.68 4.81 4.92
25 2.91 3.52 3.89 4.15 4.36 4.53 4.67 4.79 4.90
26 2.91 3.51 3.88 4.14 4.35 4.51 4.65 4.77 4.88
27 2.90 3.51 3.87 4.13 4.33 4.50 4.64 4.76 4.86
28 2.90 3.50 3.86 4.12 4.32 4.49 4.62 4.74 4.85
29 2.89 3.49 3.85 4.11 4.31 4.47 4.61 4.73 4.84
30 2.89 3.49 3.85 4.10 4.30 4.46 4.60 4.72 4.82
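The studentized-range values underlying Tukey's HSD can likewise be computed rather than looked up. A minimal sketch, assuming SciPy version 1.7 or later (which added the `studentized_range` distribution); k and dfW are arbitrary examples:

```python
from scipy import stats

alpha, k, df_error = .05, 3, 20  # three groups, 20 error degrees of freedom
q_critical = stats.studentized_range.ppf(1 - alpha, k, df_error)
print(f"Tukey HSD critical value (k = {k}, dfW = {df_error}): {q_critical:.2f}")
# prints: Tukey HSD critical value (k = 3, dfW = 20): 3.58
```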

B1 STATISTICAL NOTATION AND FORMULAS

X: Score
$\bar{X}$: Group mean, also written as M
$\bar{\bar{X}}$: Grand mean
s: Standard deviation
SD: Standard deviation
$s^2$: Variance
Σ: Sum of
N: Sample size
$H_0$: Null hypothesis
$H_1$: Alternative hypothesis

Descriptive statistics and standard scores

$$\bar{X} = \frac{\sum X}{N}$$

$$\text{Range} = X_{\text{highest}} - X_{\text{lowest}}$$

$$s^2 = \frac{\sum (X - \bar{X})^2}{N - 1}$$

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{N - 1}}$$

$$z = \frac{X - \bar{X}}{s}$$
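The sketch below mirrors these formulas in Python, assuming NumPy is installed and using a small set of hypothetical scores; it is intended only as a cross-check on hand calculation, not as part of the jamovi workflow the book teaches.

```python
import numpy as np

scores = np.array([12, 15, 14, 10, 18, 16, 13])  # hypothetical example data

mean = scores.sum() / scores.size                            # X-bar = (sum of X) / N
score_range = scores.max() - scores.min()                    # Range = X_highest - X_lowest
variance = ((scores - mean) ** 2).sum() / (scores.size - 1)  # s^2, same as np.var(scores, ddof=1)
sd = variance ** 0.5                                         # s, same as np.std(scores, ddof=1)
z_scores = (scores - mean) / sd                              # a z score for each raw score

print(mean, score_range, variance, sd)
print(z_scores)
```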

Probabilities

$$p(A) = \frac{A}{N}$$

$$p(AB) = p(A) \times p(B) \quad \text{(for independent events)}$$

$$p(A \text{ or } B) = p(A) + p(B) \quad \text{(for mutually exclusive events)}$$
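A small worked example of the three rules, using hypothetical counts:

```python
# Hypothetical counts: 12 of the 60 students in a sample are seniors, 20 are first-years
p_senior = 12 / 60                          # p(A) = A / N
p_two_seniors = p_senior * p_senior         # p(AB) = p(A)p(B); assumes independent draws (with replacement)
p_senior_or_first_year = 12 / 60 + 20 / 60  # p(A or B) = p(A) + p(B); the categories are mutually exclusive

print(p_senior, p_two_seniors, p_senior_or_first_year)
```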

One-sample tests

$$Z = \frac{M - \mu}{\sigma / \sqrt{N}}$$

$$d = \frac{M - \mu}{\sigma}$$

$$t = \frac{M - \mu}{s / \sqrt{N}}$$

$$d = \frac{M - \mu}{s}$$
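A minimal sketch of the one-sample t-test and its effect size, assuming NumPy and SciPy and using hypothetical scores; the same structure gives the Z test when the population standard deviation is known and substituted for s.

```python
import numpy as np
from scipy import stats

sample = np.array([102, 98, 110, 95, 105, 99, 104, 108])  # hypothetical scores
mu = 100                                                   # hypothesized population mean

M, s, N = sample.mean(), sample.std(ddof=1), sample.size
t_by_hand = (M - mu) / (s / np.sqrt(N))  # t = (M - mu) / (s / sqrt(N))
d = (M - mu) / s                         # Cohen's d using the sample standard deviation

t_scipy, p_two_tailed = stats.ttest_1samp(sample, mu)  # should match t_by_hand
print(t_by_hand, t_scipy, p_two_tailed, d)
```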

Independent samples t-test

$$t = \frac{\bar{X} - \bar{Y}}{s_{\text{diff}}}$$

$$s_{\text{diff}} = \sqrt{s^2_{\text{diff}}}$$

$$s^2_{\text{diff}} = s^2_{M_X} + s^2_{M_Y}$$

$$s^2_{M_X} = \frac{s^2_{\text{pooled}}}{N_X} \qquad s^2_{M_Y} = \frac{s^2_{\text{pooled}}}{N_Y}$$

$$s^2_{\text{pooled}} = \left(\frac{df_X}{df_{\text{total}}}\right) s^2_X + \left(\frac{df_Y}{df_{\text{total}}}\right) s^2_Y$$

$$d = \frac{\bar{X} - \bar{Y}}{s_{\text{pooled}}}$$

$$\omega^2 = \frac{t^2 - 1}{t^2 + N_X + N_Y - 1}$$
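These formulas can be verified numerically. The sketch below assumes NumPy and SciPy and uses hypothetical groups; SciPy's pooled-variance t-test should agree with the hand calculation.

```python
import numpy as np
from scipy import stats

group_x = np.array([23, 25, 28, 22, 26, 24])  # hypothetical scores, group X
group_y = np.array([19, 21, 24, 18, 22, 20])  # hypothetical scores, group Y

n_x, n_y = group_x.size, group_y.size
df_x, df_y = n_x - 1, n_y - 1
df_total = df_x + df_y

# Pooled variance, weighting each group's variance by its degrees of freedom
s2_pooled = (df_x / df_total) * group_x.var(ddof=1) + (df_y / df_total) * group_y.var(ddof=1)

# Standard error of the difference between means, then the t statistic
s_diff = np.sqrt(s2_pooled / n_x + s2_pooled / n_y)
t_by_hand = (group_x.mean() - group_y.mean()) / s_diff

# SciPy's pooled-variance t-test should reproduce the same t
t_scipy, p_value = stats.ttest_ind(group_x, group_y, equal_var=True)

# Effect sizes from the formulas above
d = (group_x.mean() - group_y.mean()) / np.sqrt(s2_pooled)
omega_squared = (t_by_hand**2 - 1) / (t_by_hand**2 + n_x + n_y - 1)

print(t_by_hand, t_scipy, p_value, d, omega_squared)
```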

One-way ANOVA (source table: SS, df, MS, F)

Between:  $SS_B = \sum(\bar{X} - \bar{\bar{X}})^2$,  $df_B = k - 1$,  $MS_B = SS_B / df_B$,  $F = MS_B / MS_W$
Within:   $SS_W = \sum(X - \bar{X})^2$,  $df_W = n - k$,  $MS_W = SS_W / df_W$
Total:    $SS_T = \sum(X - \bar{\bar{X}})^2$,  $df_T = n - 1$

$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W}$$
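A sketch of the source-table calculations, assuming NumPy and SciPy and using hypothetical groups. Note that summing each score's squared deviation of its group mean from the grand mean is the same as weighting each group's squared deviation by its group size, which is how the between-groups term is written below.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for k = 3 groups
groups = [np.array([4, 5, 6, 5]), np.array([7, 8, 6, 7]), np.array([9, 8, 10, 9])]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k, n = len(groups), all_scores.size

# Sums of squares following the source table
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()

df_between, df_within = k - 1, n - k
ms_between, ms_within = ss_between / df_between, ss_within / df_within
f_by_hand = ms_between / ms_within

# SciPy's one-way ANOVA should reproduce the same F
f_scipy, p_value = stats.f_oneway(*groups)

omega_squared = (ss_between - df_between * ms_within) / (ss_total + ms_within)
print(f_by_hand, f_scipy, p_value, omega_squared)
```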

Factorial ANOVA (source table: SS, df, MS, F)

IV1:      $SS_{IV1} = \sum(\bar{X}_{IV1} - \bar{\bar{X}})^2$,  $df_{IV1} = k_{IV1} - 1$,  $MS_{IV1} = SS_{IV1} / df_{IV1}$,  $F = MS_{IV1} / MS_{within}$
IV2:      $SS_{IV2} = \sum(\bar{X}_{IV2} - \bar{\bar{X}})^2$,  $df_{IV2} = k_{IV2} - 1$,  $MS_{IV2} = SS_{IV2} / df_{IV2}$,  $F = MS_{IV2} / MS_{within}$
IV1*IV2:  $SS_{IV1*IV2} = SS_{total} - SS_{IV1} - SS_{IV2} - SS_{within}$,  $df_{IV1*IV2} = (df_{IV1})(df_{IV2})$,  $MS_{IV1*IV2} = SS_{IV1*IV2} / df_{IV1*IV2}$,  $F = MS_{IV1*IV2} / MS_{within}$
Within:   $SS_{within} = \sum(X - \bar{X}_{cell})^2$,  $df_{within} = N_{total} - (k_{IV1})(k_{IV2})$,  $MS_{within} = SS_{within} / df_{within}$
Total:    $SS_{total} = \sum(X - \bar{\bar{X}})^2$,  $df_{total} = N_{total} - 1$

$$\omega^2 = \frac{SS_E - (df_E)(MS_W)}{SS_T + MS_W}$$
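One way to reproduce a factorial source table in Python is through statsmodels (an assumption on our part; the book itself does this in jamovi). The data frame, factor names, and scores below are hypothetical, and the design is balanced so the effect and residual sums of squares add up to the total.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical balanced 2 x 2 design with four scores per cell
data = pd.DataFrame({
    "score": [4, 5, 6, 5, 7, 8, 6, 7, 9, 8, 10, 9, 6, 7, 5, 6],
    "iv1":   ["a"] * 8 + ["b"] * 8,
    "iv2":   (["x"] * 4 + ["y"] * 4) * 2,
})

# Fit the factorial model and request the ANOVA source table (SS, df, F, p)
model = ols("score ~ C(iv1) * C(iv2)", data=data).fit()
table = sm.stats.anova_lm(model, typ=2)
print(table)

# Omega squared for one effect, following the formula above
ms_within = table.loc["Residual", "sum_sq"] / table.loc["Residual", "df"]
ss_total = table["sum_sq"].sum()
ss_iv1, df_iv1 = table.loc["C(iv1)", "sum_sq"], table.loc["C(iv1)", "df"]
omega_sq_iv1 = (ss_iv1 - df_iv1 * ms_within) / (ss_total + ms_within)
print(omega_sq_iv1)
```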

Paired samples t-test

$$t = \frac{\bar{D}}{SE_D}$$

$$\bar{D} = \frac{\sum D}{N}$$

$$SE_D = \sqrt{\frac{SS_D}{N(N - 1)}}$$

$$SS_D = \sum D^2 - \frac{(\sum D)^2}{N}$$

$$\omega^2 = \frac{t^2 - 1}{t^2 + n - 1}$$
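A minimal sketch of the paired-samples calculations, assuming NumPy and SciPy and using hypothetical pretest and posttest scores; SciPy's paired t-test should match the hand-built statistic.

```python
import numpy as np
from scipy import stats

pre = np.array([10, 12, 9, 14, 11, 13])    # hypothetical pretest scores
post = np.array([12, 14, 10, 17, 13, 15])  # hypothetical posttest scores

D = post - pre
n = D.size

# Build the t statistic from the formulas above
d_bar = D.sum() / n
ss_d = (D ** 2).sum() - (D.sum() ** 2) / n
se_d = np.sqrt(ss_d / (n * (n - 1)))
t_by_hand = d_bar / se_d

# SciPy's paired-samples t-test should agree
t_scipy, p_value = stats.ttest_rel(post, pre)

omega_squared = (t_by_hand**2 - 1) / (t_by_hand**2 + n - 1)
print(t_by_hand, t_scipy, p_value, omega_squared)
```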

Within-subjects ANOVA (source table: SS, df, MS, F)

Between:  $SS_{between} = \sum(\bar{X}_k - \bar{\bar{X}})^2$,  $df_{between} = k - 1$,  $MS_{between} = SS_{between} / df_{between}$,  $F = MS_{between} / MS_{within}$
Subjects: $SS_{subjects} = \sum(\bar{X}_{subject} - \bar{\bar{X}})^2$,  $df_{subjects} = n_{subjects} - 1$,  $MS_{subjects} = SS_{subjects} / df_{subjects}$,  $F = MS_{subjects} / MS_{within}$
Within:   $SS_{within} = SS_{total} - SS_{between} - SS_{subjects}$,  $df_{within} = (df_{between})(df_{subjects})$,  $MS_{within} = SS_{within} / df_{within}$
Total:    $SS_{total} = \sum(X - \bar{\bar{X}})^2$,  $df_{total} = n_{total} - 1$

$$\omega^2 = \frac{SS_{between} - (df_{between})(MS_{within})}{SS_{total} + MS_{within}}$$

$$\eta^2 = \frac{SS_{between}}{SS_{between} + SS_{within}}$$
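The within-subjects decomposition can be reproduced with NumPy alone. The sketch below assumes a complete (no missing data) subjects-by-conditions matrix of hypothetical scores and follows the source table line by line.

```python
import numpy as np

# Hypothetical scores: rows = subjects, columns = k repeated conditions
scores = np.array([
    [4, 6, 8],
    [5, 7, 9],
    [3, 6, 7],
    [5, 8, 9],
    [4, 7, 8],
], dtype=float)

n_subjects, k = scores.shape
grand_mean = scores.mean()

# Sums of squares following the source table
ss_between = n_subjects * ((scores.mean(axis=0) - grand_mean) ** 2).sum()  # condition effect
ss_subjects = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()          # individual differences
ss_total = ((scores - grand_mean) ** 2).sum()
ss_within = ss_total - ss_between - ss_subjects                            # error term

df_between, df_subjects = k - 1, n_subjects - 1
df_within = df_between * df_subjects

ms_between, ms_within = ss_between / df_between, ss_within / df_within
f_statistic = ms_between / ms_within

omega_squared = (ss_between - df_between * ms_within) / (ss_total + ms_within)
eta_squared = ss_between / (ss_between + ss_within)
print(f_statistic, omega_squared, eta_squared)
```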
References

American Educational Research Association, American Psychological Association, and National


Council on Measurement in Education. (2014). Standards for educational and psychological test-
ing. Washington, DC: Authors.
American Psychological Association. (2010). Publication manual of the American Psychological
Association (6th ed.). Washington, DC: Author.
Asgari, S., & Carter, F. (2016). Peer mentors can improve academic performance: A quasi-experi-
mental study of peer mentorship in introductory courses. Teaching of Psychology, 43(2), 131–135.
https://doi.org/10.1177/0098628316636288
Bhattacharya, K. (2017). Fundamentals of qualitative research. New York, NY: Routledge.
Bonilla-Silva, E., & Zuberi, T. (2008). Toward a definition of white logic and white methods. In E.
Bonilla-Silva & T. Zuberi (Eds.), White logic, white methods: Racism and methodology (pp. 3–29).
Lanham, MD: Rowman & Littlefield.
Borg, W. R., & Gall, M. D. (1979). Educational research: An introduction (3rd ed.). New York,
NY: Longman.
Bossaert, G., de Boer, A. A., Frostad, P., Pijl, S. J., & Petry, K. (2015). Social participation of students
with special educational needs in different educational systems. Irish Educational Studies, 34(1),
43–54. https://doi.org/10.1080/03323315.2015.1010703
Centers for Disease Control and Prevention. (n.d.). U.S. Public Health Service syphilis study at
Tuskegee. https://www.cdc.gov/tuskegee/
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York, NY: Routledge.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.
Creswell, J. W., & Poth, C. N. (2017). Qualitative inquiry and research design: Choosing among five
approaches (4th ed.). Los Angeles, CA: SAGE.
Crotty, M. (1998). The foundations of social research: Meaning and perspective in the research pro-
cess. Los Angeles, CA: SAGE.
Davidson, I. J. (2018). The ouroboros of psychological methodology: The case of effect sizes (mechan-
ical objectivity vs. expertise). Review of General Psychology. https://doi.org/10.1037/gpr0000154
DeCuir, J. T., & Dixson, A. D. (2004). “So when it comes out, they aren’t surprised that it is there”:
Using critical race theory as a tool of analysis of race and racism in education. Educational
Researcher, 33(5), 26–31.
Delucci, M. (2014). Measuring student learning in social statistics: A pretest-posttest study of knowl-
edge gain. Teaching Sociology, 42(3), 231–239. https://doi.org/10.1177/0092055X14527909
Denzin, N. K., & Lincoln, Y. S. (2012). The landscape of qualitative research (4th ed.). Los Angeles,
CA: SAGE.
DeVellis, R. F. (2016). Scale development: Theory and applications (4th ed.). Thousand Oaks, CA:
SAGE.
Felver, J. C., Morton, M. L., & Clawson, A. J. (2018). Mindfulness-based stress reduction reduces
psychological distress in college students. College Student Journal, 52(3), 291–298.


Fischer, C., Fishman, B., Levy, A., Dede, C., Lawrenze, F., Jia, Y., Kook, K., & McCoy, A. (2016).
When do students in low-SES schools perform better-than-expected on high-stakes tests? Analyzing
school, teacher, teaching, and professional development. Urban Education. Advance online publi-
cation. https://doi.org/10.1177/0042085916668953
Giroux, H. A. (2011). On critical pedagogy. New York, NY: Bloomsbury.
Glantz, S. A., Slinker, B. K., & Neilands, T. B. (2016). Primer on regression and analysis of variance
(3rd ed.). New York, NY: McGraw-Hill.
Goodman-Scott, E., Sink, C. A., Cholewa, B. E., & Burgess, M. (2018). An ecological view of school
counselor ratios and student academic outcomes: A national investigation. Journal of Counseling &
Development, 96(4), 388–398. https://doi.org/10.1002/jcad.12221
Guba, E. G., & Lincoln, Y. S. (1994). Competing paradigms in qualitative research. In N. Denzin & Y.
Lincoln (Eds.), Handbook of qualitative research (1st ed.). Thousand Oaks, CA: SAGE.
Hagen, K. S. (2005). Bad blood: The Tuskegee syphilis study and legacy recruitment for experimental
AIDS vaccines. New Directions for Adult & Continuing Education, 2005(105), 31–41. https://doi.
org/10.1002/ace.167
Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure in American
life. New York, NY: Free Press.
Institute for Education Sciences. (2003, December). Identifying and implementing educational
practices supported by rigorous evidence: A user friendly guide. National Center for Education
Evaluation and Regional Assistance. https://ies.ed.gov/ncee/pubs/evidence_based/randomized.asp
Kanamori, Y., Harrell-Williams, L. M., Xu, Y. J., & Ovrebo, E. (2019). Transgender affect misattribu-
tion procedure (transgender AMP): Development and initial evaluation of performance of a measure
of implicit prejudice. Psychology of Sexual Orientation and Gender Diversity. Online first publica-
tion. https://doi.org/10.1037/sgd0000343
Kennedy, B. R., Mathis, C. C., & Woods, A. K. (2007). African Americans and their distrust of the
health care system: Healthcare for diverse populations. Journal of Cultural Diversity, 14(2), 56–60.
Keppel, G., & Wickens, T. D. (2004). Design and analysis: A researcher’s handbook (4th ed.). New
York, NY: Pearson.
Kim, A. S., Choi, S., & Park, S. (2018). Heterogeneity in first-generation college students influencing
academic success and adjustment to higher education. Social Science Journal. Advance online pub-
lication. https://doi.org/10.1016/j.soscij.2018.12.002
Kincheloe, J. L., Steinberg, S. R., & Gresson, A. D. (1997). Measured lies: The bell curve examined.
New York, NY: St. Martins.
Koonce, J. B. (2018). Critical race theory and caring as channels for transcending borders between
an African American professor and her Latina/o students. International Journal of Multicultural
Education, 20(2), 101–116.
Lachner, A., Ly, K., & Nückles, M. (2018). Providing written or oral explanations? Differential effects
of the modality of explaining on students’ conceptual learning and transfer. Journal of Experimental
Education, 86(3), 344–361. https://doi.org/10.1080/00220973.2017.1363691
Lather, P. (2006). Paradigm proliferation as a good thing to think with: Teaching research in education
as a wild profusion. International Journal of Qualitative Studies in Education, 19(1), 35–37. https://
doi.org/10.1080/09518390500450144
Leonardo, Z., & Grubb, W. N. (2018). Education and racism: A primer on issues and dilemmas. New
York, NY: Routledge.
Mills, G. E., & Gay, L. R. (2016). Educational research: Competencies for analysis and applications
(12th ed.). Upper Saddle River, NJ: Prentice Hall.
National Center for Education Statistics. (2018). Digest of education statistics. https://nces.ed.gov/
programs/digest/d18/tables/dt18_105.30.asp
Nolan, K. (2014). Neoliberal common sense and race-neutral discourses: A critique of “evidence-based”
policy-making in school policing. Discourse: Studies in the Cultural Politics of Education, 36(6),
894–907. https://doi.org/10.1080/01596306.2014.905457

Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York, NY: Wadsworth.
Perez, E. R., Schanding, G. T., & Dao, T. K. (2013). Educators’ perceptions in addressing bullying of
LGBTQ/gender nonconforming youth. Journal of School Violence, 12(1), 64–79. https://doi.org/10
.1080/15388220.2012.731663
Reed, S. J., & Miller, R. L. (2016). Thriving and adapting: Resilience, sense of community, and syn-
demics among young black gay and bisexual men. American Journal of Community Psychology,
57(1–2), 129–143. https://doi.org/10.1002/ajcp.12028
Richardson, T. Q. (1995). The window dressing behind The Bell Curve. School Psychology Review,
24(1), 42–44.
Shannonhouse, L., Lin, Y. D., Shaw, K., Wanna, R., & Porter, M. (2017). Suicide intervention train-
ing for college staff: Program evaluation and intervention skill measurement. Journal of American
College Health, 65(7), 450–456. https://doi.org/10.1080/07448481.2017.1341893
Shultz, K. S., Whitney, D. J., & Zickar, M. J. (2013). Measurement theory in action (2nd ed.). New
York, NY: Routledge.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African
Americans. Journal of Personality and Social Psychology, 69(5), 797–811.
Strunk, K. K. (in press). A critical theory approach to LGBTQ studies in quantitative methods courses.
In N. M. Rodriguez (Ed.), Teaching LGBTQ+ studies: Theoretical perspectives. New York, NY:
Palgrave.
Strunk, K. K., & Bailey, L. E. (2015). The difference one word makes: Imagining sexual orientation in
graduate school application essays. Psychology of Sexual Orientation and Gender Diversity, 2(4),
456–462. https://doi.org/10.1037/sgd0000136
Strunk, K. K., & Hoover, P. D. (2019). Quantitative methods for social justice and equity: Theoretical
and practical considerations. In K. K. Strunk & L. A. Locke (Eds.), Research methods for social
justice and equity in education (pp. 191–201). New York, NY: Palgrave.
Strunk, K. K., & Locke, L. A. (Eds.) (2019). Research methods for social justice and equity in educa-
tion. New York, NY: Palgrave.
Strunk, K. K., & Mwavita, M. (2020). Design and analysis in educational research: ANOVA designs
in SPSS. New York, NY: Routledge.
Teranishi, R. T. (2007). Race, ethnicity, and higher education policy: The use of critical quantitative
research. New Directions for Institutional Research, 2007(133), 37–49. https://doi.org/10.1002/
ir.203
Thompson, B. (Ed.). (2002). Score reliability: Contemporary thinking on reliability issues. Thousand
Oaks, CA: SAGE.
Thorndike, R. M., & Thorndike-Christ, T. (2010). Measurement and evaluation in psychology and
education (8th ed.). Boston, MA: Pearson.
U.S. Department of Health and Human Services. (n.d.). The Belmont report. Office for Human
Research Protections. https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html
Ungar, M., & Liebenberg, L. (2011). Assessing resilience across cultures using mixed methods:
Construction of the Child and Youth Resilience Measure. Journal of Mixed Methods Research, 5(2),
126–149.
Usher, E. L. (2018). Acknowledging the whiteness of motivation research: Seeking cultural relevance.
Educational Psychologist, 53(2), 131–144. https://doi.org/10.1080/00461520.2018.1442220
Valencia, R. R., & Suzuki, L. A. (2001). Intelligence testing and minority students: Foundations, per-
formance factors, and assessment issues. Thousand Oaks, CA: Sage.
Vishnumolakala, V. R., Southam, D. C., Treagust, D. F., Mocerino, M., & Qureshi, S. (2017). Students’
attitudes, self-efficacy, and experiences in a modified process-oriented guided inquiry learning
undergraduate chemistry classroom. Chemistry Education Research and Practice, 18(2), 340–352.
https://doi.org/10.1039/C6RP00233A
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and
purposes. American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
INDEX

Page numbers followed by n indicate notes; page numbers in bold indicate tables.

alpha level (for probability/significance) 33, 70, 84, 116, 155–156, 158–159, 161, 161, 166, 180, 185, 193, 195,
146–149, 151, 154n1, 180, 207–208, 260 199, 206–207, 209, 213, 215, 224, 232, 235, 239, 241,
alternate forms reliability 32; see also reliability 243, 256, 260
alternative hypothesis or research hypothesis 59, 62, 69, descriptive statistics 52–53, 100, 139, 142, 142, 143,
74, 89, 94, 131 149, 153, 154, 172, 175, 178, 178, 182, 182, 188, 188,
analysis of variance (ANOVA) 112–121, 124–125, 132, 203, 223, 229, 236, 253, 258, 262
134, 135, 137–141, 145–154n1, 155–162, 166–168, disordinal interaction 155, 158, 167, 172, 181–182
172–173, 175–181, 184–186, 213–214, 216–217, 221,
223, 225, 228–233, 237, 239–241, 243–245, 247–249, effect size (including eta squared, Cohen’s d, omega
251–253, 255, 257–258, 260–261 squared) 70, 73, 76–77, 79, 83, 94–97, 100, 102,
a priori comparisons 113, 126, 130–131, 132–133, 107–108, 111, 113, 126, 136, 138, 141, 143, 148, 152,
140–143 154n2, 155, 167, 170, 172, 175–176, 181, 185, 193,
198, 201–203, 207, 210–213, 221–222, 226, 228–229,
Belmont report 3, 15–16 233, 236, 239, 244, 246, 251, 261
Bonferroni adjustment 107–108, 116, 128–129, 137, empirical research 11, 269
146–149, 207–208, 223, 229, 235–236, 249 epistemology (including positivist, post-­positivism,
post-­modernism, interpretivism, constructivism,
causality 30–31, 61, 243 etc.) 3, 8, 10–13, 20, 88, 269, 271–272
Central limit theorem 59, 67–68 equity 4, 265, 267–270, 272–274
central tendency 37–38, 40, 44, 50, 52, 54 error 11, 47–48, 52, 59, 70, 76–77, 79, 84, 87, 89–91, 93,
Common Federal Rule 3, 16, 19–20 95, 101, 107–108, 110, 113–116, 118–120, 125–129,
comparisons 3, 54, 62, 89, 110, 112–114, 116, 127–134, 137, 138, 141, 146–147, 149, 160, 170–173, 176,
138–141, 166, 173, 175, 185, 213, 222–223, 227–229, 196–198, 201, 207–209, 216, 222–223, 225, 227, 237,
233, 236, 244, 251, 258, 268, 270 243, 248–249, 251, 253n1, 257, 260, 269, 272
confidence interval 99, 101, 195–196, 248 estimate 32, 35, 37–41, 50, 52, 54, 73, 76–77, 79, 95–97,
confounding variable 59–61, 118 120, 126, 127–129, 148, 154n2, 167, 170, 172, 198,
construct 7, 29, 32, 34, 65, 194, 273 221–222, 226, 228
counterbalancing 196, 207, 209, 215, 233, 235, 261 experimental design 117, 240, 243
external validity 84
degrees of freedom 77–78, 84, 89, 91–92, 94, 101,
114–115, 120, 124–125, 132–133, 138, 166, 198, 202, factor analysis 29, 34, 151
216–217, 220, 228–229 familywise error 107–108, 113, 115–116, 147, 149,
dependent variable 60–61, 75, 78, 83, 86–88, 96, 99, 207–208
107, 110, 113, 118, 120, 135, 141, 146–147, 151, F distribution 68, 88, 113–115, 119–120, 124–126


generalizability 11, 21, 25–26, 75, 84, 107, 195, 242 median 37–40, 46–47, 52, 114
general linear model 114, 118, 169, 174, 269 mixed design ANOVA 221, 237, 239–241, 243–245,
Greenhouse-­Geisser correction 216, 226, 228, 233, 248–249, 251–253, 257–258
235–236 mode 37–38, 40, 46, 52, 114
multivariate 60, 146
histogram 45, 47–48, 53–54, 67
homogeneity of variance 26, 83, 88–89, 99–100, 103n1, nesting 87, 147, 151, 195, 207, 209, 215, 232, 235, 242,
107, 111, 113, 119, 138, 147, 151, 155, 160, 173, 176, 257, 260
181, 185, 187, 216, 239, 243, 247, 251, 257, 261 nonparametric 118
hypothesis 8–9, 55, 57, 59–71, 73–74, 76–78, 84, 88–89, normality 26, 37, 46–48, 50, 52, 75, 83, 87, 97, 107, 111,
94, 99, 101, 106–107, 115–116, 119, 125–127, 131, 113, 118, 139, 141, 147, 151, 155, 159, 180, 185, 193,
133, 138, 140, 160, 166–167, 198, 201–202, 211n2, 195, 199–200, 207, 209, 213, 215, 224, 229, 232, 235,
216, 220–221, 228, 232, 243 239, 241, 245, 251, 257, 260
null hypothesis 55, 57, 59, 62, 68–71, 73–74, 76–78, 84,
independence of observations 87, 107, 118, 242 88–89, 94, 103n1, 115–116, 119, 125, 129, 131, 133,
independent variable 59–61, 96, 99, 115, 135–136, 147, 138, 160, 166–167, 198, 216, 220–221, 228, 243
155–163, 166, 169–170, 173, 178, 181, 185, 193–194,
196, 199, 221, 226–227, 239–240, 243–244, 248, observational 31, 117
253n1, 256, 260 omnibus test 116, 125–126, 132, 138, 171, 229, 232, 251
informed consent 15–18 one-­tailed test (incl. directional hypothesis) 74, 76, 78,
Institutional Review Board (IRB) 19–20 83, 94, 99, 101, 110–111, 113, 125, 198, 211n2
interaction effects 244 one-­way ANOVA 112–118, 120, 126, 134, 135,
interpretivism 11–12 140–141, 145–150, 152, 155, 158–162, 167, 173, 175,
213, 217, 223, 239, 244
Kolmogorov-­Smirnov (KS) test 48 order effects 196, 199, 215, 242
kurtosis (including leptokurtosis, platykurtosis, and ordinal interaction 158, 240, 249, 252, 261–262
mesokurtosis) 37, 46–48, 52, 75, 87, 107, 111, 118, orthogonality 131
141, 147, 151, 159, 176, 180, 195, 200–201, 207, 209, outliers 39–43
215, 224–225, 257, 260
pairwise comparisons 116, 127, 130, 173, 213, 222–223,
Levene’s test 83, 88–89, 100, 103n1, 111, 119, 120, 136, 227–229, 233, 236, 244
137, 138, 141, 147, 151, 160, 170, 172, 176, 216, 226, population 5, 7, 15–16, 21–27, 30, 55, 63–69, 71,
243, 247 73–79, 88, 101, 103n1, 119, 160, 195, 202, 215,
longitudinal 25, 30, 196, 207, 209, 214–215, 233, 235, 222, 242
242–243, 261 post-­hoc test 113, 127–130, 134, 137, 138–143,
148–149, 153, 167, 175, 186–187, 222–223, 239, 244,
main effects 155, 160–161, 166–168, 172, 175–177, 183, 249, 252, 258, 261–262
186–188, 248, 251–252, 257–258 practice effects 32, 194, 196, 199, 215, 242
Mauchly’s test 216, 228, 235, 243, 246, 254n1 probability 22, 30, 59, 62–71, 73, 76, 84, 88, 101, 111,
mean 24–25, 28–30, 34, 37–44, 46–47, 49–50, 52, 114, 138, 146, 151, 173, 202, 211n2, 229
61–62, 64, 66–70, 74–79, 88–91, 93, 95, 97, 101, 114,
120–123, 127, 133, 134, 138–140, 146, 150–151, random assignment 21, 30–31, 60, 84–86, 88, 106–107,
155–160, 162–165, 167–168, 170, 172, 175, 178, 182, 109–110, 117, 119, 147, 154n3, 155, 160, 181, 185,
196–197, 199, 201–203, 210n1, 217–219, 240, 244, 193, 195, 207, 215, 217, 219, 233, 235, 239, 242–243,
248–249, 253, 262–263 248, 257, 261, 269
mean square (MS) 120, 124, 132–133, 137, 148, 152, random sampling or random selection 21–22, 30, 65,
161, 162, 162, 166, 168, 181, 185, 217, 220–221 67, 75, 78, 83, 88, 107, 111, 113, 119, 147, 151, 155,
measurement 8, 18, 21, 25–26, 28, 31, 35, 49, 51, 70, 83, 160, 181, 185, 193, 195, 207, 215, 217, 219, 233, 235,
86–87, 107, 110, 113, 118, 147, 154n3, 155, 158, 180, 239–240, 242–243, 253, 260
185, 193, 195, 207, 209, 213–215, 229, 232, 235, 239, range 37, 39, 41, 43–44, 46, 49–50, 52, 54, 147, 151,
241, 256–257, 259–260, 267, 269–270, 272 180, 257, 268

ratio scale 21, 26–29, 51, 61, 75, 78, 83, 86, 89, 93, 97, sphericity 213–214, 216, 226–229, 233, 235–236, 239,
100, 107, 110, 113–114, 118, 124–125, 127, 138, 147, 243, 246, 251, 253n1, 257, 261
149, 154–155, 158, 180–181, 184, 193, 195, 207, 209, standard deviation 37, 41, 43–44, 46, 49–50, 52, 74–79,
213, 215, 232, 235, 239, 249, 257, 260 88, 90, 143, 178, 181, 185, 203, 210n1, 253, 263n1
reliability 31–35, 107, 110, 151, 194, 206, 209, 232, standard error of the mean 52, 77, 79, 114, 170, 172,
235, 260 196–197, 201
repeated measures ANOVA or within-­subjects ANOVA sums of squares (SS) 41–42, 120–126, 132, 137, 148,
213, 216–217, 221, 223, 225, 228–234, 237, 239, 152, 159, 162, 165, 168, 181, 185, 197–198, 217,
243–245, 251 220–222

sample 21–27, 30, 35, 38–41, 43–46, 48–49, 54–55, 59, Tukey HSD test 127–129, 137
63–71, 73–79, 83, 86–88, 90–93, 96, 101–103, 105, two-­tailed test 74, 76, 83, 94, 99, 101, 110–111, 193,
107, 110, 113, 116, 117–121, 130, 142–143, 147, 151, 198, 201–202
159–160, 176–179, 181–182, 185, 195, 203, 207, 209, two-­way ANOVA or factorial ANOVA 155–162,
214, 216, 222–223, 231–233, 235, 243, 245, 255, 257, 166–168, 171, 173, 175–181, 184–186, 240, 247–249
261, 271 Type I error 59, 70, 76, 84, 107–108, 110, 115–116, 126,
sample size 23, 26, 39–41, 44, 48, 67, 68, 70, 77, 96, 129, 146, 207–208, 223, 237n3, 251
103n2, 119, 120, 158, 160, 187, 197–198, 214, Type II error 59, 70
222, 245
sampling bias 21, 23–24, 35, 88, 107, 119, 195, 233, 243 unbalanced design or unequal sample sizes 93, 119,
scale of measurement (incl. nominal, ordinal, interval, 128, 159–160, 187
ratio) 21, 26–29, 35, 51–52, 61, 75, 78, 83, 86, 98, univariate 146
107, 111, 113, 118, 215
Scheffe test 128–129, 137, 138, 142–143, 148–149, 153, validity 31, 33–35, 84, 107, 147, 151, 180, 206, 209, 232,
186–187, 223, 249 235, 256, 260
significance level 55, 57, 59, 62, 68–71, 73, 76–79, variance 26, 29, 37, 41–44, 52, 55, 83, 87–93, 95–96,
88, 94, 97, 101–102, 107–108, 111, 116, 119, 120, 99–100, 102, 103n1, 107–108, 111, 113–114,
124–125, 133, 134, 138–143, 167, 220, 237n3, 249 119–122, 123, 126, 132, 136–138, 141, 143, 147–148,
simple effects analysis 155, 173–177, 181–182, 249–250 151, 153, 155, 160, 162, 168, 173, 176–177, 181–182,
skewness 37, 46–48, 52, 54, 75, 87, 107, 111, 118, 141, 185–189, 193, 196, 199, 203, 207–213, 216–218,
147, 151, 159, 176, 179, 195, 200–201, 207, 209, 215, 221–222, 228–230, 233, 235–236, 243–244, 246–247,
224–225, 257 251–253, 257, 261–262
source table 120, 121, 122–125, 128, 131–133, 155,
160–161, 165, 168, 172, 175, 213, 217, 220–221 Z scores 37, 49–50, 54, 74
