Design and Analysis in Educational Research Using jamovi
Design and Analysis in Educational Research Using jamovi is an integrated approach
to learning about research design alongside statistical analysis concepts. Strunk and
Mwavita maintain a focus on applied educational research throughout the text, with
practical tips and advice on how to do high-quality quantitative research.
Based on their successful SPSS version of the book, the authors focus on jamovi in this
version because of its accessibility as open source software and its ease of use. The book
teaches research design (including epistemology, research ethics, forming research ques-
tions, quantitative design, sampling methodologies, and design assumptions), introduc-
tory statistical concepts (including descriptive statistics, probability theory, and sampling
distributions), basic statistical tests (like Z and t), and ANOVA designs, including more
advanced designs like the factorial ANOVA and mixed ANOVA.
This textbook is tailor-made for first-level doctoral courses in research design and
analysis. It will also be of interest to graduate students in education and educational
research. The book includes Support Material with downloadable data sets, and new case
study material from the authors for teaching on race, racism, and Black Lives Matter,
available at www.routledge.com/9780367723088.
“The ability to analyze data has never been more important given the volume of informa-
tion available today. A challenge is ensuring that individuals understand the connectedness
between research design and statistical analysis. Strunk and Mwavita introduce fundamental
elements of the research process and illustrate statistical analyses in the context of research
design. This provides readers with tangible examples of how these elements are related and
can affect the interpretation of results.
Many statistical analysis and research design textbooks provide depth but may not
situate scenarios in an applied context. Strunk and Mwavita provide illustrative examples
that are realistic and accessible to those seeking a strong foundation in good research
practices.”
— Forrest C. Lane, Ph.D.,
Associate Professor and Chair
Department of Educational Leadership
Sam Houston State University, USA
“Strunk and Mwavita provide a sound introductory text that is easily accessible to readers
learning applied analysis for the first time.
The chapters flow easily through traditional topics of null hypothesis testing and p-
values. The chapters include hand calculations that assist students in understanding
where the variance is and case studies at the end to develop writing skills related to each
analysis. In addition, software is integrated toward the end of the chapters after readers
have seen and learned to interpret the techniques by hand. Finally, the length of the book
is more manageable for readers as a first introduction to educational statistics.”
— James Schreiber, Ph.D., Professor
School of Nursing, Duquesne University, USA
Design and Analysis in Educational Research Using jamovi
ANOVA Designs
Kamden K. Strunk and Mwarumba Mwavita
First published 2022
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
605 Third Avenue, New York, NY 10158
All rights reserved. No part of this book may be reprinted or reproduced or utilised in
any form or by any electronic, mechanical, or other means, now known or hereafter
invented, including photocopying and recording, or in any information storage or
retrieval system, without permission in writing from the publishers.
Typeset in Minion
by Straive, India
Access the Support Material: www.routledge.com/9780367723088
Contents
Acknowledgments vii
8 Comparing more than two sample means: The one-way ANOVA 113
Appendices 275
References 287
Index 291
Acknowledgments
We wish to thank Wilson Lester and Payton Hoover, doctoral students working with
Kamden Strunk, for their help in locating and sorting through potential manuscripts
for the case studies included in this book. We also thank Dr. William “Hank” Murrah
for his help in locating, verifying, and utilizing code for producing simulated data that
replicate the results of those published manuscripts for the case studies. Finally, thank
you to Payton Hoover and Hyun Sung Jang for their assistance in compiling the index.
We also wish to thank the many graduate students who provided feedback on various
pieces of this text as it came together. Specifically, Auburn University graduate students
who provided invaluable feedback on how to make this text more useful for students
included: Jasmine Betties, Jessica Broussard, Katharine Brown, Wendy Cash, Haven
Cashwell, Jacoba Durrell, Jennifer Gibson, Sherrie Gilbert, Jonathan Hallford, Elizabeth
Haynes, Jennifer Hillis, Ann Johnson, Minsu Kim, Ami Landers, Rae Leach, Jessica Mil-
ton, Alexcia Moore, Allison Moran, Jamilah Page, Kurt Reesman, Tajuan Sellers, Daniel
Sullen, Anne Timyan, Micah Toles, LaToya Webb, and Tessie Williams. For their help
and support throughout the process, we thank Hannah Shakespeare and Matt Bickerton
of Routledge/Taylor & Francis Group.
Kamden Strunk wishes to thank his husband, Cyrus Bronock, for support throughout
the writing process, for listening to drafts of chapters, and for providing much needed
distractions from the work. He also wishes to thank Auburn University and, in par-
ticular, the Department of Educational Foundations, Leadership, and Technology for
supporting the writing process and for providing travel funding that facilitated the col-
laboration that resulted in this book.
Mwarumba Mwavita wishes to thank his wife, Njoki Mwarumba, for the encourage-
ment and support throughout the writing process and for reminding him that he could
do it. He also thanks his sons, Tuzo Mwarumba and Tuli Mwarumba, for cheering him
along the way. This book is dedicated to you, family.
Part I
Basic issues
1
Basic issues in quantitative
educational research
The purpose of this text is to introduce quantitative educational research and explore
methods for making comparisons between and within groups. In this first chapter, we
provide an overview of basic issues in quantitative educational research. While we can-
not provide comprehensive coverage of any of these topics, we briefly touch on several
issues related to the practice of educational research. We begin by discussing research
problems and research questions.
Broad
First, research problems should be broad and have room for multiple research questions
and approaches. For example, racialized achievement gaps are a broad research prob-
lem from which multiple questions and projects might emerge. On the other hand, the
question of whether summer reading programs can reduce racialized gaps in reading
achievement is quite narrow—and, in fact, is a research question. The problem of how
to increase the number of women who earn advanced degrees in STEM fields is a broad
research problem. On the other hand, asking if holding science nights at a local elemen-
tary school might increase girls’ interest in science is quite narrow and is also likely a
research question. While we will work to narrow down to a specific research question,
it is usually best to begin by identifying the broad research problem. Not only does that
help position the specific study as part of a line of inquiry, but it also helps contextualize
the individual question or study within the broader field of literature and prior research.
Meaningful
Research problems should be meaningful. In other words, “solving” the research prob-
lem should result in some real impact or change. While research problems are often
too big to be “solved” in any one study (or perhaps even one lifetime), they should be
meaningful for practical purposes. Closing racialized achievement gaps is a meaningful
problem because those gaps are associated with many negative outcomes for people of
color. Increasing the number of women with advanced degrees in STEM is meaningful
because of its impact on gender equity and transforming cultures of STEM fields.
Theoretically driven
Research problems should also be theoretically driven. These problems do not exist in
a vacuum. As we will emphasize throughout the text, one important part of being a
good researcher is to be aware of, and in conversation with, other researchers and other
published research. One way in which this happens is that other researchers will have
produced and proliferated theoretical models that aim to explain what is driving the
problems they study. Those theories can then inform the refinement and specification
of the problem. Our selection of research problems must be informed by this theoretical
landscape, and the most generative research problems researchers select tend to be those
that address gaps in existing theory and knowledge.
Answerable
Good research questions are answerable within the scope of a single study. Part of this
has to do with the narrowness of the question. Questions that are too broad (have not
been sufficiently narrowed down from the bigger research problem) will be impossible to
answer within a study. Moreover, an answerable question will be one that existing research
methods are capable of addressing. Taking one of our earlier examples of a research prob-
lem, the persistence of racialized achievement gaps in schools, we will give examples of
research questions that are answerable and those that aren’t. An example: What is the best
school-based program for reducing racialized gaps in fifth-grade reading achievement?
This is not an answerable question because no research design can determine the “best”
school-based program. Instead, we might ask whether a summer reading program or an
afterschool reading program were associated with lower racialized reading achievement
gaps in the fifth grade. This is a better question because we can compare these two pro-
grams and determine which of those two was associated with better outcomes.
Meaningful
Finally, as with broad research problems, research questions should be meaningful. In
what ways would it be helpful, lead to change, or make some kind of difference if the
answer to the research question was known? In our example, is it meaningful to know
if one of the two programs might be associated with reductions in racial achievement
gaps in reading at the fifth-grade level? In this case, we might point to this as an impor-
tant moment to minimize reading gaps as students progress from elementary to middle
school environments. We might also point to the important societal ramifications of
persistent racialized achievement gaps and their contribution to systemic racism. There
are several claims we might make about how this question is meaningful. If we have a
meaningful research problem, it is likely that most questions arising from that problem
will also be meaningful. However, researchers should evaluate the meaningfulness of
their questions before beginning a study to ensure that participants’ time is well spent,
and the resources involved in conducting a study are invested wisely.
Often, manuscripts go through multiple rounds of peer review before publication. This is
a sort of quality control for published research. When something is published in a peer-re-
viewed journal, that means it has undergone extensive review by experts in the field who
have determined it is fit to be published. Ideally, this helps weed out questionable studies
or papers using inadequate methods. It is not a perfect system, and some papers are still
published that have major problems with their design or analysis. However, it is an impor-
tant quality check that helps us feel more confident in what we are reading and, later, citing.
Other ways of finding published research exist too. Many public libraries have access
to at least some of the same databases as universities, for example. There are also sources
like Google Scholar (as opposed to the more generic Google search) that pull from a
wide variety of published sources. Google Scholar includes mostly peer-reviewed prod-
ucts, though it will also find results from patent filings, conference proceedings, books,
and reports. While journal databases, like those listed above, will require journals to
meet certain criteria to be included (or “indexed”), Google Scholar has no such inclusion
criteria. So, when using that system, it is worth double-checking the quality of journals in
which articles are published. Another way to keep up to date with journals in your spe-
cific area of emphasis is to join professional associations (like the American Educational
Research Association [AERA], the American Educational Studies Association [AESA],
and the American Psychological Association [APA]) relevant to your discipline. Many of
them include journal subscriptions with membership dues and reading the most recent
issues can help you keep updated.
explain the same problem using different tools. Other times, the differences reflect qual-
itative, quantitative, and mixed-method differences in how research is conceptualized
and presented. The differences also might relate to theories of knowledge or epistemol-
ogies. Those epistemologies shape vastly different ways of writing and different ways
of presenting data and findings or results. In the next section, we briefly describe the
methodological approaches that are common in educational research: qualitative and
quantitative methods. We then turn to questions of epistemology.
In the next section, we discuss the formation of research questions. It is very impor-
tant to realize that some kinds of research questions are well-matched with quantita-
tive methods, and others are well-matched with qualitative methods. We cannot stress
enough that neither kind of research is “better” than the other—they simply answer
different kinds of questions, and both are valuable. We also strongly recommend that
anyone planning to do mostly quantitative research should still learn at least the basics
of qualitative research (and vice versa). There are several well-written introductory texts
on qualitative research that might help build a foundation for understanding qualita-
tive work (Bhattacharya, 2017; Creswell & Poth, 2017; Denzin & Lincoln, 2012), and
students should also consider adding one or more qualitative methods courses to gain
a deeper understanding. If you find yourself asking research questions that are not well
matched with quantitative methods, but might be matched with qualitative methods, do
not change your question. Methods should be selected to match questions, and if your
questions are not quantitative kinds of questions, then qualitative methods will provide
a more satisfying answer. The remainder of this text focuses on quantitative methods,
however. Next, we turn to the question of epistemologies and their potential alignments
with research methods.
Positivism
In positivism, there is an absolute truth that is unchanging, universal, and without
exception. Not only does such an absolute truth exist, but it is possible for us to know
that truth with some degree of certainty. The only limitation in our ability to know the
absolute truth is the tools we use to collect and analyze data. This perspective holds there
is absolute truth to everything—physics, sure, but also human behavior, social relations,
and cultural phenomena. As a point of clarification, sometimes students hear about pos-
itivism and associate it with religious or spiritual notions of universal moral law or uni-
versal spiritual truths. While there are some philosophical connections, positivism is not
about religious or spiritual beliefs, but about the “truth” of how the natural and social
worlds work.
To turn to our guiding questions: What can we know? In positivism, we could know
just about anything and with a high degree of certainty. The only barrier is the adequacy
of our data collection and analysis tools. How do we generate and validate knowledge?
Through empirical observations, verifiable and falsifiable hypotheses, and replication.
While much work in positivistic frames is quantitative, some qualitative approaches and
scholars subscribe to this epistemology as well. So, the tools don’t necessarily have to be
quantitative, but a positivist would be concerned with the error in their tools and analy-
sis, with sampling adequacy, and other issues that might limit claims to absolute truth. In
verifying knowledge, there is an emphasis on reproducibility and replication, subjecting
hypotheses to multiple tests to determine their limits, and testing the generalizability of
a claim. Finally, what is the purpose of research? In positivistic work, the purposes of
research are explanation, prediction, and control.
Post-positivism
In post-positivism, many of the same beliefs and ideas from positivism are present. The
main difference is in the degree of confidence or certainty. While post-positivists believe
in the existence of absolute truth, they are less certain about our ability to know it. It
might never be fully possible, in this perspective, to fully account for all the variation,
nuance, detail, and interdependence in things like human social interaction. While this
perspective suggests there is an absolute truth underlying all human interaction, it might
be so complex that we will never fully know that truth.
What can we know? In theory, anything. However, in reality, our knowledge is very
limited by our perspectives, our tools, our available models and theories, and the finite
nature of human thought. How do we generate and validate knowledge? In all of the same
ways as in positivism. One difference is that in the validation of knowledge, post-posi-
tivistic work tends to emphasize (perhaps even obsess over) error, the reduction of and
accounting for error, and improving statistical models to handle error better. That trend
makes sense because, in this perspective, error is the difference between our knowledge
claims and absolute truth. Finally, what is the purpose of research? As with positivism, it
is explanation, prediction, and control.
Interpretivism
Interpretivism marks a departure from post-positivism in a more dramatic way. In inter-
pretivism, there is no belief in an absolute truth. Instead, truths (plural) are situated in
particular moments, relationships, contexts, and environments. Although this perspec-
tive posits multiple truths, it is worth noting that it does not hold all truth claims as
equal. There is still an emphasis on evidence, but without the idea of a universal or abso-
lute truth. There are a variety of interpretivist perspectives (like constructivism, symbolic
interactionism, etc.), but they all hold that in understanding social dynamics, truths are
situated in dynamic social and personal interactions.
What can we know? We can know conditional and relational truths. Though there is
no absolute universal truth underlying those truth claims, there is no reason to doubt
these conditional or relational truths in interpretivism. How do we generate and validate
knowledge? We can generate knowledge by examining and understanding social rela-
tions, subjectivities, and positional knowledges. In validating knowledge, interpretivist
researchers might emphasize factors like authenticity, trustworthiness, and resonance.
Critical approaches
Critical approaches are diverse and varied, so this term creates a broad umbrella. How-
ever, in general, these approaches hold that reality and truth are subjective (as does
interpretivism) and that prevailing notions of reality and truth are constructed on the
basis of power. Critical approaches tend to emphasize the importance of power, and that
knowledge (and knowledge generation and validation systems) often serve to reinforce
existing power relations. A range of approaches might fall into this umbrella, such as
critical theory, feminist research, queer studies, critical race theory, and (dis)ability stud-
ies. Importantly, each of those perspectives also has substantial variability, with some
work in those perspectives falling more into deconstructivism. Because in reality, there
is wide variability in how people go about doing research, the lines between these rough
categories are often blurred.
What can we know? We can know what realities have been constructed, and we can
critically examine how they were constructed and what interests they serve. How do we
generate and validate knowledge? Through tracing the ways that power and domination
have shaped social realities. There is often an emphasis on locating and interrogating
contradictions or ruptures in social realities that might provide insight into their role in
power relations. There is also often an emphasis on advocacy, activism, and interrupting
oppressive dynamics. What is the purpose of research? To create change in social reali-
ties and interrupt the dynamics of power and oppression.
Deconstructivism
Deconstructivism is another large umbrella term with a lot of diverse perspectives under
it. These might be variously referred to as postmodernism, poststructuralism, decon-
structivism, and many other perspectives. These perspectives generally hold that reality
is unknowable, and that claims to such knowledge are self-destructive. Although truths
might exist (or at least, truth claims exist), they are social constructions that consist of
signs (not material realities) and are self-contradictory. Work in this perspective might
question notions of reality and knowledge or might critique (or deconstruct) the ways
that knowledges and truth claims have been assembled. There is some overlap with criti-
cal perspectives in that many deconstructivist perspectives also hold that the assemblages
of signs and symbols that construe a social reality are shaped by power and domination.
What can we know? We cannot know in this perspective because there is a ques-
tioning of the existence of truth. We can, however, interrogate and deconstruct truth
claims, their origins, and their investment with power. How do we generate and validate
knowledge? In deconstructivist perspectives, researchers often critique or deconstruct
existing knowledge claims rather than generating knowledge claims. This is because of
the view that truth/knowledge claims are inherently contradictory and self-defeating.
What is the purpose of research? To critique the world, knowledge, and knowability. One
of the purposes of deconstructivist research is to challenge those notions, pushing others
to rethink the systems of knowledge that they have accumulated.
Historical considerations
Entire books have been written on the historical context for modern research ethics
regulations. Here, we briefly describe a few key events that led to the system of regula-
tion currently in place in the United States. Of course, other nations have a history that
overlaps with and diverges from that of the United States, but many of the same events
shaped thinking about ethics regulations in many places. One moment often identified
as a key historical marker in research ethics is the Nuremberg Trials that followed World
War II. While these trials are best known as the trials in which Nazi leaders were con-
victed of war crimes, the tribunal also took up the question of research. In Nazi Germany
and occupied territories, doctors and researchers employed by the government carried
out gruesome and inhumane experiments on unwilling subjects, many of whom were
also in marginalized groups (such as Jewish people, LGBTQ people, and Romani peo-
ple). What emerged from the tribunals was a general condemnation of such work but not
much in the way of specific research regulations.
In the United States, the key moment in driving the current systems of regulation
was the U.S. Public Health Service (PHS) Tuskegee Syphilis Study. In the current U.S.
government, the PHS includes agencies like the National Institutes of Health (NIH)
and Centers for Disease Control and Prevention (CDC), plus multiple other parts of
the Department of Health and Human Services. Beginning in 1932, the PHS began a
study of Black men in Tuskegee, Alabama, who were infected with syphilis (Centers for
Disease Control and Prevention, n.d.). At the time, there was no known cure for syph-
ilis and few protective measures. The PHS set out to observe the course of the disease
through death in these men. An important note is that all men in the study were infected
with syphilis before being enrolled in the study (the PHS did not actively infect men in
Tuskegee with syphilis, though the PHS did actively infect people in Guatemala in the
1940s in studies that only became publicly known in 2010). Tuskegee was selected as a site
for the study because it was very remote, very poor, and, in segregated Alabama, entirely
Black. PHS officials believed the site was isolated enough both physically and socially to
allow the study to go on without being discovered or interrupted. Shortly after the study
began, penicillin became available as a treatment, and it was extremely effective in treat-
ing syphilis. By 1943, it was widely available. However, the men enrolled in the Tuskegee
study were neither informed of the existence of penicillin nor treated with the antibiotic.
The PHS Tuskegee Syphilis Study continued for 40 years, finally ending in 1972 after
a whistleblower brought the study to light. The study had long-term ramifications for
medical mistrust among Black populations in the United States, especially in the South
(Hagen, 2005). Those continued effects of the study are associated with lower treatment
seeking and treatment adherence among Black patients in Alabama, for example (Ken-
nedy, Mathis, & Woods, 2007). In 1979, the Belmont Report was issued, leading directly
to the current system of ethical regulations in place in the United States, and we will see
clearly how that study is directly tied to the elements of the Belmont Report.
Beneficence
The principle of beneficence requires the minimization of potential harms and the max-
imization of potential benefits to participants. Put simply—this principle suggests that
participants ought to exit a study at least as well off as they entered it. Researchers should
not engage in activities, conduct, or methods that harm participants. Note that benef-
icence is about harm and benefits to participants, not broader society. This principle
connects to the Tuskegee study in that those researchers judged the harm to participants
as justified by the potential benefit to society. The Belmont Report makes clear that such
reasoning is not appropriate, and the welfare of individual participants must be the key
consideration. This principle requires researchers to think about how to reduce risks
to participants and maximize benefits. We will discuss both risks and benefits in more
detail later in this chapter.
Justice
Finally, the principle of justice has to do with who should bear the burdens of participat-
ing in research in relation to who stands to benefit from research. In the Tuskegee study,
Tuskegee was not selected because the population there stood to benefit more than other
areas from any potential findings. Instead, researchers selected Tuskegee as a site for
the study because it was remote and largely isolated, and because its residents were low
income and Black. That meant that it was unlikely that participants would seek or receive
any outside medical treatment, and it also meant the researchers could operate with little
or no scrutiny. This is an unjust rationale for selecting participants. Doing research with
marginalized or vulnerable communities should be limited to those cases where those
communities will benefit from the results of research. In a related issue, it also means
that research with captive groups like prisoners should only be done if the research is
about their captivity. There is another side to this question, too, because there is a history
of some fields of research recruiting almost exclusively White, wealthy, or male partici-
pants. This is particularly problematic in fields like medical research, where treatments
might affect different groups of people in different and sometimes contradictory ways.
However, federal guidance, relying on the principle of justice, requires the adequate rep-
resentation of women and people of Color in research.
Informed consent
Human research participants must provide informed consent to participate in research.
This means both that participants must consent to their participation in research, and
they must do so with adequate information to decide on their participation. This relates
to the principle of respect for persons. In general, participants should be informed
about the purposes of research, the procedures used in the study, any risks they might
encounter, benefits they will receive, the compensation they will receive, information on
who is conducting the study, and contact information in case of questions or problems.
Informed consent documents cannot contain any language that suggests participants are
waiving any rights or liability claims—participants always retain all of their human and
legal rights regardless of the study design. In most cases, consent is documented through
the use of an informed consent form, typically signed by both the researcher and the
participant. However, signing the form is not sufficient for informed consent. Informed
consent involves, ideally, dialogue in which the researcher explains the information and
the participant is free to ask questions or seek clarification, after which they may give
their consent.
In some cases, documenting consent through the use of a signed form is not appro-
priate, in which case a waiver of documentation might be issued. In situations where the
only or primary risk to participants is that their participation might become known (a
loss of confidentiality) and the only record linking them to the study is the signed con-
sent form, a waiver of documentation might be appropriate. In that case, participants
receive an information letter about the study, which contains all the elements of a consent
form but omits the participant signature.
One important note for people who do research involving children is that children
cannot consent to participate in research. Instead, their parent or legal guardian con-
sents to their participation in the research, and the child assents to participation. This
additional layer of protection (in requiring parental consent) exists because children are
considered to have a diminished capacity for consent. There are some scenarios in which
parental consent might also be waived, such as research on typical classroom practices or
educational tests that do not present more than minimal risk. Children are not the only
group regulations define as having diminished capacity for consent. Prisoners also have
special protections in the regulations because of the strong coercive power to which they
may be subjected. Research involving prisoners must meet many additional criteria, but
the research must be related to the conditions of imprisonment and must be impractical
to do without the participation of current prisoners.
Explanation of risks
An important component of informed consent is the explanation of risks. Participants
must be aware of the risks they could reasonably encounter during the study. A lot of
educational and behavioral research carries very little risk. The standard for what defines
a risk is whether the risks involved in the study exceed those encountered in daily life.
Studies whose risks do not exceed those of daily life are described as no more than mini-
mal risk. In other cases, there are real risks. Common in educational and social research
are risks like the risk of a loss of confidentiality (the risk that people will find out what
a participant wrote in their survey, for example), discomfort or psychological distress
(for example, the experience of anxiety on answering questions about past trauma), and
occasionally physical discomfort or pain (for example, the risk of experiencing pain in a
study that involves exercise). There is a range of other risks that might occur depending
on the type of research, like risks associated with blood collection, or electrographic
measurement.
Important in consideration of those risks, and whether they are acceptable, is the
principle of beneficence. Risks must be balanced by benefits to participants. In studies
involving no more than minimal risk, there is no need for a benefit to offset risks. How-
ever, in research where the risks are higher, the benefits need to match the risk level. In
an extreme example, there are medical trials where a possible risk is death from side
effects of treatment for a fatal disease. However, the benefit might be the possibility of
curing the disease or extending life substantially. In such cases, the benefit might be
deemed to exceed the risk. In most educational research, the risks are not nearly so high,
but when they are more than minimal, benefits must outweigh the risks.
Deception
Before we move on to discuss benefits and compensation, we briefly pause to discuss the
issue of deception. Informed consent requires participants to be aware of the purposes
and procedures of research before participation and that they freely consent to the study.
However, there are cases where a study cannot be carried out if participants fully know
the purposes of the research. For example, if a study aims to examine the conditions
under which people obey authority figures, if they explain that purpose to participants,
the study might be spoiled. Participants who know they are being evaluated for obedi-
ence might be more likely to defy instructions, for example. There are, then, occasions
where deception is allowed. The first criterion is that the study cannot be carried out
without deception. The scope of the deception must be limited to the smallest possible
extent that will allow the study to proceed. Finally, the risks associated with the decep-
tion must be outweighed by benefits to participants. Deception always increases the risks
to participants, if only for the distress that being deceived can cause. In most cases, a
study involving deception must also provide a debriefing—a session or letter in which
participants are fully informed of the actual purposes and procedures after the study.
Deception in educational research is rare, though the regulations do allow it under very
limited circumstances.
gym access). In some cases, compensation might also take the form of academic credit,
such as gaining extra credit in a course for research participation. Course credit is often
trickier because compensation must be equal for all participants, and courses often have
very different grading systems. Moreover, typically, any offer of course credit must be
matched with an alternative way to earn that course credit to avoid coercion. We will
discuss compensation more in the coming chapters as it relates to sampling strategies,
but compensation (or incentives) is allowed, so long as the amount is in line with the
requirements of the study.
variation, researchers should always consult their local IRB information before propos-
ing and conducting a study. In general, though, most IRBs follow a similar process.
First, researchers must design a study and describe that study design in detail. Most
IRBs provide a form or questionnaire to guide researchers in describing their study.
Typically, those forms ask for details about the study purpose, design, and who will be
conducting the research. They will also ask about risks and benefits, as well as compensa-
tion. IRBs typically require researchers to attach copies of recruitment materials, consent
documents, and study materials to the IRB application so that reviewers can evaluate the appro-
priateness of those documents. IRB review falls into one of three categories: exempt,
expedited, and full board. There is much variation in how different IRBs handle those
categories, but typically exempt proposals are reviewed most quickly. Exemptions can
fall into one of several categories, but exempt studies usually involve no more than minimal risk and
anonymous data collection. Expedited applications are often reviewed more slowly than
exempt because they require a higher-level review than exempt applications. There are
multiple categories of expedited review in the federal Common Rule as well, but often
school-based research can qualify as expedited, depending on the specifics of the study.
Finally, full-board reviews will be reviewed by an entire IRB membership at their regu-
lar meetings. Most IRBs meet once per month and will usually require several weeks of
notice to review a proposal. As a result, the full-board review can take several months.
Regardless of the level of review, it is very common for the IRB to request revisions to the
initial proposal to ensure full compliance with all regulations. When planning a study,
it is a good idea to plan in time for the initial review and one or two rounds of revision,
at a minimum.
We have avoided being overly specific about the IRB process because of how much it
varies across institutions. However, when planning a study, talk with people at your insti-
tution about the IRB process. Read your local IRB website or other documentation, and
always use their forms and guidance in designing a study. Once your IRB approves the
study, recruiting can begin. Researchers must follow the procedures they outlined in their
IRB application exactly. Any deviations from the approved procedures can result in sanc-
tions from the IRB, which can be quite serious. However, in the event a change to the
procedures is necessary, IRBs also have a process for requesting a modification to the orig-
inally approved procedures. In most institutions, modifications are reviewed quite quickly.
CONCLUSION
In this chapter, we discussed a range of basic issues in thinking about educational
research. We do not intend this chapter to be an exhaustive treatment of any of these
issues but to serve as an overview of the range of considerations in educational research.
We encourage students who feel less familiar or comfortable with these topics to seek
out more information on them. Questions of methodology, epistemology, and ethics
can be big and involve many considerations. We have recommended source materials
for several of these topics to allow further exploration. For more on research ethics in
your setting, consult local regulations and guidance. The purpose of this textbook is to
provide instruction on quantitative educational research, and in the next chapter, we will
begin exploring basics in educational statistics.
2
Sampling and basic
issues in research design
Sampling strategies
When researchers design a study, they have to recruit a sample of participants. Those
participants are sampled from the population. First, researchers must define the pop-
ulation from which they wish to sample. Populations can be quite large, sometimes in
the millions of people. For example, imagine a researcher is interested in students’ tran-
sitions during the first year of college. There are almost 20 million college students in
the United States alone (National Center for Education Statistics, 2018). As many as
4 million of those students might be in their first year. If a researcher were to sample
250 first-year students, how likely is it that the results from those 250 would generalize to the
4 million? That depends on the sampling strategy.
Random sampling
In an ideal situation, researchers would use random sampling. In random sampling,
every member of the population has an equal probability of being in the study sample.
Imagine we can get a list of all first-year college students in the United States, complete
with contact information. We could randomly draw 250 names from that list and ask
them to participate in our study. We are likely to find that some of those students will not
respond to our invitation. Others might decline to participate. Still others might start the
survey but drop out partway through it. Therefore, even with a perfect sampling strategy
designed to produce a random sample, we still face a number of barriers that make a true
random sample almost impossible. Of course, the other problem with this scenario is
that it is nearly impossible to get a list of everyone in the population. Random sampling
is impractical in research with human participants, but there are several other strategies
that researchers commonly use.
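To make this concrete, here is a minimal Python sketch of a simple random draw (the book itself uses jamovi; this code and its invented roster are purely illustrative):

```python
import random

# Hypothetical roster of the population. In reality, no complete list
# of all first-year students exists, which is one reason true random
# sampling is so rare in research with human participants.
population = [f"student_{i}" for i in range(10_000)]

random.seed(42)  # fixed seed so the draw is reproducible
sample = random.sample(population, k=250)  # every student equally likely

print(len(sample), sample[:3])
```

Even a draw like this only produces a random invitation list; nonresponse and dropout, as described above, would still keep the realized sample from being truly random.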
participants from Alabama might set a quota of 129 Black women for that sample (taking
the population percentage and multiplying by the target sample size). The researcher
would then intentionally seek out Black women until 129 were enrolled in the study. The
researcher would set quotas for every demographic category and then engage in targeted
recruiting of that group until the quota was met. The end result is a sample that matches
the population very closely in demographic categories. However, the process of produc-
ing that representative sample involved many targeted recruiting efforts, which might
introduce sampling bias. However, this method is widely used to produce samples that
approach representativeness, especially in large-scale and large-budget survey research.
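The quota arithmetic itself is straightforward, as this short sketch shows. The population shares and the target sample size of 500 are invented for illustration (chosen so that the quota for Black women works out to the 129 mentioned above):

```python
# Hypothetical population shares for four demographic groups and an
# assumed target sample size; both are invented for illustration.
population_shares = {
    "Black women": 0.258,
    "Black men": 0.230,
    "white women": 0.270,
    "white men": 0.242,
}
target_n = 500

# Quota = population percentage multiplied by the target sample size.
quotas = {group: round(share * target_n)
          for group, share in population_shares.items()}

print(quotas)
# {'Black women': 129, 'Black men': 115, 'white women': 135, 'white men': 121}
```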
Snowball sampling
Another method for reaching a population that is not easily accessible is
snowball sampling. Examples might include members of a secretive group, people with a
stigmatized health issue, or members of a group subject to legal restrictions or targeted by
law enforcement. A snowball sample begins with the researcher identifying a small num-
ber of participants to directly recruit. That initial recruiting might involve relationship and
trust building work as well. For example, if a researcher was interested in surveying undoc-
umented immigrants, they might find this population difficult to directly reach because of
legal and social factors. So, the researcher might need to invest in building relationships
with a small number or local group of undocumented immigrants. In that example, it would
be important for the researcher to build some genuine, authentic relationships and to prove
that they are trustworthy. Participants might be skeptical of a researcher in this circum-
stance, wondering about how disclosing their documentation status to a researcher might
impact their legal or social situation. It would be important for the researcher to prove they
are a safe person to talk to. After initial recruiting in a snowball sample, participants are
asked to recruit other individuals that qualify for the study. This is useful because, in some
circumstances, individuals who are in a particular social or demographic group might be
more likely to know of other people in that same group. It can also be useful because, if
the researcher has done the work of building relationships and trust, participants may be
comfortable vouching for the researcher with other potential participants. This approach
is used in quantitative and qualitative research. One drawback to snowball sampling is it
tends to produce very homogeneous samples. Because the recruiting or sampling effort is
happening entirely through social contacts, the participants who enroll in the study tend
to be very similar in sociodemographic factors. In some cases, that similarity is acceptable,
but this only works when the criteria for inclusion in the study are relatively narrow.
Purposive sampling
In this sampling method a researcher selects the sample using their experience and
knowledge of the target population or group. This sampling method is also termed judg-
ment sampling. For example, if we are interested in a study of gifted students in middle
schools, we can select schools to study based on our knowledge of gifted schools. Thus,
we rely on prior knowledge to select the schools that meet specific criteria, such as pro-
portion of students who go into high school and take advanced placement (AP) courses
and proportion of teachers with advanced degrees. This is also sometimes referred to as
targeted sampling, because the researcher is targeting very particular groups of people,
rather than engaging in a broader sampling approach or call for participants.
Convenience sampling
Probably the most common sampling method is convenience sampling. However,
though it is common, it is also one of the more problematic approaches. Convenience
sampling is, as the name implies, a sampling method where the researcher selects par-
ticipants who are convenient to them. For example, a faculty member might survey stu-
dents in a colleague’s classes. In fact, a problem in some of the published research is that
many samples are comprised entirely of first-year students at research universities in
large lecture classes. Those samples are convenient for many faculty members. They may
be able to gain several hundred responses from a single class section without leaving
their building. So, the appeal is clear—convenience samples are quicker, easier, and less
costly to obtain. However, these samples are usually heavily biased, meaning the findings
from such a sample are unlikely to generalize to other groups. These samples are not
representative samples. There is a place for convenience samples, but researchers should
carefully consider whether the convenience and ease of access are worth the cost to the
trustworthiness of the data and carefully evaluate sampling bias.
SAMPLING BIAS
When samples are not random (which they pretty much never are), researchers must
consider the extent to which their sample might be biased. Sampling bias describes the
ways in which a sample might be divergent from the population. As we have alluded
to already in this chapter, researchers often aim for representativeness in their samples
so that they can generalize their results. Sampling bias is, in a sense, the opposite of
representativeness. The more biased a sample, the less representative it is. The less repre-
sentative the sample, the less researchers are able to generalize the results outside of the
sample. Here, we briefly review some of the types of sampling bias to give a sense of the
kinds of concerns to think through in designing a sampling strategy.
Self-selection bias
Humans participate in research voluntarily. But most people invited to participate in a
study will decline. Researchers often think about a response rate of around 15% as being
relatively good. But that would mean 85% of people invited to respond did not. In other
words, compared to the population, people who volunteer for research might be con-
sidered unusual. Perhaps there are some characteristics that volunteers have in common
that differ from non-volunteers. In other words, the fact that participants self-select to
participate in studies means that their results might not generalize to non-volunteers.
This bias is especially pronounced when the topic of the study is relevant to volunteer
characteristics. For example, customer satisfaction surveys tend to accumulate responses
from people who either had a horrible experience or an amazing experience—people
without strong feelings about the experience as a customer are less likely to respond. The
fact that people whose experience was neither horrible nor wonderful are less likely to
respond biases the results. In another example, if a researcher is studying procrastina-
tion, they might miss out on participants who procrastinate at high levels because they
might never get around to filling out the survey. Self-selection is always a concern, but
particularly when the likelihood to participate is related to factors being measured in
the study.
Exclusion bias
Researchers always set inclusion and exclusion criteria for a given sample. For example,
a researcher might limit their study to current students or to certified teachers. Setting
those criteria is important and necessary. But sometimes the nature of the exclusion cri-
teria can exclude participants in ways that bias the results. For example, many researchers
studying college students will exclude children from their samples. They do so for reasons
related to ethical regulations (specifically, to avoid seeking parental consent) that would
make the study more difficult to complete. However, it may be that college students who
are not yet adults (say, a 17-year-old first-year student) might have perspectives and expe-
riences that are quite different from other students. Those perspectives get lost through
excluding children and can bias the results. It might make sense to accept that limitation,
that the results wouldn’t generalize to students who enroll in college prior to 18 years of
age, but researchers should consider the ways that exclusion criteria might bias results.
Attrition bias
Attrition bias is a result of participants leaving a study partway through. Most frequently,
this happens in longitudinal research, where participants might drop out of a study after
the first part of the study, before later follow-up measures are completed. In some cases,
this happens because participants cannot continue to commit their time to being part
of the study. In other cases, it might happen because participants move away, no longer
meet inclusion criteria, or become unavailable due to illness or death. For example, in
longitudinal school-based research, researchers might follow students across multiple
years. Students might move out of the school district over time, and this might be more
likely for some groups of students than others. Those students who move away cannot
be included in the analysis of change across years, but likely share some characteristics
that are also related to their leaving the study. In other words, the loss of those data via
attrition biases the results.
Another way that attrition can happen is via participants dropping out of a survey
partway through completing it. Perhaps the survey was longer than the participant
expected, or something suddenly came up, but the participant has chosen not to finish
participating in a single-time measurement. This is most common in survey research,
where participants might give up on the survey because they found it too long. It may
be that the participants who stopped halfway through share characteristics that both led
them to leave the study and were relevant to the study outcomes. Again, in this case, the
loss of those participants may bias the results.
quantitative work. Very few samples’ results would generalize to the entire population,
but researchers should think about how far their results might generalize. One way to
assess the generalizability of results is to evaluate sampling biases.
Another issue in generalizability is related to sample size. The number of people in
a sample affects multiple layers of quantitative analysis, including factors we will come
to in future chapters like normality and homogeneity of variance. But the sample size
also impacts generalizability. Very small samples are much less likely to be representative
of the population. Even by pure chance in a random sample, smaller samples are more
likely to be biased. As the sample size increases, it will likely become more representative.
In the limit, a sample that grows to include the entire population is, by definition, perfectly representative.
As a general rule, there are some minimum sample sizes in quantitative research. We’ll
return to these norms in future chapters. Most of our examples in this text will involve
very small, imaginary samples to make it easier to track how the analyses work. But
in general, samples should have at least 30 people for a correlational or within-subjects
design. When comparing two or more groups, the minimum should be at least 30 peo-
ple per group (Gay et al., 2016). These are considered to be minimum sample sizes, and
much larger samples might be appropriate in many cases, especially where there are mul-
tiple variables under analysis or the differences are likely to be small (Borg & Gall, 1979).
LEVELS OF MEASUREMENT
The data we gather can be measured at several different levels. In the most basic sense, we
think of variables as being either categorical or continuous. Categorical variables place
people into groups, which might be groups with no meaningful order or groups that have
a rank order to them. Continuous variables measure a quantity or amount, rather than a
category. There are two types of categorical variables: nominal and ordinal. Likewise, there
are two types of continuous variables: interval and ratio. For the purposes of the analyses
discussed in this book, differentiating between interval and ratio data will not be impor-
tant. However, below we introduce each level of measurement and provide some examples.
Nominal
Nominal data involve named categories. Nominal data cannot be meaningfully ordered.
That is, they are categorical data with no meaningful numeric or rank-ordered values.
For example, we might categorize participants based on things like gender, city of res-
idence, race, or academic program. These categories do not have meaningful ordering
or numbering within them—they are simply ways of categorizing participants. It is also
important to note that all of these categories are also relatively arbitrary and rely on social
constructions. Nominal data will often be coded numerically, even though the numbers
assigned to each group are also arbitrary. For example, in collecting student gender, we
might set 1 = woman, 2 = man, 3 = nonbinary/genderqueer, 4 = an option not included
in this list. There is no real logic to which group we assign the label of 1, 2, 3, or 4. In fact,
it would make no difference if instead we labelled these groups 24, 85, 129, and 72. The
numeric label simply marks which groups someone is in—it has no actual mathematical
or ranking value. However, we will usually code groups numerically because software
programs, such as jamovi, cannot analyze text data easily. So, we code group membership
with numeric codes to make it easier to analyze later on. In another example, researchers
in the United States often use racial categories that align to the federal Census categories.
They do so in order to be able to compare their samples to the population for some region
or even the entire country. So, they might code race as 1 = Black/African American,
2 = Asian American/Pacific Islander, 3 = Native American/Alaskan Native, 4 = Hispanic/
Latinx, 5 = White. Again, the numbering of these categories is completely arbitrary and
carries no real meaning. They could be numbered in any order and accomplish the same
goal. Also notice that, although these racial categories are widely used, they are also prob-
lematic and leave many racial and ethnic groups out altogether. For most of the analyses
covered in this text, nominal variables will be used to group participants in order to
compare group means. Another example of a nominal variable would be experimental
groups, where we might have 1 = experimental condition and 0 = control condition.
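As a brief illustration of numeric coding for nominal data (pandas is used here instead of jamovi purely for demonstration, and the data are invented):

```python
import pandas as pd

# Arbitrary numeric codes for a nominal variable; the numbers mark
# group membership only and carry no mathematical or ranking value.
gender_codes = {1: "woman", 2: "man", 3: "nonbinary/genderqueer",
                4: "an option not included in this list"}

responses = pd.Series([1, 2, 1, 3, 1, 4, 2])
labels = responses.map(gender_codes)

# Frequencies are meaningful for nominal data; a mean of the raw codes
# is not, since relabeling the groups (say, 24, 85, 129, 72) would
# change it without changing the data.
print(labels.value_counts())
```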
Ordinal
Ordinal variables also involve categories, but categories that can be meaningfully ranked.
For example, we might label 1 = first year, 2 = sophomore year, 3 = junior year, 4 = senior
year to categorize students by academic classification. The numbers here are meaningful
and represent a rank order based on seniority. Letter grades might be labelled as 1 = A, 2 = B,
3 = C, 4 = D, 5 = F. The numbering again represents a rank order. Grades of A are considered
best, B are considered next best, and so on. Other examples of ordinal data might include
things like class rank, order finishing a race, or academic ranks (i.e., assistant, associate, full
professor). The analyses in this book will not typically include ordinal data, but there are
constructs that exist as ordinal and other sets of analyses that are specific to ordinal data.
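A short sketch of the same idea for ordinal data (again using pandas only for illustration): an ordered categorical preserves the rank order of letter grades without pretending the distances between grades are equal.

```python
import pandas as pd

# Letter grades as an ordinal variable: the categories have a
# meaningful rank order but no defined distance between them.
grade_order = ["F", "D", "C", "B", "A"]
grades = pd.Series(["B", "A", "C", "A", "F"]).astype(
    pd.CategoricalDtype(categories=grade_order, ordered=True)
)

print(grades.sort_values().tolist())  # ['F', 'C', 'B', 'A', 'A']
print(grades.min(), grades.max())     # F A
```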
Interval
Interval data are continuous and measure a quantity. Interval data should have the same
interval or distance between levels. For example, if we measure temperature in Fahrenheit,
the difference between 50 and 60 degrees is the same as the difference between 60 and 70
degrees. Temperature is a measure of heat, and it’s worth noting that zero degrees does
not represent a complete absence of heat. In fact, many locations regularly experience
outdoor temperatures well below zero degrees. Interval data do not have a true, meaning-
ful absolute zero—zero represents an arbitrary value. Another characteristic of interval
data is that ratios between values may not be meaningful. For example, in comparing 45
degrees and 90 degrees Fahrenheit, 45 degrees would not be exactly half the amount of
heat of 90 degrees, even though 45 is half of 90. The distance between increments (in this
case, degrees) is the same, but because the scale does not start from a true, absolute zero,
the ratios are not meaningful. Other examples of interval-level data might include things
like scores on a psychological test, grade point averages, and many kinds of scaled scores.
Ratio
The difference between ratio and interval data can feel confusing and a bit slippery. Luck-
ily, for the purposes of analyses covered in this book, the difference won’t usually matter.
Most analyses—and all of the analyses in this text—will use either ratio or interval data,
making them largely interchangeable. But ratio data do have some characteristics that
set them apart from interval. The easiest to see is probably that ratio data have a true,
meaningful, absolute zero. That is, ratio data have a value of zero that represents the com-
plete lack of whatever is being measured. For example, if we measure distance a person
runs in a week, the answer might be zero for some participants. That would mean they
had a complete lack of running. Similarly, if we measure the percentage of students who
failed an exam, it might be that 0% failed the exam, representing a complete lack of stu-
dents who failed the exam. Those values of zero are meaningful and represent an absolute
absence of the variable being measured. Because ratio data have a true, meaningful, abso-
lute zero, the ratios between numbers become meaningful. For example, 25% is exactly
half as many as 50%. Other examples of ratio data include time on task, calories eaten,
distance driven, heart rate, some test scores, and anything reported as a percentage.
Likert-type data
For each statement, indicate your level of agreement or disagreement using the provided scale:
1 (Strongly disagree)   2   3 (Somewhat disagree)   4   5   6   7 (Strongly agree)
Likert-type scales involve statements (or “stems”) to which participants are asked to
react using a scale. The most common Likert-type scales have seven response options (as in
the example above), though the number can vary anywhere from three to ten.
The response options represent a gradient, such as from strongly agree to strongly dis-
agree. As such, those individual responses might be considered ordinal data. We could
certainly order these, and in some sense have done so with the numeric values assigned
to each label. When a participant responds to this item, we could think of that as them
self-assigning to a ranked category (for example, if they select 3, they are reporting they
belong to the category Somewhat Disagree that is ranked third). Some methodologists
feel quite strongly about calling all Likert-type data ordinal.
However, there is a complication for Likert-type data. Namely, we almost never use
a single Likert-type item in isolation. More typically, researchers average or sum a set of
multiple Likert-type items. For example, perhaps we asked participants a set of six items
about their enjoyment of quantitative coursework. We might report an average score of
those six items and call it something like “quantitative enjoyment.” That scaled score is
more difficult to understand as ordinal data. Perhaps a participant might average 3.67
across those six items. A score of 3.67 doesn’t correspond to any category on the Lik-
ert-type scale. The vast majority of researchers will choose to treat those average or total
scores as interval data (not ratio, in part because there is no possible score of zero on a
Likert-type scale). There is some disagreement about this among methodologists, but
most will treat average or total scores as interval, especially for the purposes of the analy-
ses covered in this book. In more advanced psychometric analyses, perhaps especially in
structural equation modelling and confirmatory factor analysis, the distinction becomes
more important and requires more thought and justification. But for the purposes of the
analyses in this text, it will be safe to treat total or average Likert-type data as interval.
Operational definitions
In designing a study, researchers will determine what variables they are interested in
measuring. For example, they might want to measure self-efficacy, student motivation,
academic achievement, psychological well-being, racial bias, heterosexism, or any num-
ber of other ideas. An important first step in designing good research is to carefully define
what those variables mean for the purpose of a given study. When researchers say, for
example, they want to measure motivation, they might mean any of several dozen things
by that. There are at least four major theories of human motivation, each of which might
have a dozen or more constructs within them. A researcher would need to carefully define
which theory of motivation they are mobilizing and which variables/constructs within
that theory they intend to measure. If a researcher wants to measure racial bias, they will
need to define exactly what they mean by racial bias and how they will differentiate vari-
ous aspects of what might be called bias (implicit bias, discrimination, racialized beliefs,
etc.). If a researcher wants to study academic achievement, they might select grade point
averages (which are very problematic measures due to variance from school to school
and teacher to teacher, along with grade inflation), standardized test scores like SAT or
ACT (which are problematic in that they show evidence of racial bias and bias based on
income), or a psychological instrument like the Wide-Range Achievement Test (WRAT,
which also shows some evidence of cultural bias). However, how the researcher defines the
variable and measures it will affect the nature of the results and what they mean. The way that
researchers define the variable or construct of interest is referred to as the operational
definition. It’s an operational definition because it may not be perfect or permanent, but
it is the definition from which the researcher is operating for a given project.
Part of operationally defining a variable involves deciding how it will be measured.
Many variables could be measured in multiple ways. In fact, for any given variable, there
might be dozens of different measures in common use in the research literature. Each
will differ in how the variable is defined, what kinds of questions are asked, and how
the ideas are conceptualized. Researchers have a tendency at times to write about vari-
ables and measures as if they were interchangeable. They might include statements like,
“Self-efficacy was higher in the experimental group,” when what they actually mean is
that a particular measure for self-efficacy in a particular moment was higher for the
experimental group. As we advocate later in this chapter, most researchers will be well
served to select existing measures for their variables. But the selection of a way to meas-
ure a variable is a part of, and should align with, the operational definition.
Random assignment
Another key term in research design is random assignment. In random assignment,
everyone in the study sample has an equal probability of ending up in the various
experimental groups. For example, in a design where one group gets an experimental
treatment and the other group gets a placebo treatment, each participant would have a
50/50 chance of ending up in the experimental vs. control group. This is accomplished
by randomly assigning participants to groups. In many modern studies, the random
assignment is done by software programs, some of which are built into online survey
platforms. Random assignment might also be done by drawing or by placing participants
in groups by the order they sign up for the study (e.g., putting even-numbered sign-ups in
group 1 and odd in group 2).
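As an illustration, here is a minimal sketch of random assignment in Python (hypothetical participant IDs; real studies would typically rely on a survey platform's built-in randomizer):

    import random

    # Hypothetical list of participant IDs.
    participants = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]

    random.shuffle(participants)        # put participants in a random order
    half = len(participants) // 2
    experimental = participants[:half]  # first half -> experimental condition
    control = participants[half:]       # second half -> control condition

    print("Experimental:", experimental)
    print("Control:", control)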
Random assignment matters for the kinds of inferences a researcher can draw from
a given set of results. By randomly assigning participants to groups, theoretically their
background characteristics and other factors are also randomized to groups. So, the only
systematic difference between groups will be the treatment or conditions supplied by
group membership. As a result, the inferences can be stronger. We would feel more con-
fident that differences between groups are due to group membership (or experimental
treatment) when the groups were randomly assigned, because there are theoretically no
other systematic differences between the groups. When researchers use intact groups
(groups that are not or cannot be randomly assigned), the inferences will be somewhat
weaker. For example, if we compare academic achievement at School A, which uses com-
puterized mathematics instruction, vs. School B, which uses traditional mathematics
instruction, there might be lots of other differences between the two schools other than
whether they use computerized instruction. Perhaps School A also has a higher budget,
or students with greater access to resources, or more experienced teachers. It would be
harder, given these intact groups, to attribute the difference to instruction type than if
students were randomly assigned to instruction type.
Random assignment, though, is not sufficient to establish a causal claim (that a cer-
tain variable caused the outcome). Causal claims require robust evidence. For a causal
claim to be supported, there must be: (1) A theoretical rationale for why the potential
causal variable would cause the outcome; (2) The causal variable must precede the out-
come in time (which usually means a longitudinal design); (3) There must be a reliable
change in the outcome based on the potential causal variable; (4) All other potential
causal variables must be eliminated or controlled (Pedhazur, 1997). Random assignment
helps with criterion #4, but the others would also need to be met for a causal claim.
One distinction to be clear about, as it can be confusing for some students, is that
random assignment and random sampling (described earlier in this chapter) are two
separate processes that are not dependent on one another. Random sampling means
everyone in the population has an equal chance of being in the sample. Random assign-
ment means everyone in the sample has an equal chance of being in each group. They
both involve randomness but for separate parts of the process.
Score reliability
Reliability is essentially about score consistency (Thorndike & Thorndike-Christ, 2010).
There are different ways of thinking about the consistency of a test score, though. It
might be consistent across time, consistent within a time point, or consistent across
people. When researchers write about reliability, they are most commonly referring to
internal consistency reliability. Here, though, we briefly review several forms of score
reliability. First, it is important to know that reliability is not a property of tests, but of
scores. A test cannot be reliable, and it is always inappropriate to refer to a test as being
reliable (Thompson, 2002). Rather, test scores can be reliable and may be tested for reli-
ability in one of several ways.
Test–retest reliability
Test–retest reliability is a measure of consistency of test scores we obtain from participants
between two times. We also refer to this form of reliability as the stability of our measure. The
correlation between the two scores is the estimate of the test–retest reliability coefficient.
To calculate this reliability estimate, we would give the same scale or test to the same
people on two occasions. The assumption here is that there is no change in the construct
we are measuring between the two occasions, so any differences in scores are attributed
to unreliability of the test or scale. However, this is not always an accurate assumption.
Some constructs are not stable across time. For example, if we measure students’ anxiety
the day before the midterm and the first day of their winter holiday break, we would not
expect to find similar scores. Anxiety is a construct that changes rapidly within people.
On the other hand, if we measured personality traits using a Big Five inventory, we would
expect to find very similar scores across time. Another issue to consider is practice effects
on repeated administrations. If we give a memory test to a participant and then give it again a
week later, using the same set of items to memorize, the participant will likely do much
better the second time. This again is not really an issue of unreliability but of practice and
the fact that the participant continued to learn. As a result, not all scales or variables are
suitable for a test–retest reliability estimate. It applies only to constructs that are stable
(those that don’t change much within a person over time) and that do not have strong practice
effects. This coefficient is reported on a zero to one scale, with numbers closer to one
being better. An acceptable test–retest reliability coefficient will vary based on the kind of
test and its application but might be .6 or higher in many cases.
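Because the test–retest estimate is simply the correlation between two administrations, it is easy to sketch in Python (hypothetical scores; numpy is assumed to be available):

    import numpy as np

    # Hypothetical scores for the same six participants at two time points.
    time1 = np.array([12, 15, 9, 20, 17, 11])
    time2 = np.array([13, 14, 10, 19, 18, 12])

    # The Pearson correlation between the two administrations is the
    # test-retest reliability estimate.
    r = np.corrcoef(time1, time2)[0, 1]
    print(round(r, 3))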
Internal consistency reliability
Internal consistency reliability estimates the consistency of scores within a single administration of a test or scale, based
on the correlation among the test items. In the past, researchers would calculate this by
randomly dividing the items on a test into two equal sets (split halves) and then calcu-
lating the correlation between those halves. This was called split halves reliability, and it
still shows up, though rarely, in published work. In modern research, most researchers
report a coefficient like coefficient alpha, more commonly known as Cronbach’s alpha.
This measure is better than split halves reliability because it is equivalent to the average
of all possible split halves. Like split halves reliability, it is based on the correlation among the test items. Often
reported simply as α, this coefficient ranges from zero to one, with higher numbers rep-
resenting higher internal consistency. Most researchers will consider an α of .7 or higher
acceptable, and .8 or higher good (DeVellis, 2016).
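Coefficient alpha can be computed directly from the item variances and the variance of the total score. Below is a minimal sketch of the standard alpha formula in Python with hypothetical item responses (this is the textbook formula, not jamovi's internal routine):

    import numpy as np

    # Hypothetical responses: rows are participants, columns are items.
    items = np.array([
        [3, 4, 3, 5],
        [2, 2, 3, 2],
        [4, 5, 4, 4],
        [3, 3, 2, 3],
        [5, 4, 5, 5],
    ])

    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the total scores

    # alpha = k/(k - 1) * (1 - sum of item variances / total-score variance)
    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(round(alpha, 3))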
Content validity
The first question for validity is about the content of the test or scale. Is that content
representative of the content the scale is meant to assess? For example, if a scale is meant
to measure depression, is the information on that scale representative of the types of
symptoms or experiences that depression includes? If it is meant to be a geometry test,
does the test include content from all relevant aspects of geometry? Researchers would
also be interested in demonstrating the test has no irrelevant items (such as ensuring
the geometry test does not include linear algebra questions). Usually, content validity is
assessed by ratings from subject matter experts who determine whether each item in a
test or scale is content-relevant, and whether there are any aspects or facets of content
that have not been included.
Criterion validity
Criterion validity basically asks whether the test or scale score acts like the variable it is
meant to represent. Does it correlate with scores that variable is supposed to correlate
with? Does it predict outcomes that variable should predict? Can it discriminate between
groups that variable would differentiate between? If a researcher is evaluating a depres-
sion scale, they might test whether it correlates with related mental health outcomes, like
anxiety, at rates they would expect. They might also test whether there are differences in
the test score between people that meet clinical criteria for a major depressive diagno-
sis and people that do not. They might test whether the scale scores predict outcomes
associated with depression like sense of self-worth or suicidal ideation. Because “depres-
sion” should act in those ways, the researcher would test if this scale, meant to measure
depression, acts in those ways. This is a way of determining whether the interpretation of this
score as an indicator of depression is valid. Because this is not a text on psychometric
theory, we will not go into detail further on establishing criterion validity. But it might
involve predictive validity, discriminant validity, convergent validity, and divergent
validity. These are various ways of assessing criterion validity.
Structural validity
Another way researchers evaluate validity issues is via structural validity. This form of
validity evidence asks whether the structure of a scale or test matches what is expected
for the construct or variable. In our example of a depression scale, many psychologists
theorize three main components to depression: affective, cognitive, and somatic. So, a
researcher might analyze a scale meant to measure depression to determine if these three
components (often called factors in such analyses) emerge. They might do this with anal-
ysis such as principal component analysis (PCA), exploratory factor analysis (EFA), or
confirmatory factor analysis (CFA). In each of those approaches, the basic question will
be whether the structure that emerges in the analysis matches the theoretical structure
of the variable being measured.
Construct validity
There are several other ways that researchers might evaluate validity. However, their
goal will be to make a claim of construct validity. Construct validity would mean that
the scale can validly be interpreted and used in the ways researchers claim—that the
claimed interpretation and use of the scores is valid. However, construct validity cannot
be directly assessed. Instead, researchers make arguments about construct validity based
on various other kinds of validity evidence. Often, when they have multiple forms of
strong validity evidence, such as those reviewed above, they will claim construct validity
based on that assembly of evidence.
In choosing an existing measure, researchers might ask: Are there published reliability and
validity studies available? What is the range of reliability estimates in those published
papers? While a strong track record doesn’t guarantee future success, and score reliabil-
ity in particular is very sample dependent, if a measure has a fairly consistent record of good
reliability and validity evidence, it will likely produce reliable scores that can be validly
interpreted in similar future studies.
CONCLUSION
In this chapter, we have introduced sampling, sampling bias, levels of measurement,
basic issues in research design, and very briefly introduced some measurement concepts.
These basic concepts are important to understand as we move forward into statistical
tests. We will return to many of these concepts over and over in future chapters to under-
stand how to apply various designs and statistical tests. In the next chapter, though, we
will introduce basic issues in educational statistics.
3
Basic educational statistics
Central tendency 38
Mean 38
Median 39
Mode 40
Comparing mean, median, and mode 40
Variability 40
Range 41
Variance 41
Standard deviation 43
Interpreting standard deviation 43
Visual displays of data 44
The normal distribution 46
Skew 47
Kurtosis 47
Other tests of normality 48
Standard scores 49
Calculating z-scores 49
Calculating percentiles from z 49
Calculating central tendency, variability, and normality estimates in jamovi 50
Conclusion 54
Note 55
In this chapter, we will discuss the concepts of basic statistics in education. We discuss
two types of statistics: central tendency and variability. We also describe ways to display
data visually. For some students, these concepts are very familiar. Perhaps a previous
research class or even a general education class included some of these concepts. How-
ever, for many graduate students, it may have been quite some time since their last course
with any kind of mathematical concepts. We will present each of these ideas assuming no
prior knowledge, and use these basic concepts as an opportunity to learn some statistical
notation as well. These concepts are foundational to our understanding, use of statistical
analyses, and making inferences from the results of our analyses. Many of these concepts
will be used in later chapters in this text and are foundational to all of the analyses we will
learn. We strongly recommend students ensure they are very familiar and comfortable
with the concepts in this chapter to set themselves up for success in the rest of this text.
CENTRAL TENDENCY
Measures of central tendency attempt to describe an entire sample (entire distribution)
with a single number. These are sometimes referred to as point estimates because they
use a single point in a distribution to represent the entire distribution. All of these esti-
mates attempt to find the center of the distribution, which is why they are called central
tendency estimates. We have multiple central tendency estimates because each of them
finds the center differently. Many of the test statistics we learn later in this text will test
for differences in central tendency estimates (for example, differences in the center of
two different groups of participants). The three central tendency estimates we will review
are the mean, median, and mode.
Mean
The most frequently used measure of central tendency is the mean. You likely learned
this concept at some point as an average. The mean is a central tendency estimate that
determines the middle of a distribution by balancing deviation from the mean on both
sides. Another way of thinking of the mean is that it is a sort of balance point. It might
not be in the literal middle of a distribution, but it is a balance point. One way to visualize
how the mean works is to think of a plank balanced on a point. In the below example,
there are two objects (or cases) on the plank, and they’re equally far from the point, so
the plank is balanced.
Finding a balance point gets trickier when we add more cases, though. In the below
example, if we add one case on the far right side of the plank, we have to move the bal-
ance point further right to keep the plank from tipping over. So, although the balance
point is no longer in the middle of the plank, it’s still the balance point.
The mean works in this way—by balancing the distance from the cases on each side of
the mean, it shows the “center” of the distribution as the point at which both sides are
balanced.
To calculate the mean, we add all the scores (∑X) and divide that sum by the total
number of the scores (N), as shown in the formula below:
X̄ = ∑X / N
In this formula, we have some new and potentially unfamiliar notation. The mean score
is shown as X̄, an X with a bar over it. This is a common way to see the mean written. The mean will also some-
times be written as M. In this case, the letter X stands in for a variable. If we had multiple
variables, they might be X, Y, and Z, for example. The Greek character in the numerator
is sigma (∑), which means “sum of.” So the numerator is read as “the sum of X,” meaning
we will add up all the scores for this variable. Finally, N stands in for the number of cases
or the sample size. The entire formula, then, is that the mean of X is equal to the sum of
all scores on X divided by the number of scores (or sample size).
Let us calculate the mean for some hypothetical example data. Suppose you have
eighth-grade mathematics scores from eight students, and their scores are: 3, 6, 10, 5, 8,
6, 9, and 4. To calculate the mean, we can use the formula above:
X̄ = ∑X / N = (3 + 6 + 10 + 5 + 8 + 6 + 9 + 4) / 8 = 51 / 8 = 6.375
One particular feature of the mean is that it is sensitive to extreme cases. Those cases are
sometimes called outliers because they fall well outside the range of the other scores. For
example, imagine that in the above example, we had a ninth student whose score was 25.
What happens to the mean?
X̄ = ∑X / N = (3 + 6 + 10 + 5 + 8 + 6 + 9 + 4 + 25) / 9 = 76 / 9 = 8.444
This one extreme case, or outlier, shifts the mean by quite a bit. If we had several of these
extreme values, we would see even more shift in the mean. For this reason, the mean is
not a good estimate when there are extreme cases or outliers.
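These hand calculations are easy to verify; a quick sketch using Python's standard library:

    from statistics import mean

    scores = [3, 6, 10, 5, 8, 6, 9, 4]
    print(mean(scores))          # 6.375

    # Adding one extreme case (outlier) pulls the mean upward.
    print(mean(scores + [25]))   # 8.444...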
Median
In cases where the mean might not be the best estimate, researchers will sometimes refer
to the median instead. The median is the physical, literal middle of the distribution. It is
the middle score from a set of scores. There is no real formula for the median. If we rank
order the scores and find the middle score, that score is the median. For example, in our
hypothetical eight students above, we could rank order their scores, and find the middle
(the two middle scores are marked in brackets):
3, 4, 5, [6, 6], 8, 9, 10
In this case, we don’t have a true middle score, because there is an even number of scores.
When there is an even number of scores, we take the average of the two middle scores as
the median. So, in this case, the median will be:
(6 + 6) / 2 = 12 / 2 = 6
In the second example from above, where we had one outlier, we could rank order the
scores and find the middle score (marked in brackets):
3, 4, 5, 6, [6], 8, 9, 10, 25
In this case, we have nine scores, so there is a single middle score, making the median
6. Comparing our medians to the means, we see that, with no outliers, the median and
mean are closely aligned. We also see that adding the outlier does not result in any move-
ment at all in the median. While the mean moved by 2.444 with the addition of the
outlier, the median did not move. So, we can see in these examples that the median is less
sensitive to extreme cases and outliers.
The median is also equal to the 50th percentile. That means that 50% of all scores
fall below the median. That’s true because it’s the middle score, so half of the scores are
above, and half are below the median. We return to percentiles in future chapters, but it
is helpful to know that the median is always equal to the 50th percentile.
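A quick check of the median in Python (standard library only):

    from statistics import median

    scores = [3, 6, 10, 5, 8, 6, 9, 4]
    print(median(scores))         # 6.0, the average of the two middle scores

    # With the outlier added there are nine scores, so the median
    # is the single middle (fifth) score -- still 6.
    print(median(scores + [25]))  # 6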
Mode
The mode is another way of finding the center of the distribution. However, the mode
defines the center as being the most frequently occurring score. In other words, the score
that is the most common is the mode. There is no formula for the mode either. We sim-
ply find the most frequently occurring score or scores. Because the mode is the most
frequently occurring score, there can be multiple modes. There might be more than one
score that occurs the same number of times, and no score occurs more frequently. We
call those distributions bimodal when there are two modes or multimodal when there
are more than two modes. Note that most software, including jamovi, will return the
lowest mode when there is more than one. In our above example:
3, 4, 5, 6, 6, 8, 9, 10
Only one value occurred more than once, which was 6. So the mode is 6. If we add the
outlier score of 25, the mode remains 6 (it is still the value that occurs most often). The
mode, then, is also more resilient to outliers and extreme values.
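In Python, the standard library's multimode function reports every mode, which makes bimodal and multimodal distributions easy to spot (jamovi, as noted above, reports only the lowest mode):

    from statistics import multimode

    scores = [3, 6, 10, 5, 8, 6, 9, 4]
    print(multimode(scores))          # [6]

    # The outlier does not change the most frequent score.
    print(multimode(scores + [25]))   # [6]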
VARIABILITY
So far, we’ve described central tendency estimates and emphasized that the mean is most
often used. We also described central tendency estimates as point estimates. Point esti-
mates attempt to describe an entire distribution by locating a single point in the center
of that distribution. However, there is another kind of estimate we can use to understand
more about the shape and size of a distribution: variability estimates. These are range,
rather than point, estimates as they give a sense of how wide the distribution is, and
where most scores are located within a distribution. We will explore three estimates of
variability: range, variance, and standard deviation.
Range
Range is the simplest measure of variability to compute and tells us the total size of a
distribution. It is the difference between the highest score and lowest score in the distri-
bution. That means the range is an expression of how far apart lowest and highest values
fall. It can be calculated as:
Range = highest score − lowest score
From our example above, without any outliers, the highest score was 10, and the lowest
score was 3. Thus, the range is 10 − 3 = 7. If we add the outlier of 25, then the range is
25 − 3 = 22. Because the range is based on the most extreme scores, it tends to be unsta-
ble and is highly influenced by outliers. It also offers us very little information about the
distribution. We have no sense, from range alone, about where most scores tend to fall
or what the shape of the distribution might be. Because of that, we usually rely on other
variability estimates to better describe the distribution.
Variance
Variance is based on the deviation of scores in the distribution from the mean. It meas-
ures the amount of distance between the mean and the scores in the distribution. Vari-
ance reflects the dispersion, spread, or scatter of scores around the mean. It is defined as
the average squared deviation of scores around their mean, often notated as s2. Variance
is calculated using the following formula:
s² = ∑(X − X̄)² / (N − 1)
We will walk through this formula in steps. In the numerator, starting inside the paren-
theses, we have deviation scores. We will take each score and subtract the mean from it.
Those deviation scores are the deviation of each score from the mean. Next, we square
the deviation scores, resulting in squared deviation scores. The final step for the numer-
ator is to add up all the squared deviation scores, which gives the sum of the squared
deviation scores. That numerator calculation is also sometimes called the sum of squares
for short. The concept of the sum of squares carries across many of the statistical tests
covered later in this book, and it’s a good idea to get familiar and comfortable with it
now. Finally, we divide the sum of squares by the sample size minus one.1
In the table below, we show how to calculate the variance for our example data from
above. We have broken the process of calculating variance down into steps, which are
presented across the columns. For the original sample (without any outliers), where the
mean was 6.375:
X      X − X̄      (X − X̄)²
3 −3.375 11.391
6 −0.375 0.141
10 3.625 13.141
5 −1.375 1.891
8 1.625 2.641
6 −0.375 0.141
9 2.625 6.891
4 −2.375 5.641
∑ = 41.878
s² = ∑(X − X̄)² / (N − 1) = 41.878 / (8 − 1) = 41.878 / 7 = 5.983
If we add in the outlier score of 25, where the mean was 8.444:
X      X − X̄      (X − X̄)²
3 −5.444 29.637
6 −2.444 5.973
10 1.556 2.421
5 −3.444 11.861
8 −0.444 0.197
6 −2.444 5.973
9 0.556 0.309
4 −4.444 19.749
25 16.556 274.101
∑ = 350.221
s² = ∑(X − X̄)² / (N − 1) = 350.221 / (9 − 1) = 350.221 / 8 = 43.778
In these examples, we can see that as the scores become more spread out, variance gets
bigger. One challenge with variance is that it is difficult to interpret. We know that a
variance of 43.778 indicates a wider dispersion around the mean than a variance of 5.983,
but we have no real sense of where the scores are around the mean. While most of our
statistical tests will use variance (or the sum of squares) as a key component, it is difficult
to interpret directly, so most often researchers will report standard deviation, which is
more directly interpretable.
Standard deviation
Standard deviation is not exactly a different statistic than variance—it is actually a way of
converting variance to make it more easily interpretable and to standardize it. Standard
deviation is often notated as s, though it is sometimes also written as SD, which is simply
an abbreviation. The formula for standard deviation is:
s = √s² = √( ∑(X − X̄)² / (N − 1) )
Standard deviation is s, and variance is s2, so we can convert variance to standard devi-
ation by taking the square root. In other words, the square root of variance is standard
deviation. From our examples above, the standard deviation of the scores without any
outliers is:
s = √s² = √5.983 = 2.446
Interpreting standard deviation
One major advantage of standard deviation is that it is directly interpretable using some
simple rules. We’ll explain two sets of rules for interpreting standard deviation next. The first set, often called the empirical rule, applies when the data are approximately normally distributed:
• About 68% of all scores will fall within ±1 standard deviation of the mean.
• About 95% of all scores will fall within ±2 standard deviations of the mean.
• More than 99% of all scores will fall within ±3 standard deviations of the mean.
For example, in our data without outliers, the mean was 6.375 with a standard devi-
ation of 2.446 (M = 6.375, SD = 2.446). Based on that, we could expect to find about
68% of the scores between 3.929 and 8.821. To get those numbers, we take the mean and
subtract 2.446 for the lower number and add 2.446 for the higher number. We could add
and subtract the standard deviation a second time to get the 95% range. Based on that,
we’d find that about 95% of the scores should fall between 1.483 and 11.267. It is worth
noting that the interpretation of standard deviation gets a bit cleaner and more realistic
in larger samples.
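As a check on these calculations, here is a small Python sketch using the standard library (statistics.variance and statistics.stdev use the sample formula, with N − 1 in the denominator):

    from statistics import mean, variance, stdev

    scores = [3, 6, 10, 5, 8, 6, 9, 4]

    m = mean(scores)       # 6.375
    s2 = variance(scores)  # about 5.982 (5.983 above came from rounded deviations)
    s = stdev(scores)      # about 2.446, the square root of the variance

    # Approximate 68% and 95% ranges under the empirical rule.
    print(m - s, m + s)          # about 3.93 to 8.82
    print(m - 2 * s, m + 2 * s)  # about 1.48 to 11.27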
What if the data were non-normal? In that case, we can use Chebyshev’s rule to inter-
pret the standard deviation. In this rule:
• At least 3/4 of the data are within ± 2 standard deviations of the mean.
• At least 8/9 of the data are within ± 3 standard deviations of the mean.
It is worth noting that, because the denominator for variance includes the sample size (and
standard deviation is derived from variance), estimates of variance and standard deviation
become more stable as samples get larger, all else being equal. In other words, as sample sizes get
bigger, we expect less sampling error in our variance and standard deviation estimates. This becomes an
important point in some later analyses that compare groups, as it is one reason to prefer
roughly equal group sizes. But it also means that our ranges (like the 95% range) based
on standard deviation become more precise and meaningful as the sample size increases.
The interpretation does not change in a larger sample, but that estimation is going to be
more precise.
VISUAL DISPLAYS OF DATA
One simple way to display a distribution is a frequency table, which lists each possible score alongside the number of cases at that score. For our example data:
Score Frequency
1 0
2 0
3 1
4 1
5 1
6 2
7 0
8 1
9 1
10 1
In a small sample, like the one with which we are working, it can be useful to categorize
the scores in some way. For example, perhaps we might split our scores into 1–2, 3–4,
5–6, 7–8, and 9–10:
Score Frequency
1–2 0
3–4 2
5–6 3
7–8 1
9–10 2
This kind of collapsing of values into categories will probably be unnecessary in larger
samples, where we are likely to have multiple participants at every score; in the case of a
small sample, however, it can help us visualize the distribution more easily. A frequency
table is the simplest way to display the data. Sometimes, frequency tables have additional
columns. For example, in jamovi, the software produces frequency tables that have a
column for the percentage of cases for each category/score as well.
The second kind of visual display we’ll introduce here is a histogram. Histograms take
the information from a frequency table and turn it into a graph. A histogram is essen-
tially a bar graph with no spaces between the bars. Across the horizontal, or X, axis will
be the scores or categories, and the vertical, or Y, axis will have the frequencies. For our
example, the histogram would show one bar for each score category, with the height of each bar equal to that category’s frequency from the table above.
The histogram allows us to better visualize the shape of the distribution and how scores
are distributed. Histograms are very commonly used in all kinds of research and are
easily produced with jamovi and other software. One way we often use histograms is to
visually inspect a distribution to determine if it is approximately normal.
Skew
Skewed distributions are asymmetrical, so they will have one long tail and one short
tail. Because they’re asymmetrical, skewed distributions will have a mean and median
that are spread apart. The mean will be pulled a little way toward the long tail, while the
median will be closer to the high point in the histogram. One way to think about skew
is to think of the peak in the histogram as being pushed toward one side. As is clear in the
figure below, in a negatively skewed distribution, the long tail will be on the left, and in a
positively skewed distribution, the long tail will be on the right.
There is a statistic for evaluating skew, which jamovi labels “skewness.” Later in this chap-
ter, we will walk through how to produce all the statistics in the chapter using jamovi.
We will not review how the skewness statistic is calculated for the purposes of this book.
However, skewness is only interpretable in the context of the standard error of skewness.
When the absolute value of skewness (that is, ignoring whether the skewness statistic
is positive or negative) is less than two times the standard error of skewness, the distri-
bution is normal. If the absolute value of skewness is more than two times the standard
error of skewness, the distribution is skewed. If the distribution is skewed and the skew-
ness statistic is positive, then the distribution is positively skewed. If the distribution is
skewed and the skewness statistic is negative, then the distribution is negatively skewed.
For example, if we find that skewness = 1.000 and SEskewness = 1.500, then we can con-
clude the distribution is normal. Two times 1.500 is 3.000, and 1.000 is less than 3.000,
so the distribution is normal. However, if we find that skewness = −2.500 and SEskewness
= 1.000, then we can conclude the distribution is negatively skewed. Two times 1.000
is 2.000, and 2.500 (the absolute value of −2.500) is more than 2.000, so we know the
distribution is skewed. The skewness statistic is negative, so we know the distribution is
negatively skewed.
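The decision rule is mechanical enough to express as a small helper function; here is a sketch in Python, assuming you already have the skewness statistic and its standard error (for example, from jamovi's Descriptives output):

    def skew_interpretation(skewness, se_skewness):
        """Apply the rule: compare |skewness| to two standard errors."""
        if abs(skewness) < 2 * se_skewness:
            return "approximately normal"
        return "positively skewed" if skewness > 0 else "negatively skewed"

    print(skew_interpretation(1.000, 1.500))   # approximately normal
    print(skew_interpretation(-2.500, 1.000))  # negatively skewed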
Kurtosis
The other way that distributions can deviate from normality is kurtosis. While skew
measures if the distribution is shifted to the left or right, kurtosis measures if the peak
of the distribution is too high or too low. There are two kinds of kurtosis we might
find. Leptokurtosis occurs when the peak of the histogram is too high, indicating there
are a disproportionate number of cases clustered around the median. Platykurtosis
occurs when the peak is not high enough (the histogram is too flat), indicating too few
cases are clustered around the median. The figure below shows how these distributions
might look.
(Figure: leptokurtic, mesokurtic, and platykurtic distributions.)
There is also a statistic we can use to evaluate kurtosis, which in jamovi is labeled, simply,
kurtosis. Like the skewness statistic, it is interpreted in the context of its standard error.
In fact, the interpretive rules are essentially the same. If the absolute value of kurtosis
is less than two times the standard error of kurtosis, the distribution is normal. If the
absolute value of kurtosis is more than two times the standard error of kurtosis and the
kurtosis statistic is positive, the distribution is leptokurtic. If it is negative, the distribu-
tion is platykurtic. In kurtosis, when the distribution is normal it may be referred to as
mesokurtic. In other words, a normal distribution demonstrates mesokurtosis. There is
no similar term for normal skewness.
A distribution can have skew, kurtosis, both, or neither. A normal distribution will
not have skew or kurtosis. Non-normal distributions might be non-normal due to skew,
kurtosis, or both. It is fairly common, though, for skew and kurtosis to occur together.
There is a tendency for data showing a strong skew also to be leptokurtic. That pattern
makes sense because if we push scores toward one end of the scale, the likelihood the
scores will pile up too high is strong.
STANDARD SCORES
One application of the normal distribution is in the calculation of standard scores.
Standard scores are also commonly referred to as z-scores. We can convert any score to
a z-score based on the mean and standard deviation of the sample or distribution
from which that score came. These standard scores can solve a number of problems such
as scores with different units of measure and can also be used to calculate percentiles and
proportions of scores within a range. In addition, standard scores always have a mean of
zero and a standard deviation of one.
Calculating z-scores
To calculate standard scores, we use the following formula:
z = (X − X̄) / s
In other words, the standard score (or z-score) is equal to the difference between the
score and the mean, divided by the standard deviation.
One way we can use these standard scores is to compare scores from different scales
or tests that have different units of measure. Imagine that we have scores for students on
a mathematics achievement test, where the mean score is 20, and the standard deviation
is 1.5. We also have scores on a writing achievement test, where the mean score is 35, and
the standard deviation is 4. A student, John, scores an 18 on the mathematics test and a
34 on the writing test. In which subject does John show higher achievement? We cannot
directly compare the two test scores because they are on different scales of measurement.
However, by converting to standard scores, we can directly compare the two test scores:
Mathematics: z = (X − X̄) / s = (18 − 20) / 1.5 = −2 / 1.5 = −1.333
Writing: z = (X − X̄) / s = (34 − 35) / 4 = −1 / 4 = −0.250
Based on these calculations, we can conclude that John had higher achievement in writ-
ing. His z-score for writing was higher than for mathematics, so we know his perfor-
mance was better on that test. Because we know the z-scores have a mean of zero and a
standard deviation of one, we can do this kind of direct comparison.
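The arithmetic is simple to script; a sketch of John's comparison in Python:

    def z_score(x, m, s):
        # z = (X - mean) / standard deviation
        return (x - m) / s

    math_z = z_score(18, 20, 1.5)   # -1.333...
    writing_z = z_score(34, 35, 4)  # -0.25

    # The larger (less negative) z-score marks the relatively
    # higher achievement: writing, in John's case.
    print(math_z, writing_z)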
Calculating percentiles from z
Another use for standard scores is in determining the proportion of scores that would
fall in a given range. Let us imagine a depression scale with a mean of 100 and a standard
deviation of 15, where a higher score indicates more depressive symptoms. What pro-
portion of all participants would we expect to have a score between 90 and 120? We can
answer this with standard scores. We’ll start by calculating the z-score for 90 and for 120:
z = (X − X̄) / s = (90 − 100) / 15 = −10 / 15 = −0.667
z = (X − X̄) / s = (120 − 100) / 15 = 20 / 15 = 1.333
Using those standard scores, we find the percentile for z = −0.667 is 25.14, and for z =
1.333 is 90.86. So, what percent of scores will fall between 90 and 120? We simply sub-
tract (90.86 − 25.14) to find that 65.72% of all depression scores will fall between 90 and
120 on this test.
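If scipy is available, the same proportion can be found from the standard normal distribution directly rather than from a printed table; small differences from the table-based 65.72% are just rounding:

    from scipy.stats import norm

    z_low = (90 - 100) / 15    # -0.667
    z_high = (120 - 100) / 15  # 1.333

    # norm.cdf gives the proportion of the standard normal
    # distribution falling below a given z-score.
    proportion = norm.cdf(z_high) - norm.cdf(z_low)
    print(round(proportion * 100, 2))  # about 65.6% of scores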
This procedure is how we determined what percentage of scores would fall within
±1 and ±2 standard deviations of the mean. At +1 standard deviations, z = 1.000, and
at −1 standard deviation, z = −1.000. Looking at Table A1, we find that the percentiles
would be 84.13 and 15.87, respectively. So, the area between −1 and +1 is 84.13 − 15.87 =
68.26%. This is why we say about 68% of the scores will fall within ±1 standard deviation
of the mean.
CALCULATING CENTRAL TENDENCY, VARIABILITY, AND NORMALITY ESTIMATES IN jamovi
By default, jamovi opens a blank, new data file. We can click “Data” on the top toolbar,
and then “Setup” to set up our variables in the dataset. In that dialogue box, we can
specify the nature of our variables. For this example, we will name the variable “Score” and
specify that it is a continuous variable.
Within the setup menu, there are various options we can set:
• Name. We must name our variables. There are some restrictions on what you
can use in a variable name. They must begin with a letter and cannot contain any
spaces. So, for our purposes, we’ll name the variable Score.
• Description. We can put a longer variable name here that is more descriptive. This
field does not have restrictions on the use of spaces, special characters, etc.
• Level of measurement. There are four radio buttons to select the data’s level of
measurement. For interval or ratio data, we will select “Continuous.” For ordinal
data, select “ordinal,” and for nominal data, select “nominal.”
• Levels. For nominal or ordinal variables, this area can be used to name the groups
or categories after the data are entered. Once data are entered, this field will popu-
late with the numeric codes in the dataset. By clicking on those codes in this area,
you can enter a label for the groups/categories. We will return to this function in a
few chapters and clarify its use.
• Type. In this field, you can change from the default type of integer (which means
a number). Usually the only reason to change this will be if you have text varia-
bles (like free-text responses, names, or other information that cannot be coded
numerically).
We can then click the upward-facing arrow to close the variable setup. Then, we can
enter our data in the spreadsheet below.
Next, we will ask the software to produce the estimates of central tendency, variabil-
ity, and normality. To do so in jamovi, we will click Analyses, then Exploration, then
Descriptives. In the resulting menu, we can click on the variable we wish to analyse (in
this case, Score), and click the arrow button to select it for analysis. We can also check the
box to produce a frequency table, though it will only produce such a table for nominal
or ordinal data.
Then, under Statistics, we can check the boxes for various descriptive statistics we wish to
produce, including mean, median, mode, standard deviation, variance, range, standard
error of the mean, skewness, kurtosis, and other options.
As you select the various options in the analysis settings on the left, the output will
populate on the right. It updates as you change options in the settings, meaning there
is no “run analysis” or other similar button to click—the analyses are running as
we choose them. The descriptive statistics we requested are all in the table under
Descriptives.
Finally, notice that there are references at the bottom of the output, which may be useful
in writing for publication if references to the software or packages are requested.
The estimates in the table are slightly different from our calculations earlier in this
chapter because we consistently rounded to the thousandths place, and jamovi does not
round at all in its calculations. We can also note from this analysis that the distribution
appears to be normal because for skewness, .201 is less than two times .752, and for kur-
tosis, 1.141 (absolute value of −1.141) is less than two times 1.480.
Finally, to save your work: The entire project can be saved as a jamovi project file. In
addition, you can save the data in various formats by clicking File, then Export. Options
include saving as an SPSS file (.sav format), a comma-separated values format (.csv—this
format is widely used and would be compatible with almost any analysis software), or
other formats. To save the output only, you can right click in the output, then go to All,
then Export. This will allow you to save the output as a PDF or HTML format. Notice
that you can also export individual pieces of the analysis. You can also copy and paste any
or all of the analysis in this way.
CONCLUSION
In this chapter, we have explored ways to describe samples using central tendency and
variability estimates. We have also demonstrated how to evaluate whether a sample is
normally distributed, and the properties of the normal distribution. Then we explained
how to convert scores to standard scores (or z-scores) to use the normal distribution for
comparisons, calculating percentiles, and determining proportions of scores in a given
range. Finally, we demonstrated how to calculate most of these estimates using jamovi.
In the next chapter, we will work with these and similar concepts to understand the null
hypothesis significance test.
Note
1 The denominator has N − 1 when calculating the variance of a sample. If we were calculat-
ing variance for a population, the denominator would simply be N. However, researchers
in educational and behavioral research almost never work with population-level data, and
the sample formula will almost always be the correct choice. Some other texts and online
resources, though, will show the formula with N as the denominator, which is because they
are presenting the population formula.
Part II
Null hypothesis significance testing
4
Introducing the null
hypothesis significance test
Variables 59
Independent variables 60
Dependent variables 60
Confounding variables 61
Hypotheses 62
The null hypothesis 62
The alternative hypothesis 62
Overview of probability theory 62
Calculating individual probabilities 63
Probabilities of discrete events 63
Probability distributions 64
The sampling distribution 64
Calculating the sampling distribution 65
Central limit theorem and sampling distributions 67
Null hypothesis significance testing 68
Understanding the logic of NHST 68
Type I error 70
Type II error 70
Limitations of NHST 70
Looking ahead at one-sample tests 71
Notes 71
In the previous chapters, we have explored fundamental ideas and concepts in educa-
tional research, sampling methods and issues, and basic educational statistics. In this
chapter, we will work toward applying those concepts in statistical tests. The purposes
of this chapter are to introduce types of variables that might be part of a statistical test,
to introduce types of hypotheses, to give an overview of probability theory, discuss sam-
pling distributions, and finally to explore how those concepts are used in null hypothesis
significance testing.
VARIABLES
There are several types of variables that you might encounter in designing research or
reading about completed research. We will briefly define each and give some examples
of what sorts of things might fit in each category. All of the research designs in this text
require at least one independent variable and one dependent variable. However, some
tests can also include mediating, moderating, and confounding variables.
Independent variables
In the simplest terms, an independent variable is the variable we suspect is driving or
causing differences in outcomes. We have to be very cautious here because claiming a
variable causes outcomes takes very specific kinds of evidence. However, the logic of an
independent variable is that it would be a potential or possible cause of those outcomes.
The naming of these variables as independent is because the independent variable would
normally be manipulated by the researcher. This is accomplished through random assign-
ment, which we described in a previous chapter. By randomly assigning participants to
conditions on the independent variable, we make it independent of other variables like
demographic factors or prior experiences. Because it has been randomly assigned (and
was manipulated by the researchers), the only systematic difference between groups is the
independent variable. Examples of independent variables might be things like treatment
type (randomly assigned by researchers), the type of assignment a student completes
(again, randomly assigned by researchers), or other experimentally manipulated variables.
In a lot of educational research scenarios, though, random assignment is not possible,
is impractical, or is unethical. Often, researchers are interested in studying how out-
comes differ based on group memberships that cannot be experimentally manipulated.
For example, when we study racialized achievement gaps, it is impossible to assign race
randomly. If we want to study differences in outcomes between online and traditional
face-to-face courses, we normally cannot randomly assign students, as they self-select
the type of course they want to take. In these cases, we might still treat those things (race,
class type) as independent variables, even though they are not true independent varia-
bles because they have not been randomly assigned. In those cases, some researchers will
refer to these kinds of variables as pseudo-independent or quasi-independent variables.
Dependent variables
If the independent variable is the variable we suspect is driving or causing differences
in outcomes, the dependent variable is the outcome we are measuring. It is called the
dependent variable because we believe scores on this variable depend on the independent
variable. For example, if a researcher is studying reading achievement test score differ-
ences by race, the achievement test score is the dependent variable. It is possible to have
more than one dependent variable as well. In general, the tests in this text will allow only
one dependent variable at a time, but there are other more advanced analyses (called
multivariate tests) that will handle multiple dependent variables simultaneously. One
method that can help identify the independent versus dependent variable is to diagram
what variables might be leading to, influencing, driving, or causing the other variable.
For example, if a researcher models their variables like this:
Class type → Final exam scores
This diagram shows that the researcher believes class type influences or leads to differ-
ences in final exam scores. So, class type is the independent (or pseudo-independent)
variable, and final exam scores are the dependent variable. In this kind of diagram, the
variable the arrow points away from is independent, and the variable the arrow points
toward is dependent.
Confounding variables
Another kind of variable to consider are those that make a difference in the dependent
variable other than the independent variable. In other words—variables that change the
outcome other than the independent variable. There are a few ways this can happen,
but these variables are generally known as confounding variables. They are called con-
founding because they “confound” the relationship between independent and dependent
variables. Confounding variables might also be unmeasured. It could be that there are
variables that change the outcome that we have not considered or measured in our
research design, which would create confounded results. Ideally, though, we would iden-
tify and measure potential confounding variables as part of the design of the study.
One issue confounding variables create is what is known as the third variable prob-
lem. The third variable problem is the fact that just because our independent variable
and dependent variable vary together does not mean that one causes the other. It might
be that some other variable (the “third variable”) actually causes both. For example,
there is a strong, consistent relationship between ice cream consumption and deaths by
drowning. Does eating ice cream cause death by drowning? We might intuitively suspect
that this is not the case. However, how can we explain the fact that these two variables
co-vary? In this case, a third variable explains both: summer heat. When it gets hot out-
side, people become more likely to do two things: eat cold foods like ice cream and go
swimming to cool down. The more people swim, the more drowning deaths are likely
to occur. So, the relationship between ice cream consumption and drowning deaths is
not causal—it is an example of the third variable problem. Often in applied educational
research, the situation will not be so clear. There might be some logical reason to suspect
a causal relationship. However, good research design will involve identifying, measuring,
and excluding third-variable problems.
There are also potential confounding variables that serve as mediator or moderator
variables. These variables alter or take up some of the relationship between independent
and dependent variables. Moderators are usually grouping variables (categorical variables,
usually nominal) where the effect of the independent variable on the dependent differs
between groups. For example, we may find that an intervention aimed at increasing the
perceived value of science courses works better for third graders than it does for sixth
graders. There is a relationship between the intervention and perceived value of sci-
ence—but that relationship differs based on group membership (in this case, the grade
in school). Mediator variables are usually continuous (interval or ratio) and explain the
relationship between independent and dependent variable. For example, if our interven-
tion increases perceived value of science courses, we might wonder why it does so. Per-
haps the intervention helps students understand more about science courses (increases
their knowledge of science) and that increased knowledge leads to higher perceived
value for science courses. In that case, knowledge might be a mediator, and we might
find that the intervention increases knowledge, which in turn increases perceived value
(intervention → knowledge → perceived value). In some cases, the mediation is only
partial, meaning that the mediator doesn’t take up all of the relationship between the
independent and dependent variable, but does explain some of that relationship.
HYPOTHESES
In quantitative analysis, we must specify a hypothesis beforehand and then test our
hypotheses using probability analyses. The specific kind of testing used in most (but not
all) quantitative analysis is null hypothesis significance testing (NHST). We will return
to NHST later in this chapter, but first, we will talk about what hypotheses in these kinds
of analyses look like.
In many analyses, the null hypothesis (H0) is that two group means are equal—that there is no difference between group X and group Y:

H0 : X = Y
In other analyses, the null hypothesis might be that there is no relationship between
two variables. In any case, the null hypothesis will always be that there is no meaningful
difference or relationship.
The alternative hypothesis (H1) is that there is a difference—for example, that the two group means are not equal:

H1 : X ≠ Y
Alternative hypotheses can also potentially be directional. We will explore this more
in the next few chapters, but it could be that our hypothesis is that group X will have a
higher mean than group Y. We can specify that directionality in the hypothesis. We’ll
return to this idea in a later chapter and give examples of when it might be appropriate.
We’ll evaluate our hypotheses using NHST. Those tests operate based on probabilities,
and our decision about these hypotheses will be based on probability values. Because of
that, we next briefly review the basics of probability theory.
A probability is a number between zero and one, which we can convert to a percentage by multiplying by 100. So, if an event had a probability of p = .250, then we'd expect that event to occur in
about 25% of cases. Stated another way, there is about a 25% chance of that event occurring.
When we want the probability of two independent events both occurring, we multiply their probabilities: p(A & B) = p(A) * p(B). For example, the probability of rolling a 6 on a standard six-sided die is 1/6 = .167, so the probability that two dice both come up 6 is p(6 & 6) = (.167)(.167) = .028. In other words, when rolling two standard six-sided dice, both would come
up 6 about 3% of the time. From the example about randomly calling names in a class,
what is the probability that the instructor would randomly draw two names, and one
would be a science education student while the second would be a kinesiology student?1
p(science education & kinesiology) = p(science education) * p(kinesiology) = (8/30) *
(7/30) = .267 * .233 = .062. We’ll apply this logic to thinking about the probability of
getting samples with certain combinations of scores.
The other way we can work with discrete events is to ask the probability of getting one
outcome or another. For example, what is the probability of rolling two dice and at least
one of them coming up 6?2 In such a case, the formula is p(A or B) = p(A) + p(B). So, the
probability of rolling two dice and at least one of them coming up 6 is p(6 or 6) = .167
+ .167 = .334. At least one of the two dice will come up 6 about 33% of the time. What
is the probability an instructor will draw two names at random, and at least one will be
either a science education student or a kinesiology student? p(science education or kine-
siology) = p(science education) + p(kinesiology) = .267 + .233 = .500.
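These two rules are easy to verify computationally. Here is a minimal sketch in base R (the language on which jamovi is built); the object names are our own, and the values come from the examples above.

# "And" rule for independent events: multiply the probabilities
p_science <- 8 / 30              # p(science education) = .267
p_kines <- 7 / 30                # p(kinesiology) = .233
p_science * p_kines              # p(science education & kinesiology) = .062

# "Or" rule: add the probabilities
p_science + p_kines              # p(science education or kinesiology) = .500

# Dice examples, where p(6) = 1/6 = .167
p_six <- 1 / 6
p_six * p_six                    # both dice come up 6: about .028
p_six + p_six                    # at least one 6 (see Note 2): about .334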
Probability distributions
We can also combine the probabilities of all possible outcomes into a single table. That
table is called a probability distribution. In a probability distribution, we calculate the
independent probabilities of all possible outcomes and put them in a table. For example:
n    10      5      7      8      Total: 30
p    0.333   0.167   0.233   0.267   Total: 1.000

(Each column is one of the four programs represented in the class; from the earlier example, kinesiology has 7 students and science education has 8.)
So, if we randomly draw a name from the example class of 30 we described before, this
table shows the probability that the student whose name we draw will be in a given
program. Notice that the total of those independent probabilities is 1.000. If we draw a
name, the chances that student will be in one of these four programs are 100%, because
those are the only programs represented in the class.
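As a quick check on the table, a base R sketch (with our own object names) reproduces the distribution:

# Probability distribution for one random draw from the class of 30
n <- c(10, 5, 7, 8)              # students in each of the four programs
p <- n / 30                      # .333 .167 .233 .267
sum(p)                           # 1.000: all possible outcomes accounted for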
[Table: the 16 possible samples of two names, listing the program of Name 1 and the program of Name 2 for each draw.]
In total, there are 16 possible samples we could get by drawing two names at random
from this class of 30. The distribution of those samples is the sampling distribution. In
applied research, we likely have populations that number in the millions, and samples
that number in the hundreds. In that case, the number of possible samples gets much
larger. The number of possible samples also increases when there are more possible out-
comes (for example, in this case, if we had five majors represented in the class, we would
have 25 possible samples).
Now imagine we take a random sample of two students from this population. We will
learn later that our analyses usually require samples of 30 or more, but to keep the
process simpler and easier to follow, we’ll imagine sampling only two. What are the pos-
sible combinations we might get? We could get: 0, 0; 0, 1; 0, 2; 0, 3; 1, 0; 1, 1; 1, 2; 1, 3; 2,
0; 2, 1; 2, 2; 2, 3; 3, 0; 3, 1; 3, 2; 3, 3. Our next question might be: what is the probability
of obtaining each of these samples? We can calculate that, using the probability formula
we discussed earlier. For example, if the probability of the random student having no
referrals is .391, and the probability of the random student having one referral is .261,
what is the probability of drawing two students randomly and the first having zero refer-
rals and the second having one referral? We would calculate this as p(zero & one) =
(p(zero))(p(one)) = (.391)(.261) = .102. So, there is roughly a 10% chance of getting that
particular combination. We can use this same process to determine the probability of all
possible samples:
[Table: the 16 possible samples, with the probability (p) and mean (M) of each.]
Notice that we can also calculate a mean for each sample. Next, we might want to know
the probability of getting a random sample of two from this population with a given
mean. For example, what is the probability of randomly selecting two students, and their
mean number of referrals being 1.0? To calculate this, we will use another formula intro-
duced earlier in this chapter. As we look at the sample means, there are three samples
that have a mean of 1.0 (0, 2; 2, 0; and 1, 1). So, we calculate p(0,2 OR 2,0 OR 1,1) = p(0,2)
+ p(2,0) + p(1,1) = .061 + .061 + .068 = .190. There is about a 19% chance that a random
sample of two from this population will have a mean of 1.0. Below are the calculations
for all possible means from a sample of two:
[Table: each possible sample mean, the samples that produce it, and its probability (p).]
Notice that if we add up all these probabilities, the total is 1.0. This set of possible means
represents all possible outcomes of a random sample of two, so the total of the probabil-
ities will add to 1.0. We can also display these probabilities as a histogram.
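The same bookkeeping can be scripted in base R. In this sketch the object names are ours; the probabilities for zero and one referral come from the text, while the values for two and three referrals (.156 and .192) are backed out from the hand calculations above and should be treated as illustrative.

p_ref <- c("0" = .391, "1" = .261, "2" = .156, "3" = .192)

# All 16 possible samples of two, with the "and" rule giving each probability
samples <- expand.grid(first = 0:3, second = 0:3)
samples$p <- p_ref[as.character(samples$first)] * p_ref[as.character(samples$second)]
samples$M <- (samples$first + samples$second) / 2

# Probability of each possible sample mean, via the "or" rule
aggregate(p ~ M, data = samples, FUN = sum)   # e.g., p(M = 1.0) is about .190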
As the size of the samples making up a sampling distribution increases, several things start to happen. These trends are described by
something called the Central Limit Theorem, and include that:
• As the sample size increases, the sampling distribution will become closer and
closer to a normal distribution.
• As the sample size increases, the mean of the sampling distribution (that is, the
mean of all possible samples, often called the “mean of means”) will become closer
and closer to the population mean.
• Sample sizes over 30 will tend to produce a normal sampling distribution and a
mean of means that approximates the population mean.
The practical importance of this is that, because all the tests included in this book require
normal distributions, the minimum sample size will generally be 30, and we prefer larger
sample sizes.
In practice, it is rarely necessary to hand-calculate a sampling distribution, as we have
done above. Instead, we usually use known or theoretical sampling distributions, like
the z, t, and F distributions we’ll encounter in later chapters. However, those known or
theoretical sampling distributions work on the same mathematical principles. They also
produce the same result: they let us calculate the probability of getting a sample with
certain characteristics (e.g., a sample with a certain mean).
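Because these properties are easiest to appreciate by experiment, here is a small simulation sketch in base R (our own construction, not from the text). It draws repeated samples from a deliberately skewed population and shows the sampling distribution of the mean becoming more normal as the sample size grows.

set.seed(42)
population <- rexp(100000, rate = 1)    # a clearly non-normal (skewed) population

means_n5 <- replicate(10000, mean(sample(population, 5)))
means_n30 <- replicate(10000, mean(sample(population, 30)))

mean(population)                        # population mean...
mean(means_n30)                         # ...closely approximated by the mean of means

hist(means_n5, breaks = 50)             # still visibly skewed
hist(means_n30, breaks = 50)            # much closer to a normal distribution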
Most of the tests covered in this book involve testing the difference between two or more
sample means. However, these tests all work on the same logic—given a sampling distri-
bution, what is the probability of a difference this large?
This part of NHST can initially be confusing. Many students want the NHST to be a
direct test of the null hypothesis. It is not. The NHST is not a direct test of any hypothe-
sis. Instead, the NHST assumes the null hypothesis is true (and thus assumes a sampling
distribution where the mean of means is zero) and tests the probability of the observed
value. It asks the question: if the true difference in the population is zero, how likely is a
difference of this size to occur? Put another way: in a world where there is no difference,
what is the probability of getting a sample with this large of a difference?
In general, in educational research, we set the threshold for deciding to reject the null
hypothesis at p < .050. So, if the probability of the observed difference is less than .050,
we reject the null hypothesis. If it is greater than or equal to .050, we fail to reject the
null hypothesis. Notice that these are the only two options in NHST: rejecting or failing
to reject the null hypothesis. When the probability of the observed outcome occurring
if the true difference were zero (null) is low (less than .050), we conclude that the null
hypothesis is not a good explanation for the data. It is unlikely we would see a difference
this big if there were no real difference, so we decide it is not correct to believe there is
no real difference. If the probability is relatively high (greater than or equal to .050) that
we would observe a difference of this size if the true difference were zero, we conclude
there is not enough evidence to reject the null as a plausible explanation (we fail to reject
the null).
We will return to this logic of the NHST again in the next chapter, where we will have
our first statistical test. For now, it is important to be clear that the NHST tests the proba-
bility of our data occurring if the null were true and is not a direct test of the null or alter-
native hypothesis. It tells us how likely our data are to occur in a world where the null
hypothesis is true. If p = .042, for example, we would expect to find a difference that large
about 4.2% of the time in a world where the null hypothesis was true. Another important
note: our decision to make the cutoff .050 is completely arbitrary. It’s become the norm
in educational research and in social sciences in general, but there is no reason for .050
versus any other number. Many scholars have pointed out that NHST has shortcomings,
in part because we work with an arbitrary cutoff of .050, and in part because testing for
differences greater than zero is a low bar. We know that very few things naturally exist in
states of zero difference, so testing for differences greater than zero means that in a large
enough sample, we almost always find non-zero differences.
One final note about language before we review the types of error that occur in NHST.
It is typical to describe a finding where we reject the null hypothesis as “significant” and
to call it “nonsignificant” when we fail to reject the null hypothesis. Because of this, the
word “significant” takes on a special meaning in quantitative research. In writing about
quantitative research, it is important to avoid using “significant” for anything other than
describing an NHST result. In common speech and writing, the word “significant” can
mean lots of other things, but in quantitative research, it takes on this very particular
meaning. As a general rule, the word “significant” should be followed by reporting a
probability in this kind of writing. In all other cases, default to synonyms like “impor-
tant,” “meaningful,” “substantial,” and “crucial.”
Type I error
As we discussed above, typically in educational and social research, we set the criterion
value at p < .050, and reject the null hypothesis when the probability of our data is
below that threshold. This value is sometimes referred to as α (or alpha), and it repre-
sents our Type I error rate. A Type I error occurs when we reject the null hypothesis
when it was actually correct. In other words, a Type I error is when we conclude there is
a significant difference, but there is no real difference. By setting α = .050, we are setting
the Type I error rate at 5%. We expect to make a Type I error about 5% of the time using
this criterion. Type I error is the more serious kind of error, as this kind of error means
we have claimed a difference that does not really exist. Type I error is also the only kind
of error that we directly control (by setting our criterion probability or α level).
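To make the idea of "setting the Type I error rate" concrete, the following base R sketch (our own construction; it borrows the two-sample t-test introduced in a later chapter) simulates many studies in which the null hypothesis is true and counts how often we would reject it anyway.

set.seed(1)
p_values <- replicate(10000, {
  x <- rnorm(30)                  # two groups drawn from the same population,
  y <- rnorm(30)                  # so the null hypothesis is true
  t.test(x, y)$p.value
})
mean(p_values < .050)             # roughly .05: about 5% false rejections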
Type II error
A Type II error occurs when we conclude there is no significant difference (fail to reject
the null hypothesis), but there is actually a difference. Perhaps the difference is small,
and our test led us to conclude it was too small to be significant. Sometimes researchers
refer to the Type II error rate as being 1 − α, so that when α = .050, the Type II error
rate would be .950. This is a bit misleading because it is not as though we would expect
to make a Type II error 95% of the time. That formula is probably less useful and more
misleading. The way to decrease our chances of a Type II error is not to increase α, after
all. Instead, we protect against Type II error by having sufficiently large sample size and
a robust measurement strategy.
To put it in the simplest terms, there are two kinds of errors we might make in an
NHST: concluding there is a difference when there really is none (a Type I error), and concluding there is no difference when there really is one (a Type II error).
Limitations of NHST
All of the tests presented in the remainder of this textbook are null hypothesis signif-
icance tests—most researchers in educational and social research who do quantitative
work conduct NHST. Still, NHST, and, in particular, the use of p thresholds or criterion
values, have been the subject of much debate. Cohen (1994) suggested that the use of
NHST can lead to false results and overconfidence in questionable results. The American
Statistical Association has long advocated against the use of probability cutoffs (Wass-
erstein & Lazar, 2016), and has continued to push for more use of measures of magni-
tude and effect size. In response, many educational and social researchers have become
more critical of the use of probability values and NHST. The approach is deeply limited:
NHST assumes a zero difference, does not produce a direct test of the null or alterna-
tive hypothesis, and (as we will discover in future chapters) p can be artificially reduced
simply by increasing sample size. In other words, NHST is easily misinterpreted and can
also be “gamed.” As a result, we advocate in this text, as many publishers and journals
do, for the interpretation of p values alongside other measures of magnitude and effect
size. In fact, the APA (2020) publication manual is explicit in requiring NHST be paired
with measures of effect size or magnitude. With each NHST, we will also use at least
one such effect size indicator, and we will discover that not all significant differences are
meaningful.
Notes
1 For our purposes, all examples are given using sampling with replacement. We made this
choice because the sampling distribution logic, to which this section builds, uses sampling
with replacement calculations. It’s reasonable to think, though, about the classroom example
with a sampling without replacement calculation. When we draw the first name, if we don’t
replace that name before drawing the second name, the sample space would be reduced by
one. As a result, for our second probability calculation, we would calculate out of a sample
space of 29. However, for our purposes, we will always assume sampling with replacement,
so such adjustments are not needed.
2 For our purposes, we’re calculating “or” probabilities, including the probability of both
events occurring. In other words, the probability we calculate for rolling a 6 on at least one of
the two dice includes the probability that both dice will come up 6. So, we’ve phrased this as
getting 6 on at least one of the two rolls.
5
Comparing a single sample to the
population using the one-sample
Z-test and one-sample t-test
In the previous chapter, we explored the basics of probability theory and introduced null
hypothesis significance testing (NHST). In this chapter, we will go further with NHST
and learn two one-sample NHSTs. Neither of these one-sample tests are especially com-
mon in applied research. We teach them here as they are easier entry points to learning
NHST that help us build toward more commonly used tests. There are, though, prac-
tical uses for each of these two tests, and we will present realistic scenarios for each.
One-sample tests, in general, compare a sample to the population or compare a sample
to a criterion value. That is also why they are less common in applied work: researchers
rarely have access to population statistics and usually compare two or more
samples. However, in the event that a researcher had population statistics or criterion
values, the one-sample tests could be used for that comparison.
H0 : M ≤ μ
H1 : M > μ
These directional hypotheses (hypotheses that specify a particular direction for the dif-
ference) are called one-tailed tests. That is because, by specifying a direction of the differ-
ence, we are only testing in the positive or the negative “tail” of the Z distribution.
Design considerations
This design is fairly rare in applied research. The reason is that the test requires that we
know the population mean and population standard deviation. This is fairly uncom-
mon—we almost always know or can calculate the sample mean and standard deviation.
But it’s trickier to do that for a population. In fact, much of the work that is done in
quantitative research is done because the population data are inaccessible. We very rarely
have access to the full population, which we would need in order to know its standard
deviation. But in situations where the population standard deviation is known, we can
calculate this test to compare a sample to the population.
Again, this test is not used frequently in applied research, but we will imagine a scenario
where it might be. Imagine that we gain, from a college entrance exam company, informa-
tion about all test-takers in a given year. They might report that, out of everyone who took
the college entrance exam that year, the mean was 21.00, with a standard deviation of 5.40.
The test scores range from 1 to 36. Imagine that we work in the institutional research office
for a small university that this year admitted 20 students. Those 20 admitted students had
an average entrance exam score of 23.45, with a standard deviation of 6.21. The university
might be interested in knowing if its group of incoming students had higher scores than
the population of test-takers. In this case, the university hopes to make a claim of a high-
er-than-normal test score for the purposes of marketing. Of course, there is considerable
debate among researchers about the value and importance of those scores, but it is also
true that many universities use them as a way to market the “quality” of their students.
• Random sampling. The test assumes that the sample has been randomly drawn
from the population. As we’ve already explored in this text, this assumption is
rarely, if ever, met because true random sampling is nearly impossible to perform.
But if we hope to generalize about a result, then the adequacy of sampling methods
is important. For example, in the scenario we described above, generalizability is
of limited importance. In fact, the university in our scenario hopes to demonstrate
their students are not representative of the population. But in other cases, it might
be that the researchers need their sample to be representative of the population so
that they can generalize their results.
• Dependent variable at the interval or ratio level. The dependent variable must be
continuous. That means it needs to be measured at either the interval or ratio level.
In our scenario, entrance exam scores are measured at the interval level (the same
interval between each point, with no true absolute zero).
• Dependent variable is normally distributed. The dependent variable should also
be normally distributed. We have introduced the concept of normality, and how to
assess it using skewness and kurtosis statistics. In future chapters, we will practice
evaluating this assumption for each test, but in this chapter, we will focus on the
mechanics of the test itself.
Z = (M − μ) / (σ / √N)

The numerator is the mean difference, and the denominator is the standard error of the
population mean. In our example, the sample mean (M) was 23.45, the population mean (μ)
was 21.00, the population standard deviation (σ) was 5.40, and the sample size (N) was 20.
So we can calculate Z for our example as follows:

Z = (23.450 − 21.000) / (5.400 / √20) = 2.450 / (5.400 / 4.472) = 2.450 / 1.208 = 2.028
So, for our example, Z = 2.028. We can use that statistic to determine the associated
probability. In Table A1, the third column shows the one-tailed probability for each value
of Z. At Z = 2.02, p = .022. (Note that we always round the test statistic down when comparing to the table,
because rounding it up would make p appear smaller than it really is, which raises the risk of Type I error.) That is
the one-tailed probability, which in this scenario is what we need because the hypothesis
was directional. If it had not been a directional hypothesis, though, we would simply
double that probability, so the two-tailed probability would have been p = .044. Finally,
because p < .050 (.022 < .050), we reject the null hypothesis and conclude there is a sig-
nificant difference between these 20 incoming students and all test-takers nationwide.
For the one-sample Z-test, Cohen's d is calculated as:

d = (M − μ) / σ

Note that this formula only works for the one-sample Z-test. Each test has its own for-
mula for effect size estimates. So, for our example:

d = (23.450 − 21.000) / 5.400 = 2.450 / 5.400 = 0.454
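The same Z and d calculations can be scripted in base R; this sketch uses our own variable names and the values from the entrance-exam scenario.

M <- 23.45; mu <- 21.00; sigma <- 5.40; N <- 20

Z <- (M - mu) / (sigma / sqrt(N))   # 2.028
1 - pnorm(Z)                        # one-tailed p, about .021 (tabled: .022 at Z = 2.02)

(M - mu) / sigma                    # Cohen's d = .454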
Cohen’s d is not bounded to any range—so the statistic can be any number. However,
in cases where d is negative, researchers typically report the absolute value (drop the
negative sign). It can also be above 1.0 or even 2.0, though those cases are relatively
uncommon. There are some general interpretative guidelines offered by Cohen (1977),
which suggest that a d of about .2 is a small effect, .5 a medium effect, and .8 a large
effect. It is important to know, though, that Cohen suggested those as start-
ing points to think about, not as any kind of rule or cutoff value. In fact, the best way to
interpret effect size estimates is in context with the prior research in an area. Is this effect
size larger or smaller than what others have found in this area of research? In our case,
we will likely call our effect size medium (d = .454).
Sometimes, researchers find extremely small effect sizes on significant differences.
That is especially likely if the sample size is very large. In those cases, researchers might
describe a difference as being significant but not meaningful. That is, a difference can be
more than zero (a significant difference) but not big enough for researchers to pay much
attention to it (not meaningful). One of the questions to ask about effect size is whether
a difference is large enough to care about it.
Finally, a note about the language of effect size: although researchers use language like
“medium effect” or “large effect,” they mean that there is a statistical effect of a given size.
This should not be confused with making a cause-effect kind of claim, which requires
other evidence, as we’ve discussed in this text elsewhere.
When using a statistical table, researchers compare the calculated test statistic to the tabled critical value; they often call this "beating the table"—if the calculated test statistic "beats" the table value,
then they reject the null hypothesis.
Design considerations
The design considerations are essentially the same for this design as compared with the
one-sample Z-test. However, there is one situation where this test can be used while Z
cannot. Because the t-test does not require population standard deviation, it can be used
to test a sample against some criterion or comparison value. For example, it could be
used to test if a group of students has significantly exceeded some minimum required
test score. It could also be used to compare against populations with a known mean but
not a known standard deviation.
t = (M − μ) / (s / √N)
Imagine that in our earlier example, we had not known the population standard devia-
tion. We could then use the t-test to compare our sample of students to all test-takers as
follows:

t = (23.450 − 21.000) / (6.210 / √20) = 2.450 / (6.210 / 4.472) = 2.450 / 1.389 = 1.764
Because the sample size was 20, there will be 19 degrees of freedom (N − 1 = 20 − 1 =
19). On the t table, we find that at 19 degrees of freedom, the critical value for a one-
tailed test is 1.73. Remember that our test is one-tailed because we specified a directional
hypothesis (that this group of students would have higher test scores than the popula-
tion of test-takers). Because our calculated value is higher than the tabled value (1.764
> 1.73), we know that p < .05, so we reject the null hypothesis and conclude there was a
significant difference.
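Here is the same one-sample t-test in base R (a sketch with our own names; given raw scores, the built-in t.test() function with mu = 21 and alternative = "greater" would produce the test directly).

M <- 23.45; mu <- 21.00; s <- 6.21; N <- 20

t_stat <- (M - mu) / (s / sqrt(N))  # 1.764
df <- N - 1                         # 19
1 - pt(t_stat, df)                  # one-tailed p, about .047, so p < .05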
d = (M − μ) / s

This simply replaces the population standard deviation with the sample standard deviation,
which is the same substitution as in the t-test formula. So, for our example:

d = (23.450 − 21.000) / 6.210 = 2.450 / 6.210 = 0.395
Notice this is a slightly smaller effect size estimate than the one from the Z-test. This is com-
mon, as the standard deviation of the sample will usually be larger than that of the population.
CONCLUSION
Although neither of the tests we introduced in this chapter are very common in applied
research, they work the same way, conceptually, as all of the other tests we will explore in
this text. In all cases, these tests will take some form of between-groups variation (here,
it was the difference between the population and sample means) over the within-groups
variation or error (here, the standard error of the mean). How exactly we calculate those
two terms will change with each new design, but the basic idea remains the same. In our
next chapter, we will explore one of the most widely used statistical tests in educational
and psychological research: the independent samples t-test.
Part III
Between-subjects designs
6
Comparing two sample means
achievement in each of those versions of the class. Perhaps the format of the course
(online versus face-to-face) makes a difference in how well students learn (as measured
by a final exam). While this scenario presents several design limitations (which we will
discuss later in this chapter), the instructor could use an independent samples t-test to
evaluate whether students in the two versions of the course differ in their exam scores.
A more typical example for the independent samples t-test involves an experimental
group and a control group, compared on some relevant outcome. Imagine that an educa-
tional consultant comes to a school, advertising that their new video-based modules can
improve mathematics exam performance dramatically. The system involves assigning
students to watch interactive video modules at home before coming to class. To test the
consultant’s claims, we might randomly assign students to either complete these new
video modules at home or to spend a similar amount of time completing mathematics
worksheets at home. After a few weeks, we could give a mathematics exam to the stu-
dents and compare their results using an independent samples t-test.
The t distribution
As we discovered when we explored the one-sample t-test in the previous chapter, t-test
values are distributed according to the t distribution. The t distribution is a sampling
distribution for t-test values and allows precise calculation of the probability associated
with any given t-test value. We also described how the shape of the t distribution changes
based on the number of degrees of freedom. When we explored the one-sample t-test,
we said that there would always be n − 1 degrees of freedom. In the independent samples
t-test, there will be n − 2 degrees of freedom. As with the one-sample test, we use the t
distribution table to look up the critical value at a given number of degrees of freedom
and alpha or Type I error level (usually .05 for social and educational research). If the
absolute value of our t-test value exceeds the critical value, then p < .05 and we can reject
the null hypothesis. Of course, it is also possible to calculate the exact probability, or p,
values, and software like jamovi will produce the exact probability value. It is typical to
report the exact probability value when writing up the results.
Returning to our first example, students typically self-select into one version of the course or the other: some might choose the online version, while others might choose the face-to-face version of the class. Because of this self-selection, multiple
differences between the groups are built in from the start and cannot be attributed to the
course delivery mode. Students in the online class might be older, more likely to have
outside employment, and more likely to have multiple demands on their time. Students
in the face-to-face class might be younger (meaning less time has elapsed since prior
coursework, perhaps), have fewer employment and family obligations, and generally
have more free time to devote to coursework. If those differences exist, then a difference
in achievement cannot be fully attributed to the mode of course delivery.
In this case, we have an example where we cannot randomly assign participants to
groups. The groups are pre-existing, in this case by self-selection. Of course, many other
kinds of groups are intact groups we cannot randomly assign, like gender, in-state versus
out-of-state students, free-and-reduced-lunch status, and many others. A lot of those
intact categories are of interest for educational researchers. But because of the design
limitations inherent with intact groups, the inferences we can draw from such a com-
parison are limited. Specifically, we will not be able to claim causation (e.g., that online
courses cause lower achievement). We will only be able to claim association (e.g., that
online courses are associated with lower achievement). The distinction is important and
carries different implications for educational policy and practice.
In our second example described above, we have randomly assigned students to groups
(either to do video modules or the traditional worksheets). It is important to note that
the control group, in this case, is still being asked to do some sort of activity (in this case,
worksheets). That’s important because it could be that merely spending an hour a day
thinking about mathematics improves achievement and that it has nothing to do with
the video modules themselves. So, we assigned the control group to do an activity that is
typical, traditional, and should not result in gains beyond normal educational practice.
Because both groups of students will be doing something with mathematics for about the
same amount of time every day, we can more easily claim that any group differences are
related to the activity type (e.g., video or worksheet). This form of random assignment
will make it easier to make claims about the new videos and their potential impact on
student achievement than it was in the first example, where we had intact groups.
But there are still serious design challenges here. One issue is whether or not students
actually complete the worksheets or video modules. We would need a way to verify that
they completed those tasks (sometimes called a compliance check). Another possible
complication is with mortality, or dropout rates. Specifically, in a case where we ran-
domly assigned participants to groups, we would be concerned if the dropout rates were
not equal between groups. That is, we might assume some students will transfer out of
the class or stop participating in the research study. But because of random assignment,
the characteristics related to that decision to drop out of the study should be roughly
evenly distributed in the groups, so the dropout rates should be about the same. But what
if about 20% of the video group leaves the study and only 5% of the worksheet group
leaves? That could indicate there is something about the video-based modules that is
increasing the rate of dropout and makes it harder to infer much from comparing the
two groups at the end of the study.
One final issue we will discuss, though there are many design considerations in both
examples, is experimenter bias. If teachers in the school know which students are doing
video modules and which are doing worksheets, that knowledge might change the way
they interact with students. If a teacher presumes the new content to be helpful, the
teacher might give more encouragement or praise to students in that group, perhaps
without being conscious of it. The opposite could be true, too, with a teacher assuming
the new video modules are no better than existing techniques and interacting with stu-
dents in a way that reflects that assumption. To be clear, in both cases, the teacher is likely
unaware they might be influencing the results. It is possible that someone involved with
a study might make more conscious, overt efforts to swing the results of a study, but that
is scientific misconduct and quite rare. On the other hand, unknowingly giving slight
nudges to study participants is more common and harder to account for.
Finally, before we move on to discuss the assumptions of the test, there are a few broad
considerations in designing research comparing two groups. One is that the independent
samples t-test is a large sample analysis. Many methodologists suggest a minimum of 30
participants per group (60 participants overall in a two-group comparison). Those groups
also need to be relatively evenly distributed—that is, we want about the same number of
people in both groups. This is built into a random assignment process, but when using
intact groups, it can be more challenging. The general rule is that the smaller group needs
to be at least half the size of the larger group. So, for example, if we have 30 students in a
face-to-face class and 45 in an online class, the samples are probably balanced enough.
If, however, we had 30 students face-to-face and 65 online, the imbalance would be too
great (30 is well less than half of 65). However, we want the groups to be as close in size to
one another as possible. We will explore the reason for that a bit more in the section on
assumptions. As we discuss the assumptions of the independent samples t-test, some of
these design issues will become clearer, and we will introduce a few other issues to consider.
To assess this assumption, we must evaluate the research design and the nature of the variables. This is also an assumption for
which we cannot correct—if the level of measurement for the dependent variable is
incorrect, the t-test simply cannot be used at all.
Homogeneity of variance
The independent samples t-test, and many of the other tests covered in this book, requires
that the groups have homogeneous variance. In other words, the variance of each group
is roughly the same. The idea here is that, because we will be testing for differences in the
means of the two groups, we need variances that are roughly equal. A mean difference is
less meaningful if the groups also differ widely in variance. Basically, we are suggesting
that the width of the two sample distributions is about the same—that they have similar
standard deviations. That similarity will mean the two samples are more comparable.
Levene’s test
The simple test for evaluating homogeneity of variance is Levene’s test. It can be pro-
duced by the jamovi software with the independent samples t-test. Levene’s test is dis-
tributed as an F statistic (a statistic we will learn more about in a later chapter). In the
jamovi output for the t-test, the software will produce the F statistic and a related prob-
ability (labeled as p in the output). That probability value is evaluated the same way as
any other null hypothesis significance test. If p < .05, we will reject the null hypothesis. If
p ≥ .05, we will fail to reject the null hypothesis. However, it is very easy to be confused
by the interpretation of Levene’s test. The null hypothesis for Levene’s test is homogeneity
of variance, while the alternative hypothesis is heterogeneity of variance1:
H0 : s²X1 = s²X2
H1 : s²X1 ≠ s²X2
Because of this, failing to reject the null hypothesis on Levene’s test means that the
assumption of homogeneity of variance was met. Rejecting the null hypothesis on Lev-
ene’s test means that the assumption of homogeneity of variance was not met. Put simply,
if p ≥ .05, the assumption of homogeneity of variance was met.
t = (X̄ − Ȳ) / sdiff

In other words, t is equal to the mean difference between the two groups, over the stand-
ard error of the difference.
Recall our example of an instructor teaching both online and face-to-face versions of the same course. The instructor wants to know if there is a difference in final exam scores between the two versions of the course.
Imagine the instructor collects final exam scores and finds the following:2
Online Class (X)              Face-to-Face Class (Y)
Student    Score              Student    Score
1          85                 1          91
2          87                 2          89
3          83                 3          93
4          84                 4          94
5          81                 5          92
We can easily calculate a mean for each group. The online class will be group X, and the
face-to-face class will be group Y.
X̄ = ΣX / NX = (85 + 87 + 83 + 84 + 81) / 5 = 420 / 5 = 84.00

Ȳ = ΣY / NY = (91 + 89 + 93 + 94 + 92) / 5 = 459 / 5 = 91.80
In this case, there were five students in both classes, which is why the denominator is the
same in calculating both means. Returning to our t formula, we are already done with
the numerator!
Partitioning variance
In future chapters, the topic of partitioning variance will get more nuanced. Here, we have
only two sources of variance: between-groups (the mean difference), and within-groups
(standard error of the difference) variance. We discussed above the calculation of the mean
difference, which defines between-groups variance for the independent samples t-test. The
more complicated issue with this test is the within-groups variance, here defined by the
standard error of the difference. To understand where this number comes from, we will
start at the sdiff term, and work our way backward to a formula you already know. Then, to
actually calculate sdiff, we’ll work through this set of equations in the opposite order.
As we learned in a prior chapter, standard error and standard deviation are related
concepts, applied to different kinds of distributions. So, just as standard deviation was
the square root of sample variance, standard error is the square root of error variance.
We can express this for sdiff in the following way:
sdiff = √s²diff
That error variance (variance of the difference) is calculated by adding the partial vari-
ance associated with each group mean:
s²diff = s²MX + s²MY
That partial variance for each group mean is calculated by dividing the pooled variance
by the sample size of each group, so that:
s²MX = s²pooled / NX

s²MY = s²pooled / NY
The pooled variance is calculated based on the proportion of degrees of freedom coming
from each group multiplied by the variance of that group:
s²pooled = (dfX / dftotal)s²X + (dfY / dftotal)s²Y
And finally, we have already learned how to calculate the variance of each group:
s²X = Σ(X − X̄)² / (NX − 1)

s²Y = Σ(Y − Ȳ)² / (NY − 1)
Okay, that is a lot to take in all at once, and a lot of unfamiliar notation. So, we will pause
for a moment to explain a bit before reversing the order of the formulas and calculating
sdiff. If you follow the order of this from bottom-to-top, what is happening is we start with
the variance of each of the two groups. Those group variances represent within-groups
variation. We described this in an earlier chapter as giving a sense of the error associated
with the mean. However, for the t-test, we need a measure of overall within-groups vari-
ation, rather than a separate indicator for each group. To accomplish that, we go through
a series of steps to account for the proportion of participants coming from each group
and the variance of that group, to arrive at a pooled variance. That pooled variance then
gets adjusted again based on the sample size of each group (the first adjustment was for
degrees of freedom, not sample size), and finally gets combined as an indicator of with-
in-groups variation. Finally, we take the square root to get from variance to standard
error. Those concepts might become even clearer as we walk through the calculations
with our example.
Recall that, above, we calculated a mean final exam score of 84.00 for online students
and 91.80 for face-to-face students. We can use those means to calculate group variance,
using the same process we introduced in Chapter 3:
Online Class

Student    Score (X)    Deviation (X − X̄)    Squared Deviation (X − X̄)²
1          85           85 − 84 = 1           1² = 1
2          87           87 − 84 = 3           3² = 9
3          83           83 − 84 = −1          (−1)² = 1
4          84           84 − 84 = 0           0² = 0
5          81           81 − 84 = −3          (−3)² = 9
           Σ = 420      Σ = 0                 Σ = 20

s²X = Σ(X − X̄)² / (NX − 1) = 20 / (5 − 1) = 20 / 4 = 5.00
Face-to-Face Class

Student    Score (Y)    Deviation (Y − Ȳ)     Squared Deviation (Y − Ȳ)²
1          91           91 − 91.8 = −0.8       (−0.8)² = 0.64
2          89           89 − 91.8 = −2.8       (−2.8)² = 7.84
3          93           93 − 91.8 = 1.2        1.2² = 1.44
4          94           94 − 91.8 = 2.2        2.2² = 4.84
5          92           92 − 91.8 = 0.2        0.2² = 0.04
           Σ = 459      Σ = 0                  Σ = 14.80

s²Y = Σ(Y − Ȳ)² / (NY − 1) = 14.80 / (5 − 1) = 14.80 / 4 = 3.70
Our next calculation requires us to provide the degrees of freedom from each group and
the total degrees of freedom. Recall from the previous chapter on the one-sample t-test
that we learned, for a single sample, df = n − 1. You might imagine, then, that if we want
to know how many degrees of freedom are contributed by group X (the online class),
we could use the same formula, finding the dfx = NX − 1 = 5 − 1 = 4. Similarly, for group
Y (the face-to-face class), we’d find the dfY = NY − 1 = 5 − 1 = 4. So, the total degrees of
freedom would be 4 + 4 = 8. You can also extrapolate from this that, for the independent
samples t-test, dftotal = ntotal – 2. Armed with that information, we are ready to calculate
the pooled variance:
s²pooled = (dfX / dftotal)s²X + (dfY / dftotal)s²Y = (4/8)(5) + (4/8)(3.7)
= (.5)(5) + (.5)(3.7) = 2.5 + 1.85 = 4.35
Next, we partial the pooled variance into variance associated with each group mean
(a process through which we make a further adjustment for the size of each sample):
s²MX = s²pooled / NX = 4.35 / 5 = 0.87

s²MY = s²pooled / NY = 4.35 / 5 = 0.87
This step is not particularly dramatic when we have balanced samples. However, it is easy
to see how if we had different numbers of students in the two classes, this adjustment
would account for that unbalance.
Next, we will calculate the difference variance:

s²diff = s²MX + s²MY = 0.87 + 0.87 = 1.74

And finally, we'll convert the difference variance to the standard error of the difference:

sdiff = √s²diff = √1.74 = 1.32
As we mentioned at the start of this section, the process of getting the denominator
value, which represents within-groups variation, is more laborious. Many of the steps
in that calculation are designed to account for cases where we have unbalanced samples
by properly apportioning the variance based on the relative “weight” of the two samples.
But having drudged through those calculations, we are now ready to examine the ratio
of between-groups to within-groups variation.
t = (X̄ − Ȳ) / sdiff

In our case, we have all the information we need to calculate the test statistic as follows:

t = (84.00 − 91.80) / 1.32 = −7.80 / 1.32 = −5.91 (or −5.913 if we carry more decimal places, as jamovi will)
Here, the negative sign on the t-test value simply indicates that the second group (Y, or,
in our case, face-to-face instruction) had the higher score. If we had treated online
instruction as group Y and face-to-face instruction as group X, we’d have the same test
statistic, except it would be positive. So, the order of the groups doesn’t matter, but it will
affect the sign (+/−) on the t-test value.
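For readers following along in R, this sketch (our own object names; the data come from the example) reproduces the hand calculation and checks it against the built-in Student's t-test.

x <- c(85, 87, 83, 84, 81)          # online class
y <- c(91, 89, 93, 94, 92)          # face-to-face class

df_x <- length(x) - 1; df_y <- length(y) - 1
s2_pooled <- (df_x * var(x) + df_y * var(y)) / (df_x + df_y)   # 4.35
s2_diff <- s2_pooled / length(x) + s2_pooled / length(y)       # 1.74
s_diff <- sqrt(s2_diff)                                        # 1.32

(mean(x) - mean(y)) / s_diff        # t = -5.913
t.test(x, y, var.equal = TRUE)      # the built-in equivalent (Student's t-test)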
Effect size estimates help us judge whether significant differences are also meaningful. We have previously discussed and calculated Cohen's d as an effect size
estimate. We will demonstrate how to calculate and interpret Cohen's d for the independent
samples t-test too. Then we'll explore another effect size estimate, ω2 (or omega squared).
Calculating Cohen’s d
When we learned the one-sample t-test, we also learned to calculate d based on the mean
difference over the standard error. Cohen’s d will work basically the same way in the
independent samples t-test, just with different ways of getting at the mean difference and
standard error. For the independent samples t-test, Cohen’s d will be calculated as follows:
d = (X̄ − Ȳ) / spooled
As part of the process of calculating t, we already have all of these terms. The numerator
is the mean difference, which, in the case of the independent samples t-test, is the differ-
ence between the two groups’ means. The denominator is the standard error, which in
our case will be the square root of the pooled variance. The reason it will be the square
root is that we want standard error (spooled), which is the square root of the pooled vari-
ance (s2pooled). In other words:
spooled = √s²pooled = √4.35 = 2.09
We can take all of that information, plug it into the formula for d, and calculate the effect
size:

d = (84.00 − 91.80) / 2.09 = −7.80 / 2.09 = −3.73
Remember that, for d, we report the absolute value (dropping the sign), which is why we
reported it here as 3.73 rather than −3.73. That would be a very large effect, according
to the general rules for interpreting d we described in the previous chapter, where any d
larger than .8 is typically considered large. Checking the box for effect size in jamovi will
produce Cohen’s d.
Calculating ω2
One of the problems with Cohen’s d is that it can be difficult to interpret. Even the general
cutoff points we described in the previous chapter are, in practice, not particularly use-
ful. Cohen himself suggested that those cutoff points are arbitrary and may not be mean-
ingful in applied research. We also do not know what it means proportionally, especially
because d can theoretically range from zero to infinity. Because of those limitations, and
others, researchers often prefer to use omega squared as an effect size estimate. We will
discuss more about interpreting this effect size estimate below, but it ranges from zero to
one and represents a proportion of explained variance. Because it is a proportional esti-
mate, it is often easier to interpret and make sense of than unbounded estimates like d.
Like d, the formula for omega squared will vary based on the statistical test to which
we apply it. In the case of the independent samples t-test, it is calculated as follows:
ω² = (t² − 1) / (t² + NX + NY − 1)
In this formula, we already calculated t, and we know the sample size of each group (here
represented as NX and NY). Because all of this information is already known, we can cal-
culate omega squared for our example:

ω² = ((−5.913)² − 1) / ((−5.913)² + 5 + 5 − 1) = (34.963 − 1) / (34.963 + 9) = 33.963 / 43.963 = .77
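Both effect size estimates can be computed from quantities we already have; here is a base R sketch with our own names.

t_stat <- -5.913; n_x <- 5; n_y <- 5
s2_pooled <- 4.35                                # from the hand calculation above

(84.00 - 91.80) / sqrt(s2_pooled)                # Cohen's d, about -3.73 (report |d|)
(t_stat^2 - 1) / (t_stat^2 + n_x + n_y - 1)      # omega squared, about .77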
A final note is that in our example, we are using fake data. We made the data up for
the purposes of the example. In real research, an effect size this large would be shocking.
By reading the published research in your area, you’ll get a feel for typical effect sizes,
but it is fairly uncommon for omega squared to exceed .20 in educational and behavioral
research. So, when you start working with real data, do not be discouraged to see some-
what smaller effect size estimates than we find in these contrived examples.
To enter these data, we will set up the jamovi data spreadsheet with two columns. The first column will hold the exam scores; we might name it FinalExamScore. We could name our variables anything we want, so long as the name starts with a letter and doesn't
contain spaces, but it should be something we can easily identify later on. We can also
give the variable a better description on the “Description” line of the setup menu, which
can contain any kind of text so we can include a clearer description, if needed. Next, we
will set up the second column, which we might name ClassVersion. This variable will be
a Nominal data type, because the class versions are nominal data. We can also label the
groups using the “Levels” feature. Note that this will not work until after we have typed
in the data so the software knows what group numbers we will need to label. In our case,
the data will have two groups, which we will simply number 1 and 2. Group 1 will be the
Online Class group and group 2 will be the Face-to-Face Class group.
The step of adding group labels is optional—the analysis will run just fine without setting
up group labels. However, if we do take the time to go in and set up group labels (which,
again, needs to be done after data entry, so we are going just a little bit out of order to
show the data setup all together), the output will be labelled in the same way, making
it easier to interpret. One final step we may want to take is to delete the third variable
(automatically named “C”) that jamovi created by default. To do so, right click on the
column header “C” and click “Delete Variable.” It will prompt you to confirm you wish to
delete the variable. Simply click “OK” and the variable will be permanently deleted. This
step is also optional but results in a somewhat cleaner data file.
To enter the data, we simply type it into the spreadsheet. Note that the group mem-
bership will be entered as 1 for the Online Class and 2 for the Face-to-Face class, but if
you set up the group labels in jamovi, you will see the group names in the spreadsheet as
in this example. If you do not add the group labels, the spreadsheet will show the 1 or 2
in those cells instead.
Now that the data file is set up and the data entered, we are ready to run the independent
samples t-test. At the top of the window, click on the “Analyses” tab, then the “T-Tests”
button (note that due to the software formatting, it has a capital T although the t in t-test
is lowercase). Then choose the independent samples t-test.
In the resulting menu, click FinalExamScore and then the arrow next to the Depend-
ent Variables area to set that as the dependent variable. Then click on ClassVersion and
then the arrow next to Grouping Variable to set that as the independent variable. Notice
there are a number of options showing on the screen. By default, jamovi will have the
box for “Student’s” checked. This is the uncorrected t-test, which is what we will use if
the assumption of homogeneity of variance was met. To produce Levene's test of that assumption, we will check
the box next to “Equality of variances” under “Assumption Checks.” Another option
we will want to check is the “Descriptives” box under “Additional Statistics.” Note also
that there are options under “Hypothesis,” and by default it has a two-tailed hypothe-
sis checked. Probably the easiest way to conduct the test is to always leave that option
checked, and simply divide p by two if the test was actually one-tailed. Another option to
check is the “Confidence interval” under “Additional Statistics.” Finally, as we described
above, the “Effect size” option will produce Cohen’s d, but typically we would be more
interested in calculating omega squared (ω2).
One thing you may notice about jamovi as you select all the options is that the out-
put shows up immediately to the right, and updates in real time as you select different
options. This is a nice feature of jamovi compared to some other analysis software, as
it allows us to select different options without having to entirely redo the analysis. For
example, if we find that the assumption of homogeneity of variance was not met, we can
check the box for Welch’s correction and the output will update accordingly.
The output will start with the independent samples t-test results, then the assumption
tests, and finally the group descriptive statistics. However, we will discuss the output
starting with the assumption tests, because that output would inform how we approach
the main analysis. Note that for Levene’s test, jamovi produces an F ratio, degrees of free-
dom, and a p value. As discussed earlier in this chapter, if p ≥ .050, then the assumption
was met. In this case, F(1, 8) = 0.044, p = .839, so the assumption was met. As a result, we
can proceed with the Student's (or standard, uncorrected) t-test. If the data had not met
the assumption, we could choose Welch's correction and interpret it instead of the
Student's test.
Next, we will look at the independent samples t-test output. We see that t at 8 degrees of
freedom is −5.913, and p < .001. Because p < .050, the difference in exam scores between
online and face-to-face students was statistically significant. The software will, by default,
produce two-tailed probabilities for independent samples t-test. In our example, that
works because our hypothesis was two-tailed. If we needed to produce a one-tailed test,
we could simply divide the probability value reported in jamovi in half. We are also given
the 95% confidence interval, which is based on the standard error. From this, we can
determine that 95% of the time, in a sample of this same size from the same population,
we would expect to find a mean difference between −10.842 and −4.758.
A note here about rounding: The default in jamovi is to round to the hundredths place,
or two places after the decimal. We described this in an earlier chapter, but typically we
will want to report statistical test results to the thousandths place (three after the dec-
imal). There is an easy setting for this in jamovi. Simply click the three vertical dots in
the upper right corner of the software, and then change the “Number format” to “3 dp”
and the “p-value format” to “3 dp”. In this same menu, if we click “Syntax mode”, jamovi
will display the code used to produce the output. The jamovi software is based on R, a
programming language commonly used for statistical analysis. Taking a look and getting
familiar with R coding can be very useful, especially because there are some advanced
analyses (beyond the scope of this text) that might require using R directly.
You can see that we are suggesting a roughly five-sentence paragraph to describe the
results of an independent samples t-test. Pulling the answers to those questions together
for our example, we could create a results paragraph something like this:
An independent samples t-test revealed that final exam scores differed significantly between the two versions of the course, t(8) = −5.913, p < .001. About 77% of the variance in final exam scores was explained by the version of the class in which students were enrolled (ω2 = .77). Students in the face-to-face version of the class
(M = 91.80, SD = 2.24) scored higher on the final exam than did those in
the online version of the class (M = 84.00, SD = 1.92). Among the present
sample, students performed better on the final exam in the face-to-face
version of the class.
Notes
1 While the assumption of homogeneity of variance actually refers to population variance,
Levene’s test only assesses sample variance. That is, the assumption of homogeneity of vari-
ance is that the population variances for the two groups are equal. But, because we do not
have data from the full population, Levene’s test uses sample variances. As a result, we have
expressed the null and alternative hypotheses for Levene’s test using Latin notation, rather
than Greek notation. Other texts will show Levene’s test hypotheses in Greek notation (trad-
ing sigma for s) because the assumption is actually about the population.
2 In this example, as in most computation examples in this text, the sample size is quite small.
This is to make it easier to follow the mechanics of how the tests work. In actuality, this
sample size is inadequate for a test like the independent samples t-test. But, for the purposes
of demonstrating the test, we’ve limited the sample size.
7
Independent samples
t-test case studies
In the previous chapter, we explored the independent samples t-test using a made-up exam-
ple and some fabricated data. In this chapter, we will present several examples of published
research that used an independent samples t-test. For each sample, we encourage you to:
1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the t-test.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.
Research questions
The researchers were interested in determining:
Hypotheses
The authors hypothesized the following related to conceptual knowledge:
For the conceptual knowledge hypothesis, the dependent variable (DV) was conceptual knowledge test scores (a 12-item multiple-choice test). The
authors reported that they evaluated the content validity of the conceptual knowledge
test by using a subject-matter expert to check the correctness of the questions and the
possible answers. For the transfer hypothesis, the dependent variable was transfer test
scores (two open-ended questions) rated by two raters. The raters generated scores for
each student, and each question was worth 13 points, for a maximum of 26 points on
the transfer test. The interrater reliability of the two raters was high (ICC = .98), indicating
that the raters were consistent in rating the students' transfer knowledge scores.
The authors reported Cohen’s d. However, we could calculate omega squared for
each test:
Transfer: ω² = (t² − 1) / (t² + NX + NY − 1) = (2.317² − 1) / (2.317² + 24 + 24 − 1)
= (5.368 − 1) / (5.368 + 47) = 4.368 / 52.368 = .083

Conceptual: ω² = (1.324² − 1) / (1.324² + 24 + 24 − 1)
= (1.753 − 1) / (1.753 + 47) = 0.753 / 48.753 = .015
From these calculations, we can determine that about 8% of the variance in transfer knowledge was explained by the type of explanation (ω² = .083). We would not interpret the effect size for conceptual knowledge because the test was nonsignificant.
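As a quick check on this arithmetic, the same computation is easy to script. The following is an illustrative sketch only (the function name omega_sq_from_t is ours, not from the study or jamovi):

def omega_sq_from_t(t, n_x, n_y):
    # omega squared from an independent samples t-test:
    # (t^2 - 1) / (t^2 + N_X + N_Y - 1)
    return (t ** 2 - 1) / (t ** 2 + n_x + n_y - 1)

print(round(omega_sq_from_t(2.317, 24, 24), 3))  # transfer: 0.083
print(round(omega_sq_from_t(1.324, 24, 24), 3))  # conceptual: 0.015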
5. What is the pattern of group differences?
Those producing oral explanations (M = 11.062, SD = 3.446) scored higher on
transfer knowledge than those generating written explanations (M = 9.146,
SD = 2.134). For conceptual knowledge, there was no significant difference between
the oral (M = 8.333, SD = 1.834) and written (M = 8.917, SD = 1.139) conditions.
Write-up
Results
Now, compare this version, which follows the format we suggested in Chapter 6, to the
published version. What is different? Why is it different? Notice that, in the full article,
the t-tests are just one step among several analyses the authors used. Using the t-test in
conjunction with other analyses, as these authors have done, results in some changes in
how the test is explained and presented.
Research questions
The researchers wanted to know if participants would rate the essays using the words "partner" or "husband" (implying the author was gay) differently than they rated the essay using the word "wife" (implying the author was straight). Their literature review
suggested that participants might have an implicit bias against gay applicants, which
might result in lower ratings on some scales. In particular, they wanted to test differences
in perceived “fit” with the graduate program, because fit is a subjective quality where
implicit or unconscious bias would be more likely to manifest. They also tested differ-
ences in rating of preparedness for graduate school.
Hypotheses
The authors hypothesized the following related to ratings of fit:
H0: There was no difference in ratings of fit between participants reading the wife
essay version versus the husband or partner essay versions.
H1: Participants reading the wife version would provide higher fit ratings than those
reading the husband or partner versions.
Notice that these hypotheses are one-tailed. They specify a direction of difference—that
the wife essay will get higher ratings. By default, jamovi produces two-tailed probabilities for
t-tests, so we will have to divide those probabilities by two to get the one-tailed probabilities.
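To make that conversion concrete, here is a minimal sketch (ours, not a jamovi feature; the inputs are hypothetical). Note the direction check: halving the two-tailed probability is only appropriate when the sample difference falls in the hypothesized direction; otherwise the one-tailed p is 1 minus half the two-tailed p.

def one_tailed_p(p_two_tailed, observed_diff, predicted_positive=True):
    # Halve the two-tailed p only if the observed difference points
    # in the hypothesized direction.
    in_direction = (observed_diff > 0) == predicted_positive
    half = p_two_tailed / 2
    return half if in_direction else 1 - half

# hypothetical: jamovi reports p = .084 two-tailed, and the wife-essay
# mean is higher, as hypothesized
print(one_tailed_p(0.084, observed_diff=0.3))  # 0.042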
Write-up
Results
Again, compare this to the published study to see how they differ. In this case, we’ve
focused on Study 2 of the manuscript, but the authors have written about the t-test in a
rather different way than we have here because it was one of multiple analyses they used.
This is very typical in published work—to see multiple analyses in a single paper. This is
perhaps especially true of the use of the independent samples t-test, which is often used
as an additive or preliminary analysis to other tests. However, the independent samples
t-test can certainly stand on its own, especially in experimental research.
In the next chapter, we’ll move on to comparisons of more than two groups using the
one-way ANOVA. It functions in similar ways to the t-test but also has some key differ-
ences. The ANOVA is a more general form of the t-test, because it can test any number
of groups, while the t-test can only test two groups at a time.
For additional case studies, see the online eResources, which include dedicated
examples related to race and racism in education.
Note
1 The process by which we simulate data for these case studies results in data that are almost
perfectly normally distributed. Remember that the example datasets on the online resources
are not actual human subjects’ data, but simulated data to reproduce the outcomes of the case
study articles. If you decide to run the tests for normality for practice on these datasets, keep
in mind they will be nearly perfect due to the manner in which we have simulated those data.
8
Comparing more than
two sample means
In the previous chapter, we explored the independent samples t-test as a way to compare
two group means. However, many research designs will involve more than two groups,
and the t-test is an inefficient way to conduct those comparisons. In this chapter, we will
encounter the one-way analysis of variance (ANOVA, for short) for comparing more
than two group means. We will also explore how it is related to the t-test, and why we
cannot just use multiple t-tests to do multiple group comparisons.
The F distribution
The ANOVA produces an F statistic, unlike the t-test, which produces a t statistic. The F here stands for Fisher, but for our purposes we will typically refer to it as the F test or the F ratio. F has some characteristics that are a bit different from t, though. Both F and t have (as we described in the prior section) sampling distributions of the test statistic, so that given a test statistic and degrees of freedom, we can calculate the probability associated with that test result. The t distribution was a symmetric, bell-shaped distribution with a mean, median, and mode of zero (not unlike the z distribution). The F distribution takes on a different shape, though. And it does so because F cannot be negative (we'll discover why shortly). Because of that, the F distribution is not normal, unlike z. While it won't be particularly important that you can visualize the F distribution, the graphic below shows how it might be shaped in several different situations.
As illustrated by this figure, the shape of the F distribution is quite different from pre-
vious sampling distributions we have encountered. Its shape varies based on the degrees
of freedom, of which there are infinite possible combinations. However, our interaction
with the F distribution will be quite similar. For hand-calculated examples, we will look
up the critical value of F in a table, and if our calculated value exceeds the critical value
(if we “beat the table”), then we reject the null hypothesis.
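The same lookup can also be done in software rather than a printed table. A brief sketch, assuming the scipy library is available (this is our illustration, not something the text requires):

from scipy import stats

df_between, df_within = 2, 12              # hypothetical degrees of freedom
f_crit = stats.f.ppf(0.95, df_between, df_within)
print(round(f_crit, 2))                    # 3.89, the critical F at alpha = .05

# and the exact probability for an observed F:
print(stats.f.sf(19.38, df_between, df_within))  # about .0002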
Familywise error
This problem is referred to as a familywise error. Here, we are thinking about the set
of data as a “family” in that the data are related to one another. When we perform
multiple tests on that same family of data, the Type I error piles up. We usually set
testwise Type I error (that is, the Type I error rate for an individual test) at .05. But if I do multiple tests in the same family of data, the error compounds. The formula for computing how much error we get is $\alpha_{fw} = 1 - (1 - \alpha_{tw})^c$, where c is the number of tests, and $\alpha_{tw}$ is the testwise Type I error rate. It won't be important to learn this formula, but we'll use it briefly to illustrate how much familywise error we can accumulate.
In the case where we have set testwise Type I error at .05 and we are doing three tests (as in the example above, to compare three groups), $\alpha_{fw} = 1 - (1 - \alpha_{tw})^c = 1 - (1 - .050)^3 = 1 - .950^3 = 1 - .857 = .143$. In other words, I have a 14.3% chance of making a Type I
error across the set of tests. This gets much worse as we expand the number of groups.
For example, if we want to compare five groups using multiple independent samples t-tests, we would need to do ten t-tests (1 vs. 2, 1 vs. 3, 1 vs. 4, 1 vs. 5, 2 vs. 3, 2 vs. 4, 2 vs. 5, 3 vs. 4, 3 vs. 5, 4 vs. 5). Using our formula, $\alpha_{fw} = 1 - (1 - \alpha_{tw})^c = 1 - (1 - .050)^{10} = 1 - .950^{10} = 1 - .599 = .401$. That is a 40.1% chance of making a Type I error across the set of tests.
This is unacceptably high, and illustrates why we do not use multiple tests in the same
family of data. As we do, it becomes increasingly likely that we are claiming differences
exist that, in reality, do not exist.
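Because the formula is so simple, a few lines of code show how fast this error accumulates (the function name is ours; math.comb counts the pairwise tests):

from math import comb

def familywise_error(alpha_tw, n_tests):
    # alpha_fw = 1 - (1 - alpha_tw)^c
    return 1 - (1 - alpha_tw) ** n_tests

for k in (3, 5, 10):              # number of groups
    c = comb(k, 2)                # pairwise t-tests required
    print(k, c, round(familywise_error(0.05, c), 3))
# 3 groups  ->  3 tests -> 0.143
# 5 groups  -> 10 tests -> 0.401
# 10 groups -> 45 tests -> 0.901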
Homogeneity of variance
We also encountered the assumption of homogeneity of variance in Chapter 6. Here,
we have the assumption that the variances will be equal across all groups. This relates to
the idea that we are expecting our group means to differ but with relatively constant vari-
ance across the groups. This is especially important in the ANOVA design, because when
we calculate the test statistic, we will calculate a within-groups variation. For that cal-
culation to work properly, we need relatively consistent variation in each of our groups.
However, it is worth noting that it is far less common to violate this assumption in the
ANOVA. When we do violate this assumption, it is often due to unbalanced sample
sizes. As we discussed in Chapter 6, a good guideline is that no group should be more
than twice as large as any other group. Because variance is, in part, related to sample
size, groups with very different sample sizes will have different variances. The strongest
protection against failing this assumption is to have balanced group sizes or as close to
balance as possible.
through other means. In order to proceed with an uncorrected F statistic, even though
Levene’s test is significant, the following three conditions must all be met:
1. The sample size is the same for all groups (slight deviations of a few participants are
acceptable).
2. The dependent variable is normally distributed.
3. The largest variance of any group divided by the smallest variance of any group
is less than three. Or, put another way, the smallest variance of any group is more
than 1/3 the largest variance of any group.
If all three conditions are met, there is no need for a correction. However, if one
or more of these conditions was not met, jamovi has a correction available known
as the Welch correction. In fact, in jamovi, it defaults to the Welch correction,
and we must choose the uncorrected (or exact, or Fisher’s) test if it is appropriate.
Note, though, that this option is only available in the One-Way ANOVA menu,
and we normally will choose to use the ANOVA menu as it is more versatile, so
if the correction is needed, we would need to change which part of the program
we use.
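The third condition above is easy to check directly. Here is a small sketch with hypothetical group scores (the data and names are ours, purely for illustration):

from statistics import variance

groups = {
    "film":    [3.9, 3.5, 4.1, 3.8, 4.0],
    "pen_pal": [3.2, 3.6, 3.5, 3.3, 3.7],
    "normal":  [4.2, 4.9, 4.5, 4.7, 4.5],
}
variances = {name: variance(scores) for name, scores in groups.items()}
ratio = max(variances.values()) / min(variances.values())
print(round(ratio, 2), ratio < 3)  # about 1.58, so this rule is satisfied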
Source     SS     df     MS     F
Between
Within
Total
As we move forward, we’ll explore how to calculate each of these and the logic behind
the test statistics.
Partitioning variance
The ANOVA is called the “analysis of variance” because it involves partitioning total
variation into different sources of variance. In the one-way ANOVA, those sources are
between-groups and within-groups variance. But what is being partitioned into between
and within variation is the total variance. In the ANOVA, we determine how much total
variation is in the data based on deviations from the grand mean.
The grand mean is simply the mean of all the scores; that is, the overall mean regardless of which group a participant belongs to, often written as $\bar{\bar{X}}$. You might recall that $\bar{X}$ normally notates a group mean. The second bar indicates this is the grand mean, with some texts using the phrase "mean of means." That terminology really only works in perfectly balanced samples, though, where the grand mean and the mean of the group means will be equal. However, sometimes knowing that background can make it easier to remember that the double bar over a variable indicates the grand mean.
How then do we calculate variation from the grand mean? For the purposes of the
ANOVA source table, we’ll be calculating the sum of squares (SS), or sum of the squared
deviation scores. You might recall this in previous chapters where the numerator of the
variance formula was called the sum of squares or sum of the squared deviation scores.
The SST will be calculated based on deviations from the grand mean (just like the SS in
the numerator of variance was calculated from the group means). So, for the ANOVA:
$SS_T = \sum (X - \bar{\bar{X}})^2$
Returning to our example, where we have students doing three different kinds of course-
work and hope to evaluate if there are any differences in racial stereotypes between students
doing these different kinds of work, imagine that each group has five children. For research
design purposes, five participants per group would not be sufficient (we would normally
want at least 30 per group), but for the purposes of illustrating the calculations, we will stick
to five per group. Below, we illustrate the calculations involved in getting the SST .
[Table: each participant's group, score, deviation from the grand mean $(X - \bar{\bar{X}})$, and squared deviation $(X - \bar{\bar{X}})^2$]
We would start by calculating the grand mean, which is the total of all scores divided by
the number of participants—in our case 59.4/15 = 3.96. We then take each score minus
the grand mean of 3.96 to get the deviation scores (which will sum to zero as we discov-
ered in prior chapters). Finally, we square the deviation scores and take the sum of the
squared deviation scores. The sum of the squared deviation scores from this procedure
is SST, which is 4.04 in this case.
Next, we calculate the within-groups sum of squares, $SS_W = \sum (X - \bar{X})^2$, taking each participant's deviation from their own group mean:
[Table: each participant's group, score, deviation from the group mean $(X - \bar{X})$, and squared deviation $(X - \bar{X})^2$]
In this example, then, the SSW = 0.95. We are now 2/3 of the way through calculating the
sums of squares, after which we’ll move to complete the source table.
The final SS term we need to calculate is SSB, which measures between-groups varia-
tion. The formula for this term is:
$SS_B = \sum (\bar{X} - \bar{\bar{X}})^2$
That is, the sum of the squared deviations of the group mean minus the grand mean.
This point can be confusing at first, because there is only one group mean per group,
but there are multiple participants per group. We will repeat the process for every par-
ticipant in every group, as illustrated below (remember from above that the grand mean
is 3.96):
[Table: each participant's group, score, deviation of the group mean from the grand mean $(\bar{X} - \bar{\bar{X}})$, and squared deviation $(\bar{X} - \bar{\bar{X}})^2$]
As shown above, we took the group mean for each participant minus the grand mean,
giving the deviation score. Then we squared those deviation scores and took the sum of
the squared deviation scores. So, for our example, SSB = 3.10. Let’s go ahead and drop
those SS terms into the source table:
Source     SS      df     MS     F
Between    3.10
Within     0.95
Total      4.04
Because the total variance is partitioned into between and within, we expect SSB + SSW
= SST. In this illustration, we are off by 0.01 because we rounded throughout to the
hundredths place. If we had not rounded (the software will not round), this would be
exactly equal. Next, we’ll move on to fill in the rest of the source table.
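It can be reassuring to confirm that the partition holds exactly when nothing is rounded. The sketch below uses hypothetical scores chosen to have the same group means as our example (3.86, 3.46, and 4.56); the data are ours, not the chapter's actual table:

scores = {
    "film":    [3.9, 3.5, 4.1, 3.8, 4.0],
    "pen_pal": [3.2, 3.6, 3.5, 3.3, 3.7],
    "normal":  [4.2, 4.9, 4.5, 4.7, 4.5],
}
all_scores = [x for g in scores.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)   # 3.96, as in the chapter

ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
ss_within = sum((x - sum(g) / len(g)) ** 2
                for g in scores.values() for x in g)
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in scores.values())

print(round(ss_between, 2))                                    # 3.10
print(round(ss_total, 4) == round(ss_between + ss_within, 4))  # True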
$F = \frac{MS_B}{MS_W}$
In our case, that would mean that F = 1.55/0.08 = 19.38. The completed source table for
this example would be:
Source     SS      df     MS      F
Between    3.10    2      1.55    19.38
Within     0.95    12     0.08
Total      4.04    14
Notice that the bottom three cells in the right-hand corner are empty. That pattern will
always be present in every ANOVA design we learn. Finally, we supply below for refer-
ence a source table with the formulas for each term:
Source     SS                                         df               MS                           F
Between    $SS_B = \sum(\bar{X} - \bar{\bar{X}})^2$   $df_B = k - 1$   $MS_B = \frac{SS_B}{df_B}$   $F = \frac{MS_B}{MS_W}$
Within     $SS_W = \sum(X - \bar{X})^2$               $df_W = n - k$   $MS_W = \frac{SS_W}{df_W}$
Total      $SS_T = \sum(X - \bar{\bar{X}})^2$         $df_T = n - 1$
$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W}$
All of the necessary information is on our source table, so we can just drop those terms into the formula and calculate omega squared:
$\omega^2 = \frac{3.10 - (2)(0.08)}{4.04 + 0.08} = \frac{2.94}{4.12} = .71$
About 71% of the variance in racial bias scores was explained by group membership (software, which does not round the intermediate values, gives the slightly more precise estimate of .716, as we will see later in this chapter).
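Expressed as code, the same drop-in computation looks like this (a sketch; the function name is ours):

def omega_squared(ss_between, df_between, ms_within, ss_total):
    # omega^2 = (SS_B - df_B * MS_W) / (SS_T + MS_W)
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(round(omega_squared(3.10, 2, 0.08, 4.04), 2))  # about .71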
Post-hoc tests
We will start by exploring post-hoc tests, or pairwise comparisons. In published research,
post-hoc tests tend to be more common and are many researchers’ default choice. We
will talk more later about why that might not be the right default choice, but post-hoc
tests are quite common. In a post-hoc test, we will test all possible combinations of
groups and interpret the pattern of those comparisons to determine how the groups
differ from one another. They are also called pairwise comparisons because we compare
all possible pairs of groups. In our example, that would mean comparing students doing
the film project to those doing the pen pal project, then comparing those doing the film
project to those doing normal coursework, and finally comparing those doing the pen
pal project to those doing normal coursework. Of course, when we have more than three
groups, we get many more possible pairs, and thus will have far more pairwise compar-
isons. No matter how many groups we have, though, the basic process will be the same.
$HSD = \frac{\bar{X}_1 - \bar{X}_2}{s_m}$
The standard error term, $s_m$, is calculated as:
$s_m = \sqrt{\frac{MS_W}{N}}$
In this formula, N is the number of people per group. Because of that, the formula works
as written only if there are the same number of people in each group. In the event the
groups are unbalanced (that is, they do not each have an equal number of participants),
we replace N with an estimate calculated as:
$N' = \frac{k}{\sum \frac{1}{N}}$
In this formula, k is the number of groups, and N is the number of people in each group.
In other words, we would divide 1 by the number of people in each group (one at a time),
and sum the result from each group, which becomes the denominator of the equation.
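In code form, the adjustment is a one-liner (a sketch; n_prime is our name for it):

def n_prime(group_sizes):
    # k divided by the sum of 1/N across groups
    k = len(group_sizes)
    return k / sum(1 / n for n in group_sizes)

print(n_prime([5, 5, 5]))             # balanced groups: 5.0
print(round(n_prime([4, 5, 7]), 2))   # unbalanced groups: 5.06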
If that feels a little confusing, no need to worry. It is probably sufficient to know that,
when there are unbalanced group sizes, we make an adjustment to the standard error
calculation to account for that. In practice, we’ll usually run this analysis with software,
which will do that correction automatically. For our example data, we have five people
per group in each group, so no adjustment will be needed. Let us walk through calculat-
ing the Tukey post-hoc test for our example and determine how our three groups differ.
First, we'll calculate $s_m$ for our example, taking $MS_W$ from our source table, and replacing N with the number of people per group (we had 5 in all groups):
$s_m = \sqrt{\frac{MS_W}{N}} = \sqrt{\frac{.08}{5}} = \sqrt{.02} = .14$
So, the denominator for all of our Tukey post-hoc tests will be .14. Next, we will calculate
our three comparisons.
First, let’s compare students doing the film project (M = 3.86) to those doing the pen
pal assignment (M = 3.46):
$HSD = \frac{\bar{X}_1 - \bar{X}_2}{s_m} = \frac{3.86 - 3.46}{.14} = \frac{.40}{.14} = 2.86$
Next, we will compare those doing the film project (M = 3.86) to those doing normal coursework (M = 4.56):
$HSD = \frac{3.86 - 4.56}{.14} = \frac{-.70}{.14} = -5.00$
Finally, we compare those doing the pen pal assignment (M = 3.46) to those doing normal coursework (M = 4.56):
$HSD = \frac{3.46 - 4.56}{.14} = \frac{-1.10}{.14} = -7.86$
Comparing the absolute values of these statistics against the critical value of the studentized range statistic (approximately 3.77 for three groups and 12 within-groups degrees of freedom), only the film versus pen pal comparison (HSD = 2.86) fails to exceed the critical value.
Based on that, we can conclude that there is a significant difference in racial bias between
those doing the film project and the normal coursework, there is a significant difference in
racial bias between those doing the pen pal assignment and normal coursework, but there
is no significant difference in racial bias between those doing the film project and the pen
pal assignment. We can take the final step in this analysis by examining the means to see
that those doing normal coursework (M = 4.56) had significantly higher racial bias than
those doing the pen pal assignment (M = 3.46). Similarly, those doing normal coursework
(M = 4.56) had significantly higher racial bias scores than those doing the film project
(M = 3.86). We know the difference is significant based on the Tukey test result, and we
know that their bias scores are higher by examining the group means. Finally, there was
no significant difference between those doing the pen pal assignment and the film project.
The Tukey HSD post-hoc is only one of the available post-hoc tests. There are many more available to us. All of them operate on the same basic logic and mathematics as HSD, but estimate error somewhat differently, so they will yield somewhat different test results. We should also point out that it is common for researchers to report only the p values for these comparisons.
In practical terms, the differences in p values between these tests in most applied research
will be very small, perhaps .020 or less. Of course, with small differences between groups,
that can be the difference between rejecting the null and failing to reject the null (between
determining a difference is significant or not significant). In the table below are the p values for our three comparisons under each of the post-hoc tests. This illustrates nicely how they line up in terms of power and error:
[Table: p values for the Film vs. Pen Pal, Film vs. Normal, and Pen Pal vs. Normal comparisons under each post-hoc test]
In this case, our selection of post-hoc test can potentially change our interpretation of
the results. For example, if we had used LSD post-hoc tests, we would conclude
there is a significant difference in bias among students doing the film project vs. those
doing the pen pal assignment. All of the other tests would lead us to conclude there was
no significant difference in those two groups. It is also worth noting that, generally, as the
sample sizes increase, the difference between the test results will become smaller.
The general rule will be to prefer more conservative tests when possible. That is espe-
cially true for research in an established area or confirmatory research. However, when
our sample sizes are smaller, the area of research is more novel, or our work is more
exploratory in nature, we might prefer to use a less conservative test. There are probably
few situations that justify the use of the LSD post-hoc, as it is quite liberal and provides
minimal Type I error protection, for example. But we could select among other tests
based on the research question and other factors. As a final note on selecting the appro-
priate post-hoc test, it is never acceptable to “try out” various post-hoc tests in the same
sample to see which one gives the desired result. We should select the post-hoc test in
advance and apply it to our data whether it gives us the answer we want or not.
There are many more post-hoc tests available than these, too. In jamovi, we will find
several options. The four we outlined here are seen most commonly in educational and
behavioral research, though, and are good general purpose post-hoc tests that will be
appropriate for the vast majority of situations.
A priori comparisons
Of course, we might have expected that would be the case. In all likelihood, the reason
we were testing the interventions was because we figured they would reduce racial bias as
compared with normal coursework. If so, we had a theoretical model in mind before we ran
our analysis. However, post-hoc tests do not directly evaluate theoretical models—they are
more like casting a wide net and seeing what comes up. That strategy might be appropriate
when we do not have a theoretical model going in. But when we have a theory beforehand,
we can instead specify a priori comparisons, otherwise known as planned contrasts. As we
introduce this concept, we wish to note that a priori comparisons are tricky to do in jamovi
and easier in some other software packages. Still, we will introduce this type of comparison
conceptually and provide information on how to specify and calculate planned contrasts,
while noting that only certain sets of contrasts are possible in jamovi.
For individual contrasts to work, as discussed earlier, the coefficients should sum to zero. But for a pair of contrasts to be orthogonal, we also want the sum of the products of their corresponding coefficients to be zero. This can take a little careful planning, but the set of contrasts here is fairly common for three groups when one group is a control condition (a group that gets no intervention).
Finally, how many contrasts should we specify? The answer is k − 1. For the contrasts
to be orthogonal (which we need them to be), we need to specify one fewer contrast than
we have number of groups. So, for our case, where there are three groups, we should
specify two contrasts, as we’ve done in the above example.
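Both properties are easy to verify for the contrasts used here, where the first contrast compares the two interventions with the control and the second compares the two interventions with each other (a sketch of the check, ours):

c1 = [1, 1, -2]   # film and pen pal vs. normal coursework
c2 = [1, -1, 0]   # film vs. pen pal

print(sum(c1), sum(c2))                     # each contrast sums to zero: 0 0
print(sum(a * b for a, b in zip(c1, c2)))   # products sum to zero: 0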
Source       SS                                     df               MS                                    F
Contrast 1   $SS_{C1} = \frac{n\psi^2}{\sum c^2}$   $df_{C1} = 1$    $MS_{C1} = \frac{SS_{C1}}{df_{C1}}$   $F = \frac{MS_{C1}}{MS_W}$
Contrast 2   $SS_{C2} = \frac{n\psi^2}{\sum c^2}$   $df_{C2} = 1$    $MS_{C2} = \frac{SS_{C2}}{df_{C2}}$   $F = \frac{MS_{C2}}{MS_W}$
Within       $SS_W = \sum(X - \bar{X})^2$           $df_W = n - k$   $MS_W = \frac{SS_W}{df_W}$
Total        $SS_T = \sum(X - \bar{\bar{X}})^2$     $df_T = n - 1$
While we have reproduced the formulas for the within and total lines of the source table,
we already calculated them for the omnibus test. Those terms do not change. All that is
happening in the a priori comparisons is we’re breaking down the between variation into
variance attributable to our planned contrasts.
The only new bit of calculation here is in the Sum of Squares column for our contrasts. For each contrast, we will calculate the SS term using the formula $SS_C = \frac{n\psi^2}{\sum c^2}$, which has some unfamiliar elements in it. But it is a relatively simple calculation. The ψ term in the numerator is calculated by multiplying all group means by their coefficients and adding them together. For our contrasts (coefficients of 1, 1, and −2 for the first contrast, and 1, −1, and 0 for the second):
$\psi_1 = (1)(3.86) + (1)(3.46) + (-2)(4.56) = 3.86 + 3.46 - 9.12 = -1.80$
$\psi_2 = (1)(3.86) + (-1)(3.46) + (0)(4.56) = .40$
The other term that is new for us is $\sum c^2$, but it is simply the sum of the squared coefficients. For our two contrasts:
$\sum c^2_{C1} = (1)^2 + (1)^2 + (-2)^2 = 1 + 1 + 4 = 6$
$\sum c^2_{C2} = (1)^2 + (-1)^2 + (0)^2 = 1 + 1 + 0 = 2$
So, then, using the full formula, we can determine the SS for our two contrasts:
$SS_{C1} = \frac{n\psi^2}{\sum c^2} = \frac{5(-1.80)^2}{6} = \frac{5(3.24)}{6} = \frac{16.20}{6} = 2.70$

$SS_{C2} = \frac{n\psi^2}{\sum c^2} = \frac{5(.40)^2}{2} = \frac{5(.16)}{2} = \frac{.80}{2} = .40$
Finally, we can complete our source table and calculate F statistics for the two contrasts:
Source       SS     df    MS     F
Contrast 1   2.70   1     2.70   33.75
Contrast 2   0.40   1     0.40   5.00
Within       0.95   12    0.08
Total        4.04   14
At 1 numerator df (the df for the contrast) and 12 denominator df, the critical value is
4.75. In both cases, our calculated value exceeds the critical value, so we can conclude
that p < .05, we can reject the null hypothesis, and conclude that there is a significant
difference based on these two contrasts.
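The contrast arithmetic can also be scripted as a check on the hand calculations (our sketch; the group means and MS_W come from the worked example above):

means = [3.86, 3.46, 4.56]       # film, pen pal, normal coursework
n_per_group, ms_within = 5, 0.08

def contrast_f(coeffs):
    # psi = sum of coefficient times group mean; SS_C = n * psi^2 / sum(c^2)
    psi = sum(c * m for c, m in zip(coeffs, means))
    ss_c = n_per_group * psi ** 2 / sum(c ** 2 for c in coeffs)
    return ss_c / ms_within      # each contrast has df = 1, so MS_C = SS_C

print(round(contrast_f([1, 1, -2]), 2))   # 33.75
print(round(contrast_f([1, -1, 0]), 2))   # 5.0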
a negative coefficient. In that comparison, ψ was positive, meaning the pen pal group had
higher scores. Again, though, the simpler path for interpretation will be to simply look at
the group means for significant comparisons, and interpret the pattern directly from those.
After typing in our data, we can also assign group labels in the Setup window for the
variable Group by adding a group label for each of the three groups.
Our data file now shows all of the scores with group labels.
Next, to run the ANOVA, we will go to the Analyses tab at the top of the screen, then
select ANOVA, and then on the sub-menu, click ANOVA. Note that there is also a spe-
cialized menu for the one-way ANOVA, which would work for this case. However, we’ll
demonstrate using the ANOVA sub-menu because it is a bit more versatile and has some
options that are useful.
In this case, racial bias scores are the dependent variable, so we will click on that variable,
and then click the arrow to move it to the Dependent Variable spot. Group is the inde-
pendent variable, which jamovi labels the Fixed Factor. So, we will click on Group, then
click the arrow to move it to Fixed Factors (plural because later designs will allow more
than one independent variable). In the same setup window, under Effect Sizes, we can
check the box for ω² to produce omega squared.
Next, under Assumption Checks, we can check the box for Equality of Variances to pro-
duce Levene’s test. Then, under Post-Hoc Tests, we select Group on the left column, and
move it using the arrow button to the right column (which sets it as a variable to com-
pare using a post-hoc test). We can then choose the error correction, with the options
being None (LSD comparison), Tukey, Scheffe, Bonferroni, and one we haven’t discussed
called Holm. Earlier in the chapter, we provided a comparison of the most popular post-
hoc tests to help with deciding which test to use. For this example, we will use the more
conservative Scheffe test. So, we’ll check the Scheffe box and uncheck the Tukey box.
On the right side of the screen, the output has updated in real time as we choose our
options and settings. The first piece of output is the ANOVA summary table.
On Levene's test, we see that F(2, 12) = .150, p = .862. Because p > .05, we fail to reject
the null hypothesis on Levene’s test. Recall that the null hypothesis for Levene’s test
is that the variances are equal (or homogeneous), so this means that the data met the
assumption. In other words, we have met the assumption of homogeneity of variance.
Because of that, we’re good to use the standard ANOVA F ratio, and will not need any
correction.
Notice that this summary table includes all of the information (SS, df, MS, and F) that
we calculated by hand, plus the probability (labelled “p” in the output). Because jamovi
provides the exact probability (here, p < .001), we do not need to use the critical value.
Instead, if p < .05, we reject the null hypothesis and conclude there is a significant differ-
ence. Notice, too, that jamovi labels the “within” or “error” term as “residuals,” and it does
not supply the “total” terms. However, we could easily calculate the total sum of squares
and degrees of freedom by adding together the between and within terms (here labelled
Group and Residuals). Looking at the ANOVA summary table, we see that F(2, 12) = 19.872, p < .001, so there was a significant difference between groups. The output also has omega squared, from which we can determine that about 72% of the variance in racial bias was explained by which project group students were assigned to (ω² = .716). Because the
ANOVA is an omnibus test, we then need to evaluate how the three groups differed, in
this case using a post-hoc test, which show up next in the output.
This table shows all possible comparisons, and gives the mean difference, SE (standard
error of the difference), df (degrees of freedom), a t statistic, and p (the probability,
which is followed by a suffix for which correction was used, so in our case, it reads p-scheffe). For each comparison, the two groups are significantly different if p < .05. While
the table provides a t statistic with degrees of freedom, it is not uncommon to see
researchers report only the probability values for this statistic. In fact, in other software
packages (such as the commonly used SPSS package), no test statistic is even provided
in the output (Strunk & Mwavita, 2020). For practical purposes, we’ll interpret these
comparisons based on the probabilities, interpreting those with p < .05 as significant
differences. So, for our purposes, here there is no significant difference when compar-
ing those doing the film project to those doing the pen pal assignment (p = .118), a
significant difference when comparing those doing the film project with those doing
normal coursework (p = .007) and a significant difference between those doing the
pen pal project and those doing normal coursework (p < .001). Knowing that there is
a difference when comparing normal coursework to either the film project or the pen
pal assignment, we can examine the means to determine what that pattern of difference
is. As we discovered above, the pattern is that those doing normal coursework have
higher bias scores than those doing either of the assignments designed to reduce bias
(film or pen pal).
We likely also want to produce descriptive statistics by group (we demonstrated pro-
ducing descriptive statistics overall, including normality tests, in previous chapters). To
do so, we will select the Analyses tab, then Exploration, and then Descriptives. We’ll
select RacialBias and click the arrow to move it to the Variables box. Then we will select
Group, and click the arrow to move it to the Split by box. That will produce output split
by group, so we’ll get descriptive statistics for each of the three groups. Under Statistics,
we will uncheck most of the boxes, while checking the boxes for “N” (which gives the
number of people per group), Mean, and Std. deviation. We could also check any other
boxes for statistics that are relevant.
These descriptive statistics will be helpful in interpreting the pattern of differences. For
example, here we previously found that there was a significant difference between the
Normal Coursework group and both the Film Project and Pen Pal Assignment group.
Here we see that the Normal Coursework group had a higher mean (M = 4.560, SD = .241) than either the Film Project group (M = 3.860, SD = .305) or the Pen Pal Assignment group (M = 3.460). We know that difference is statistically significant from the
post-hoc tests, so can support an inference that the pen pal assignment and film project
were associated with significantly lower racial bias scores than normal coursework.
• Deviation: Compares each group to the grand mean, omitting the first group. In
our example, it will produce a comparison of Pen Pal versus the grand mean, and
Normal Coursework versus the grand mean.
• Simple: Compares each group to the first group. In our example, it will compare
Pen Pal versus Film, and Normal Coursework versus Film.
• Difference: Compares each group with the mean of previous groups. In this case,
it will compare Pen Pal versus Film, and then will compare Normal Coursework
versus the average of Pen Pal and Film.
• Helmert: Compares each group with the average of subsequent groups. In this case, it will compare Film versus the average of Pen Pal and Normal Coursework, and will then compare Pen Pal versus Normal Coursework.
• Repeated: Compares each group to the subsequent groups. In this case, it will com-
pare Film versus Pen Pal, and then Pen Pal versus Normal Coursework.
4. If the test was significant, what is the effect size? (If the test was not significant,
simply report effect size in #3.)
5. If the test was significant, report your follow-up analysis (post-hoc or a priori).
6. What is the pattern of group differences?
7. What is your interpretation of that pattern?
Compared with our suggestions for the independent samples t-test, we are suggesting
a slightly longer results section for the one-way ANOVA. That’s because the one-way
ANOVA has more information we can glean and involves the added layer of follow-up
analysis. The write-up will be slightly different depending on whether we use a post-hoc
test or a priori comparisons. Because jamovi is limited in conducting a priori compari-
sons, we will demonstrate only the post-hoc test writing process:
Results
Table 8.1
Descriptive Statistics by Group
[Table: Group, N, M, SD, and SE for each of the three groups]
9
One-way ANOVA case studies
In the previous chapter, we explored the one-way ANOVA using a made-up example
and some fabricated data. In this chapter, we will present several examples of published
research that used the one-way ANOVA. For each example, we encourage you to:
1. Use your library resources to find the original, published article. Read that article and look for how they use and talk about the one-way ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects’ data but have
been simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.
In this paper, the authors investigated how several variables might differ among three
groups of first-generation college students: (1) first-generation college students with an
older sibling who attended college (FGCS-OS); (2) continuing-generation college stu-
dents (that is, first-generation students with at least one parent who completed some col-
lege but not a degree; CSGS); and (3) first-generation college students who are the first
in their family to attend college at all (F-FGCS). The researchers tested five outcomes, of
which we will highlight two: peer support and institutional support.
Of special note in this case: The authors used five one-way ANOVAs to test differences in five dependent variables. Doing so, as we've highlighted in prior chapters, will inflate
the Type I error rate. At a minimum, the use of multiple univariate tests like the ANOVA
should result in reducing the critical probability value. We normally make that adjust-
ment using the Bonferroni inequality, which involves dividing the critical probability
value by the number of tests. In this case, we would divide .05 (critical probability value,
or alpha) by the number of tests, five, resulting in an adjusted alpha of .01. However, we
also wish to stress that when a design calls for multiple univariate tests, like the ANOVA,
researchers should consider whether a multivariate analysis might be more appropriate.
Research questions
Because we are focusing on only two of the ANOVAs these authors used, we will focus
on these two research questions:
1. Was there a mean difference in peer support among the three groups of first-gen-
eration college students (FGCS-OS, CSGS, and F-FGCS)?
2. Was there a mean difference in institutional support among the three groups of
first-generation college students (FGCS-OS, CSGS, and F-FGCS)?
Hypotheses
The authors hypothesized the following related to peer support:
the item scores were summated to create the scale score. The authors reported coefficient alpha reliabilities of .85 for peer support and .89 for institutional support, both of which are in the acceptable range. The authors also provided evidence of
content validity in describing how the items were developed and their content assessed
by experts. The independent variable was measured based on questions about who in the
participants’ families had attended college.
Write-up
Results
one parent who completed some college but not a degree; CSGS); and (3)
first-generation college students who are the first in their family to attend
college at all (F-FGCS). Because we used two ANOVAs, we set alpha at
.025 using the Bonferroni inequality to control for familywise error.
There was a significant difference between the three groups of students in
peer support (F(2, 354) = 6.406, p = .002), but no significant difference in institutional support (F(2, 354) = 2.747, p = .065, ω² = .010). About 3% of the
variance in peer support was explained by the groups of first-generation
students. To determine how peer support varied among the three groups
of first-generation students, we used Scheffe post-hoc tests. CSGS scored
significantly higher than F-FGCS students (p = .002). However, there
was no significant difference between CSGS and FGCS-OS (p = .650) or
between FGCS-OS and F-FGCS (p = .144). See Table 9.1 for descriptive
statistics by group. Among the present sample, CSGS students scored significantly higher in peer support than F-FGCS students, but otherwise
there were no differences in peer or institutional support.
Table 9.1
Descriptive Statistics by Group
[Table: Group, N, and the M and SD for peer support and for institutional support]
In APA style, tables go after the reference page, with each table starting on a new page.
For this example, we might make a table such as Table 9.1.
Research questions
The authors had two research questions that are reviewed here, although they had mul-
tiple other questions in the study:
1. Were there mean differences in school funding among the three groups of schools?
2. Were there mean differences in administrative support among the three groups of
schools?
Hypotheses
The authors hypothesized the following related to school funding:
H0: There were no significant differences in school funding between the three groups
of schools. (Mlower-than-expected = Mas-expected = Mhigher-than-expected)
H1: There were significant differences in school funding between the three groups of
schools. (Mlower-than-expected ≠ Mas-expected ≠ Mhigher-than-expected)
Write-up
Results
schools that performed better than expected, as expected, and lower than
expected. There was a significant difference between the three groups of
schools in school district funding (F(2, 635) = 4.499, p = .011) and in effective administrator support (F(2, 635) = 3.556, p = .029). About 1% of the variance in school district funding was explained by the school groups (ω² = .011), while less than 1% of the variance in effective administrator support was explained by the school groups (ω² = .008). We used Scheffe post-hoc tests
to determine how institutional funding and effective administrative support
differed between the school groups. For school district funding, there was a
significant difference between lower-than-expected schools and as-expected
(p = .049) and better-than-expected (p = .045) schools. There was no sig-
nificant difference in funding between as-expected and better-than-expected
(p = .593) schools. For effective administrative support, there was a signifi-
cant difference between lower-than-expected and as-expected schools
(p = .050). However, there were no significant differences in effective
administrator support between lower-than-expected versus better-than-
expected (p = .998) schools, nor between as-expected and better-than-
expected schools (p = .262). See Table 9.2 for descriptive statistics by
group. Schools performing lower-than-expected had lower funding than
as-expected or better-than-expected schools and had lower effective admin-
istrative support than those performing as-expected. These differences were
small, though statistically significant.
Table 9.2
Descriptive Statistics by Group
[Table: Group, N, and the M and SD for school district funding and for effective administrator support]
In APA style, tables go after the reference page, with each table starting on a new page.
For this example, we might make a table such as Table 9.2.
Again, we encourage you to look up the original studies these cases highlight. Read
those articles and think about how and why what they have written might be different.
Doing so will also help you to see how these analyses get used in published work in the
field of educational research.
For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.
Notes
1 In this case, the authors call this difference significant in the published manuscript. They
include a footnote that it is “significant at the p < .1 level.” We would argue against setting the
alpha criterion this high, particularly when there are multiple ANOVAs being used, and thus
in our version of this Results section, we call that difference nonsignificant.
2 In the published version of the manuscript, the authors report eta squared (η²) as their effect
size estimate. We will discuss this effect size estimate in later chapters. However, for now it is
sufficient to know that eta squared tends to overestimate effect sizes. That difference is clear
in this example as the published eta squared estimates are larger than our calculated omega
squared estimates. That is a feature of eta squared—it usually overestimates effect sizes.
3 This is a tricky probability to interpret at first glance. Initially, jamovi will say that p = .050,
which if exactly true would mean the difference was not significant (because only when
p < .05 do we reject the null, so at p = .05, we would fail to reject the null). However, if we click
the settings (three vertical dots in the upper right corner) and change to 4 decimal places (4
dp, the ten thousandths place), we will see the exact value is .0496, which is less than .050, so
we reject the null and interpret this difference as statistically significant.
10
Comparing means across two
independent variables
In the previous chapters, we explored how to compare more than two groups in the
one-way ANOVA. That design required one categorical independent variable with two
or more levels. Usually, as we noted in the previous chapters, the one-way ANOVA is
used only when there are more than two levels on the independent variable (e.g., more
than two groups) because if there were only two, the independent samples t-test would
be the simpler alternative. However, we often have more than one independent variable,
and testing them separately does not allow us to get at the possible interactions between
our independent variables. For example, what if a treatment has different effects based
on gender? We often wonder about the interactions among more than one variable, and
have a statistical design available to test for those effects: the factorial ANOVA.
These cells will be under analysis in the factorial ANOVA. Each cell has a mean on the
dependent variable, and those means can be compared. However, the factorial ANOVA
will also analyze two other kinds of means, in addition to these cell means. Those are mar-
ginal means and the grand mean. We have encountered the idea of a grand mean before:
it is the mean of all the scores, regardless of group membership. Marginal means are the
means across only one group membership at a time (ignoring the second independent
variable). To illustrate, let’s imagine the above table, but with three people per group:
We get the cell means by taking the mean of the three scores in each cell. Then the
marginal means for Variable 2 (shown in the first two rows of the right-hand column)
are the means of the six scores in each group (ignoring Variable 1 groups). Next, the
marginal means for Variable 1 (shown in the second and third columns of the bottom
row) are the means of the six scores in each group on Variable 1 (ignoring Variable 2
groups). Finally, the grand mean (shown in the bottom right cell) is the mean of all
twelve scores.
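Because keeping the three kinds of means straight takes practice, here is a small sketch computing all of them for a 2×2 layout with three scores per cell, using the disorder-by-treatment scores analyzed later in this chapter (the code itself is ours, not jamovi output):

import numpy as np

# axis 0: treatment (A, B); axis 1: disorder (mood, anxiety); axis 2: scores
cells = np.array([[[1, 3, 2], [7, 5, 6]],
                  [[6, 4, 5], [3, 5, 4]]])

print(cells.mean(axis=2))        # cell means: [[2. 6.] [5. 4.]]
print(cells.mean(axis=(1, 2)))   # treatment marginal means: [4.  4.5]
print(cells.mean(axis=(0, 2)))   # disorder marginal means: [3.5 5. ]
print(cells.mean())              # grand mean: 4.25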
[Figure: line plot of mean outcomes for Treatment A and Treatment B across the Mood Disorder and Anxiety Disorder groups; the lines cross]
We see in this graph an interaction between disorder type and treatment. Those diag-
nosed with a mood disorder showed better outcomes with treatment B, but those diag-
nosed with an anxiety disorder showed better outcomes with treatment A. Without
getting into the specifics of the analysis just yet, there is little difference in the means of
the two treatments if we disregard disorder type. However, by looking at disorder type
and treatment together, we see a dramatic pattern reversal. This kind of interaction
is called a disordinal interaction. In a disordinal interaction, if we plot the means (as
we have done above), the lines will cross, indicating a pattern reversal. What we mean
by a pattern reversal is that, for example, those diagnosed with a mood disorder did
better with treatment B and worse with treatment A, but for those diagnosed with an
anxiety disorder, the pattern is reversed: they did better with treatment A and worse
with treatment B.
The graph below illustrates the other sort of interaction we might find.
[Figure: line plot of mean outcomes for Treatment A and Treatment B across the Mood Disorder and Anxiety Disorder groups; the lines do not cross]
In this example, we see an ordinal interaction. In an ordinal interaction, the lines do not
cross. That means we do not see a pattern reversal. In our example, participants getting
treatment A had worse outcomes for both disorders. However, we see a bigger difference
in outcomes among those with anxiety disorders. This is not a pattern reversal, as in a
disordinal interaction, but there is an interaction between the two independent variables.
dependent variable. This assumption does not differ from the one-way ANOVA or the
independent samples t-test.
Homogeneity of variance
As in the one-way ANOVA, the factorial ANOVA assumes homogeneity of variance. In
the case of the factorial ANOVA, the assumption is that the variances are equal across all
cells. The most common reason for a violation of this assumption is unbalanced sample
sizes. As we explored in the prior chapter on the one-way ANOVA, variance is related to
sample size, and all else being equal, smaller samples have larger variance. So, when the
cell sizes become unbalanced, we will typically see the smaller cells have higher variance
and larger cells have lower variance. This means that one way to protect against viola-
tions of this assumption is to strive for balanced cell sizes.
and total variation, which are conceptually the same as in the one-way ANOVA (though
calculated slightly differently). However, the between-groups variation will be split up, or
partitioned, into three sources: the main effect of the first independent variable, the main
effect of the second independent variable, and the interaction.
Partitioning variance
In order to better understand how variance is partitioned in the factorial ANOVA, we
present below a source table:
Source     SS     df     MS     F
IV1
IV2
IV1*IV2
Within
Total
As we described earlier, the source table for this design has within and total variation
(just like the one-way ANOVA did), but now partitions “between” variation into the var-
iation on the first independent variable, on the second independent variable, and then
on the interaction. The test also produces three F statistics—one for each of the two main
effects and one for the interaction.
              Mood Disorder    Anxiety Disorder
Treatment A   1, 3, 2          7, 5, 6
Treatment B   6, 4, 5          3, 5, 4
As with the one-way ANOVA, we'll need to calculate the between- and within-groups variation. However, in the case of the factorial ANOVA, the overall between variation
is partitioned into variance from the first IV (disorder type in our case), the second IV
(treatment type, in our case), and the interaction of those two independent variables. We
will still calculate within and total variation, however. In the table below are the formulas
for calculating the source table:
Source    SS                                                   df                                 MS                                                   F
IV1       $SS_{IV1} = \sum(\bar{X}_{IV1} - \bar{\bar{X}})^2$   $df_{IV1} = k_{IV1} - 1$           $MS_{IV1} = \frac{SS_{IV1}}{df_{IV1}}$               $F = \frac{MS_{IV1}}{MS_{within}}$
IV2       $SS_{IV2} = \sum(\bar{X}_{IV2} - \bar{\bar{X}})^2$   $df_{IV2} = k_{IV2} - 1$           $MS_{IV2} = \frac{SS_{IV2}}{df_{IV2}}$               $F = \frac{MS_{IV2}}{MS_{within}}$
IV1*IV2   $SS_{total} - SS_{IV1} - SS_{IV2} - SS_{within}$     $(df_{IV1})(df_{IV2})$             $MS_{IV1*IV2} = \frac{SS_{IV1*IV2}}{df_{IV1*IV2}}$   $F = \frac{MS_{IV1*IV2}}{MS_{within}}$
Within    $SS_{within} = \sum(X - \bar{X}_{cell})^2$           $N_{total} - (k_{IV1})(k_{IV2})$   $MS_{within} = \frac{SS_{within}}{df_{within}}$
Total     $SS_{total} = \sum(X - \bar{\bar{X}})^2$             $N_{total} - 1$
There is some new notation in this table. Notice there are several different kinds of means. We have group means (like $\bar{X}_{IV1}$, which are the group means based on the first independent variable), we have cell means ($\bar{X}_{cell}$, the mean per cell), and the grand mean ($\bar{\bar{X}}$, which is the mean of all scores regardless of group membership).
To begin our calculations, we will calculate group means, means for each variable
(also known as marginal means because they are in the margins of the table), cell means,
and a grand mean. Here we are using the standard formula for a mean, but we are doing
it multiple times for different groups of participants:
              Mood Disorder             Anxiety Disorder           Treatment marginal
Treatment A   1, 3, 2 (sum 6, M = 2)    7, 5, 6 (sum 18, M = 6)    24/6 = 4
Treatment B   6, 4, 5 (sum 15, M = 5)   3, 5, 4 (sum 12, M = 4)    27/6 = 4.5
Marginal      21/6 = 3.5                30/6 = 5                   Grand mean: 51/12 = 4.25
So, the mean for the mood disorder group is 3.5, for the anxiety disorder group is 5, for
Treatment A is 4, for Treatment B is 4.5, and the grand mean is 4.25.
Using this information, we can calculate all of our sums of squares. In the tables
below, we illustrate the calculations for each sum of squares, beginning with the total
sum of squares:
[Table: each score with its deviation from the grand mean $(X - \bar{\bar{X}})$ and squared deviation $(X - \bar{\bar{X}})^2$]
Notice that the grand mean is constant across all groups, because it is the mean of all scores regardless of group membership. So, our total sum of squares (the sum of the right-most column) is 34.25.
Next, we will calculate within sum of squares:
[Table: each score with its deviation from the cell mean $(X - \bar{X}_{cell})$ and squared deviation $(X - \bar{X}_{cell})^2$]
Notice here that we use the individual cell means, subtracting them from each score. The
within sum of squares (calculated by summing the right-most column) is 8.
Next, we will calculate the main effect of disorder type:
[Table: each score with the deviation of its disorder-type marginal mean from the grand mean $(\bar{X}_{IV1} - \bar{\bar{X}})$ and the squared deviation $(\bar{X}_{IV1} - \bar{\bar{X}})^2$]
In this case, our calculations look a little different. The formula calls for the difference between the group mean and the grand mean. Everyone in the group has the same group mean, so we get the same result for all members of a group. The fact that all the squared deviation scores come out the same is a feature of the 2×2 design when the number of cases is perfectly balanced among the cells. In any other case, we would get different values per group. Taking the sum of all those squared deviation scores (adding up everything in the right-most column), we calculate the sum of squares for disorder type as 6.75.
Next, we will calculate the sum of squares for treatment type. This will follow a similar pattern, except now we are concerned with the means per treatment type:
[Table: each score with the deviation of its treatment-type marginal mean from the grand mean $(\bar{X}_{IV2} - \bar{\bar{X}})$ and the squared deviation $(\bar{X}_{IV2} - \bar{\bar{X}})^2$]
Summing the squared deviation scores gives a sum of squares for treatment type of .75.
Finally, we're ready to calculate the sum of squares for the interaction:
$SS_{IV1*IV2} = SS_{total} - SS_{IV1} - SS_{IV2} - SS_{within} = 34.25 - 6.75 - .75 - 8 = 18.75$
Now, we have all the information we need to complete the source table and calculate our
F ratios:
Source                SS      df    MS      F
Disorder (IV1)        6.75    1     6.75    6.75
Treatment (IV2)       0.75    1     0.75    0.75
Disorder*Treatment    18.75   1     18.75   18.75
Within                8.00    8     1.00
Total                 34.25   11
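The whole partition can be verified with a few lines of code using the same twelve scores (our sketch of the hand calculations above; 6 is the number of scores behind each marginal mean):

import numpy as np

cells = np.array([[[1, 3, 2], [7, 5, 6]],    # Treatment A: mood, anxiety
                  [[6, 4, 5], [3, 5, 4]]])   # Treatment B: mood, anxiety
grand = cells.mean()

ss_total = ((cells - grand) ** 2).sum()
ss_within = ((cells - cells.mean(axis=2, keepdims=True)) ** 2).sum()
ss_treatment = (6 * (cells.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_disorder = (6 * (cells.mean(axis=(0, 2)) - grand) ** 2).sum()
ss_interaction = ss_total - ss_disorder - ss_treatment - ss_within

print(ss_total, ss_disorder, ss_treatment, ss_within, ss_interaction)
# 34.25 6.75 0.75 8.0 18.75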
What kind of interaction do we have? If we plot the means, as we have below, we see the lines cross. This is a disordinal interaction, meaning that there is a pattern reversal in the data.
[Figure: line plot of mean outcomes for Treatment A and Treatment B across the Mood Disorder and Anxiety Disorder groups; the lines cross]
We will need additional follow-up tests to determine exactly which differences within this
pattern are statistically significant. We explain the follow-up analysis below. However, in
general, we see a pattern where it looks like people with mood disorders do better in treat-
ment B, and people with anxiety disorders do better in treatment A. This is what we mean
by a pattern reversal—the treatment that was better for one group is worse for another.
What if our interaction was not significant? In that case, we would examine the two
main effects. If either of them were significant, we proceed with that main effect, inter-
preting it in much the same way as a one-way ANOVA. If the significant main effect had
more than two groups, we’d perform a post-hoc test to determine how the groups differ,
just like we would in the one-way ANOVA. In jamovi, all of these steps can be handled
in the same menu, as we’ll demonstrate later in this chapter.
with “effect” because we have three effects (two main effects and the interaction) we can
test. So, the formula for omega squared is:
$\omega^2 = \frac{SS_E - (df_E)(MS_W)}{SS_T + MS_W}$
The only difference from what we presented in Chapter 8 is the replacement of "B" for "between" with "E" for "effect." All of the needed information is in our source table, so we simply plug in those values and calculate our effect size estimate for the interaction:
$\omega^2 = \frac{18.75 - (1)(1.00)}{34.25 + 1.00} = \frac{17.75}{35.25} = .50$
We can also label the groups by going to the Data tab, and then clicking Setup. We could
label the groups on Disorder.
the box for ω² under Effect Sizes. We generally recommend using omega squared (ω²) for the effect size estimate when possible. Eta squared (η²) is another effect size estimate that's commonly used. It is interpreted in the same way as omega squared, but tends to produce an overestimate of the effect size in most cases (Keppel & Wickens, 2004).
To produce Levene's test, under Assumption Checks, check the box for Homoge-
neity tests. We can also check the box next to Estimated Marginal Means under the
Estimated Marginal Means heading to produce cell means and standard errors.
We can also produce plots of group means under the Plots heading. This allows you to
specify a variable to put across the horizontal axis (X axis) and one to split into separate
lines. Sometimes, it might be a good idea to put one independent variable as horizontal
and the other as separate lines, and produce a second plot switching those two places.
In our case, though, we’ll put Disorder on the horizontal axis and Treatment as separate
lines.
The first piece of output shows a model summary. This information is not especially
useful for the design we are specifying, and we won’t interpret any of this block of output.
Next is the ANOVA summary table. This table produces some information that is not useful
for our purposes, but also produces all of the information we need to interpret and report
the omnibus test. Specifically, the “Model” line of output is not one we will interpret in a
factorial ANOVA design. It is produced because this program can run a wide array of anal-
yses, some of which would use that output. Next, we see the main effect of Disorder alone,
the main effect of Treatment alone, the interaction of Disorder and Treatment (Disorder *
Treatment), error or within (labelled in jamovi as Residuals), and Total.
Here, we see a significant interaction (F(1, 8) = 18.750, p = .003). Because the interaction
is significant, we will not interpret the main effects of disorder or treatment type. We
demonstrated the calculation of effect size from the source table earlier in this chapter.
However, here jamovi has produced the effect size estimate for us, as well. The next piece
of the output is the “Fixed Effects Parameter Estimates,” which we can ignore for our
purposes. Then, it produces Estimated Marginal Means, showing group and cell means
and standard errors. We could also produce group descriptive statistics through the
Exploration → Descriptives menu. Next, jamovi produces the plot we requested.
[Plot: estimated marginal means of Outcome by Disorder (Mood Disorder, Anxiety Disorder), with separate lines for Treatment A and Treatment B; the lines cross]
On this plot, we see a disordinal interaction (which we know was statistically significant
based on the ANOVA table as discussed above). It’s disordinal because the lines cross,
showing a pattern reversal. Finally, jamovi produces Levene’s test at the very bottom of
the output. Because the assumption checks are printed at the bottom of the output, we
have to remember to drop down and check this test before proceeding to interpret the
other tests.
The first set of output shows the test of the difference between Treatment A and treatment
B among those in the Mood Disorder group. We see a significant difference between the
two treatments among those in the Mood Disorder group (F1, 8 = 13.500, p = .006). The
second line shows the same comparison for the Anxiety Disorder group, where we also
see a significant difference between the two treatments (F(1, 8) = 6.000, p = .040). In this
case, because the variable we listed as Moderator had only two groups, we can stop here.
But if that variable had more than two groups, the next set of output would further break
down the comparisons into pairwise comparisons for every group on the Moderator
variable, split by the groups on the Simple effects variable.
Based on this, we know there was a significant difference among those with mood
disorders between the two treatment types. Looking at our means (or the profile plots),
we see that those getting Treatment B had better outcomes compared with Treatment A
among those with mood disorders. The simple effects analysis also confirmed there was
a significant difference between the two treatment types for those with anxiety disorders.
Looking at the means (or profile plots), we can see that those getting Treatment A had
better outcomes among those with anxiety disorders. So, the overall pattern of results is
that those with mood disorders had better outcomes with Treatment B, while those with
anxiety disorders had better outcomes with Treatment A.
5. If the interaction was significant, report the results of the follow-up analysis such
as simple effects analysis. If the interaction was not significant, report and interpret
the two main effects.
6. What is the pattern of group differences?
7. What is the interpretation of that pattern?
This general format is very similar to what we presented in Chapter 8, but adds some
specificity in items 3 through 5 that may be helpful in thinking through the factorial
design. For the example we've followed through most of the chapter, we provide sample
responses to these items, followed by a sample results paragraph.
Results
Table 10.1
Descriptive Statistics for Treatment Outcomes
We might also choose to include a table of descriptive statistics. This is not absolutely
necessary as we have included cell means and standard deviations in the text (which we
can produce in the Exploration → Descriptives menu as explained in prior chapters, adding
both independent variables to the “Split By” box), but can still be helpful for readers as it
includes additional information. In the example Table 10.1, we have added some addi-
tional horizontal lines to make it clearer for readers where the change in disorder type falls.
In the next chapter, we’ll work through some examples of published research that used
factorial ANOVA designs, following these same steps.
Note
1 Note that for this option to appear, you must have the GAMLj module installed. To do so, if
you have not already, click the “Modules” button in the upper right corner of jamovi (which
has a plus sign above it). Then click “jamovi library.” Locate GAMLj—General Analyses for
Linear Models and click “Install.” While a factorial ANOVA can be produced in the ANOVA
menu, it does not have options that are quite as robust, particularly for the follow-up analysis
we will demonstrate in this chapter.
11
Factorial ANOVA case studies
In the previous chapter, we explored the factorial ANOVA using a made-up example
and some fabricated data. In this chapter, we will present several examples of published
research that used the factorial ANOVA. For each sample, we encourage you to:
1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the factorial ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.
Research questions
The authors had several research questions, of which we focus here on one: Would educa-
tor perceptions of the seriousness of bullying vary based on the combination of the bully-
ing type (verbal, relational, or physical) and the scenario type (LGBTQ or non-LGBTQ)?
Hypotheses
The authors hypothesized the following related to the seriousness ratings:
Write-up
Results
Table 11.1
Descriptive Statistics for Seriousness Ratings
[Plot: mean seriousness ratings across bullying types (Verbal, Relational, Physical), with separate lines for LGBTQ and Non-LGBTQ scenarios]
Notice that, because the interaction was significant, all of our interpretive attention is
on the interaction. In fact, we have not interpreted the main effects at all. In this sce-
nario, it’s particularly clear that main effects would be misleading in the presence of
an interaction. For example, we would find (using main effects) that LGBTQ scenarios
were rated higher in seriousness, but that pattern is reversed for physical bullying. So
when there is an interaction, our attention will normally be entirely on that interaction.
For the table, it would be placed after the References page, on a new page, with one
table per page.
Next, the figure would go on a new page after the final table, with one figure per page
(if more than one figure is included).
Research questions
In this study, the authors examined a single primary research question: Would social
participation differ across the interaction of country and special educational needs?
Hypotheses
The authors hypothesized the following related to social participation:
2. What are the assumptions of the test? Were they met in this case?
a. Level of measurement for the dependent variable is interval or ratio
The dependent variable was social acceptance as measured by peer nominations.
This variable was measured at the ratio level because it had a true, absolute zero.
b. Normality of the dependent variable
The authors reported that there were problems with normality. They reported
that peer acceptance was not normally distributed, but did not specify how the
data were non-normal. In most instances, it would be appropriate to provide the
normality statistics. As a reminder, the data in the online resources are simu-
lated data and so will be normally distributed, although the distribution was not
normal in the published study.
c. Observations are independent
The authors did not discuss this assumption in the published article. There are
potential factors like teacher or school variance that might cause some depend-
ence. But, in this case, the authors are most interested in comparing the coun-
tries and have included that as an independent variable. They also acknowledge
that countries are internally heterogeneous in how schools might operate and
note this as a limitation of the study.
d. Random sampling and assignment
The sample was not random but involved multi-site data collection. The authors
also acknowledge the limitation of their sampling strategy, which was not broad
in each country, in making international comparisons. Neither independent
variable was randomly assigned as both independent variables involved intact
groups. For special educational needs, there were very uneven group sizes,
which presents some complications for an ANOVA design.
e. Homogeneity of variance
The assumption of homogeneity of variance was not met (F(11, 1323) = 1.988,
p = .026). The cell sizes are very uneven, with the largest cell having n = 469,
and the smallest having n = 14. However, the largest standard deviation (3.120)
divided by the smallest (1.340) is less than three (3.120/1.340 = 2.328). It is likely
safe to proceed with the unadjusted factorial ANOVA based on the criteria we
provided in Chapter 8, but we will note the heterogeneity of variance (lack of
homogeneity of variance) in the Results section.
3. What was the result of that test?
There was no significant difference in social participation based on the interaction
of the country and SEN grouping (F(6, 1323) = .592, p = .737).
4. What was the effect size, and how is it interpreted?
For the interaction:

$$\omega^2 = \frac{SS_E - (df_E)(MS_w)}{SS_T + MS_w} = \frac{16.328 - (6)(4.598)}{6424.712 + 4.598} = \frac{-11.260}{6429.310} = -.002 \approx .000$$
Remember that omega squared cannot be negative. In the case of extremely small
effects, the formula might return a negative value, as it has done here. But we will
report and interpret this as .000, or that none of the variance in social acceptance
was explained by the interaction of country and SEN group.
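A minimal Python sketch of this calculation, using the source table values above, shows why the raw formula can dip below zero and how it is floored at zero for reporting:

# Source table values from the interaction test above
ss_effect = 16.328   # interaction sum of squares
df_effect = 6        # interaction degrees of freedom
ms_within = 4.598    # mean square within (residual)
ss_total = 6424.712  # total sum of squares

omega_sq = (ss_effect - df_effect * ms_within) / (ss_total + ms_within)
print(round(omega_sq, 3))   # -0.002: slightly negative for a near-zero effect
print(max(omega_sq, 0.0))   # reported as 0.000, since omega squared cannot be negative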
Write-up
Results
(p > .999). There was also no significant difference between those with
behavioral needs and those with other special needs (p > .999).
Overall, while there was no significant interaction between country
and SEN groups, there were significant main effects for both variables.
Specifically, students in Belgium had higher social acceptance, on aver-
age, than those in the Netherlands. Typically developing students had
higher social acceptance than students from the three other SEN groups,
suggesting social acceptance is lower, on average, for students with spe-
cial needs, regardless of the type of special educational need. See Table
11.2 for descriptive statistics.
Table 11.2
Descriptive Statistics for Social Acceptance
The table would go after the references page, starting on a new page, and if there was
more than one table, they would be one per page.
As with the previous case studies in this text, we encourage you to compare our ver-
sion of the Results with what is in the published article. How do they differ from each
other? Why do they differ? How did this analysis fit into the overall research design
and article structure? Comparing can be helpful in seeing the many different styles and
approaches researchers use when writing about this design.
For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.
Note
1 Please note that the values from the simulated data provided in the online course resources
differ slightly from the authors’ calculations. This is an artifact of the simulation process and
the authors’ results are not incorrect or in doubt.
Part IV
Within-subjects designs
12
Comparing two within-subjects
scores using the paired samples t-test
In this section of the book, we will explore within-subjects designs. The simplest of these
designs is the paired samples design. The difference between between-subjects designs
and within-subjects designs is the nature of the independent variable. In between-sub-
jects designs, the independent variable was always a grouping variable. For example,
an independent samples t-test might be used to determine the difference between an
experimental and a control group. In within-subjects designs, the independent variable
is based on repeated measures. For example, we might have a within-subjects independ-
ent variable with two levels, such as a pre-test post-test design. Rather than having two
different groups we wish to compare, we would have two different sets of scores from the
same participants we wish to compare. The designs are called within-subjects because we
are comparing data points from the same participants rather than comparing groups of
participants.
The statistical assumptions apply in a different way in the within-subjects design than they did in between-subjects designs.
We’ll briefly review all of the assumptions and how they apply to this design.
Randomization of order, often referred to as counterbalancing, will help control for order
effects. Take the example of a taste test design, where participants will rate two new flavor
options for soda—grape and watermelon. It might be that the watermelon has a strong
aftertaste, so a person tasting it first might think the grape flavor tastes worse than they
would if they’d had it first. Because order effects are hard to anticipate in many cases, ran-
domly assigning order of administration or counterbalancing the order of administration
can help test and control for those order effects. We also mentioned earlier that on ability
tests and cognitive tests, there is often a practice effect, which counterbalanced order of
administration can help control for. The issue is that many within-subjects designs are
longitudinal, like the pre-test post-test design. In those cases, counterbalancing is not
possible, which means we cannot rule out order effects or practice effects.
$$t = \frac{\bar{D}}{SE_D}$$
where D is the difference between the two data points from each subject. The numer-
ator, then, is the mean difference between the two data points (the two levels of the
within-subjects independent variable). The denominator is the standard error of the
difference.
Partitioning variance
Calculating the mean difference is fairly straightforward. We simply calculate the dif-
ference between the two levels of the within-subjects variable for every participant and
then take the mean of those difference scores. To illustrate, we’ll return to an example
from earlier in the chapter. Imagine we’ve recruited participants to complete a workshop
designed to increase their mathematics self-efficacy. We give participants a measure of
their mathematics self-efficacy before and after the workshop. Were their mathematics
self-efficacy scores higher following the workshop? Based on those two scores, we can
calculate the difference scores and the mean difference as follows:
Pre-Test Post-Test D
3 6 6−3=3
4 4 4−4=0
2 4 4−2=2
3 5 5−3=2
4 3 3 − 4 = −1
1 3 3−1=2
$$\bar{D} = \frac{\sum D}{N} = \frac{3 + 0 + 2 + 2 + (-1) + 2}{6} = \frac{8}{6} = 1.333$$
So the mean difference is 1.333. That will be the numerator for the t formula. We next
need to calculate the standard error of the difference for the denominator. The standard
error is calculated as:
$$SE_D = \sqrt{\frac{SS_D}{N(N-1)}}$$
However, to use this formula, we’ll first need to calculate the sum of squares for the dif-
ference scores, which is calculated as:
$$SS_D = \sum D^2 - \frac{\left(\sum D\right)^2}{N}$$
This formula has some redundant parentheses to make it very clear when to square these
figures. The sum of squares will be calculated as the sum of the squared difference scores
(notice here the difference scores are squared, then summed) minus the sum of the dif-
ference scores squared (notice here the scores are summed, then squared) over sample
size. For our example, we could calculate this as follows:
Pre-Test | Post-Test | D | D²
3 | 6 | 6 − 3 = 3 | 3² = 9
4 | 4 | 4 − 4 = 0 | 0² = 0
2 | 4 | 4 − 2 = 2 | 2² = 4
3 | 5 | 5 − 3 = 2 | 2² = 4
4 | 3 | 3 − 4 = −1 | (−1)² = 1
1 | 3 | 3 − 1 = 2 | 2² = 4
| | ∑ = 8 | ∑ = 22
$$SS_D = \sum D^2 - \frac{\left(\sum D\right)^2}{N} = 22 - \frac{8^2}{6} = 22 - \frac{64}{6} = 22 - 10.667 = 11.333$$
We can then use the sum of squares to calculate the standard error of the difference:
$$SE_D = \sqrt{\frac{SS_D}{N(N-1)}} = \sqrt{\frac{11.333}{6(6-1)}} = \sqrt{\frac{11.333}{30}} = \sqrt{0.378} = 0.615$$
Finally, we put all of this into the t formula:
$$t = \frac{\bar{D}}{SE_D} = \frac{1.333}{0.615} = 2.167$$
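As a check on the hand calculation, here is a short Python sketch; scipy's result differs from ours only because we rounded the standard error to 0.615 above:

import numpy as np
from scipy import stats

pre = np.array([3, 4, 2, 3, 4, 1])
post = np.array([6, 4, 4, 5, 3, 3])

d = post - pre                                    # difference scores
mean_d = d.mean()                                 # 1.333
ss_d = (d ** 2).sum() - d.sum() ** 2 / len(d)     # 11.333
se_d = np.sqrt(ss_d / (len(d) * (len(d) - 1)))    # 0.615
print(mean_d / se_d)                              # t = 2.169

# scipy agrees; the sign of t depends on the order of the arguments
print(stats.ttest_rel(post, pre))                 # t = 2.169, two-tailed p = .082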
We mentioned earlier that to test for normality we would need to first combine the pre-
and post-test scores into a single variable, because the assumption of normality is about
the entire distribution of dependent variable scores. (Thinking back to the between-sub-
jects designs, we didn’t have to do this because the dependent variable was already in a
single variable in jamovi—in the within-subjects designs, it’s split up into two or more.)
To do this, we’ll simply copy and paste the scores from both pre-test and post-test into
a new variable. It doesn’t matter what that variable is named because it’s only temporary
for the normality test.
Then we’ll analyze that new variable for normality in the same way as we have in the past.
In the Analyses tab, we’ll click Exploration and then Descriptives. In the menu that comes
up, we’ll select the new variable we created (which here is named C by default), and move it
to the Variable box using the arrow button. Then under Statistics, we’ll click Skewness and
Kurtosis. We could also uncheck the other options we don’t need at this time.
This is evaluated just like in the previous chapters. The absolute value of skewness is less
than two times the standard error of skewness (.000 < 2(.637)) and the absolute value
of kurtosis is less than two times the standard error of kurtosis (.654 < 2(1.232)), so the
distribution is normal.
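This check can also be scripted. The sketch below uses the standard error formulas for skewness and kurtosis that jamovi reports (our own implementation, shown only to make the rule concrete):

import numpy as np
from scipy.stats import skew, kurtosis

# The pre- and post-test scores combined into one variable, as described above
x = np.array([3, 4, 2, 3, 4, 1, 6, 4, 4, 5, 3, 3], dtype=float)
n = len(x)

g1 = skew(x, bias=False)        # 0.000 for these data
g2 = kurtosis(x, bias=False)    # 0.654 (excess kurtosis) for these data

se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))   # 0.637
se_kurt = 2 * se_skew * np.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))  # 1.232

# Normal enough if each absolute value is under twice its standard error
print(abs(g1) < 2 * se_skew, abs(g2) < 2 * se_kurt)  # True True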
Next, we can produce the paired samples t-test in the Analyses tab, then the “T-Tests”
menu, then “Paired Samples T-Test”. In the resulting menu, the box on the right shows
“Paired Variables.” Here we will “pair” our pre- and post-test scores, by clicking first on
Pre, then the arrow button, then Post, then the arrow button.
In the options below, we might want to check the boxes to produce the mean differ-
ence, confidence interval, and descriptives as well. The effect size option here produces
Cohen’s d, which is not ideal for our purposes, so we’ll instead hand calculate omega
squared. By default, under the “Hypothesis” options, it will specify a two-tailed hypothe-
sis. Our recommendation is to leave this setting alone, and if the test is one-tailed, simply
divide p by two. That tends to produce less confusion than using the hypothesis options
in the software.
The resulting output produces the paired samples t-test, and descriptives. First, let’s look
at the test itself.
We see that t at 5 degrees of freedom is −2.169, and p = .082. Remember that this is
the two-tailed probability. Because our hypothesis was one-tailed (that scores would
improve at post-test), we can divide that probability in half, so p = .041. It also tells us the
mean difference between pre- and post-test is −1.333 (negative because post-test is higher,
and the difference is pre-test minus post-test). 95% of the time in another sample of the
same size from the same population, the difference would be between −2.913 and 0.247,
based on the 95% confidence interval. So, we have a significant difference in students’
mathematics self-efficacy from pre-test to post-test. Next, we can look at the descriptives
to see how the scores changed.
Results
13
Paired samples t-test case studies
In the previous chapter, we explored the paired samples t-test using a made-up example
and some fabricated data. In this chapter, we will present several examples of published
research that used the paired samples t-test. We should note that, in these examples, the
simulated data provided in the online resources will not produce the exact result of the
published study. However, they will reproduce the essence of the finding—so don’t be
surprised to look up the published study and see somewhat different results.1 For each
sample, we encourage you to:
1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the t-test.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: the online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.
Research questions
The authors asked two research questions in the portion of the article we review in this
case study:
1. Were emotional satisfaction scores significantly higher after the POGIL interven-
tion than before the intervention?
2. Were intellectual accessibility scores significantly higher after the POGIL interven-
tion than before the intervention?
Hypotheses
The authors hypothesized the following related to emotional satisfaction:
H0: There was no significant difference in pre-test emotional satisfaction compared
to post-test. (Mpre = Mpost)
H1: There was a significant difference in pre-test emotional satisfaction compared
to post-test. (Mpre ≠ Mpost)
The authors hypothesized the following related to intellectual accessibility:
H0: There was no significant difference in pre-test intellectual accessibility compared
to post-test. (Mpre = Mpost)
H1: There was a significant difference in pre-test intellectual accessibility compared
to post-test. (Mpre ≠ Mpost)
Write-up
Results
Research questions
This study had one research question: Would students’ statistical knowledge test scores be
higher after an undergraduate sociology research course than they were before the course?
Hypotheses
The author hypothesized the following related to statistical knowledge:
About 60% of the variance in statistical knowledge test scores was explained by
the change from before the sociology course to after the course (ω² = .604).
5. What is the pattern of group differences?
Participants had significantly higher scores after the course (M = 64.800,
SD = 13.000) than they did before the course (M = 43.900, SD = 11.100).
Write-up
Results
The author used the paired samples t-test to determine if there was a
significant difference in statistical knowledge after an undergraduate
sociology research course as compared to before the course. There was
a significant difference in statistical knowledge scores from pre-test to
post-test (t(184) = −16.812, p < .001). About 60% of the variance in statis-
tical knowledge test scores was explained by the change from before the
sociology course to after the course (ω² = .604). Participants had signifi-
cantly higher scores after the course (M = 64.800, SD = 13.000) than
they did before the course (M = 43.900, SD = 11.100).
For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.
Notes
1 We simulate the data for the online resources by simulating data with a certain mean and
standard deviation. That works perfectly for the between-subjects designs, which really mea-
sure only mean differences. But for within-subjects designs like the paired-samples t-test, this
turns out fairly differently. The paired samples t-test uses the mean of differences per case,
rather than the mean difference overall. Because we do not know the mean difference per
case from the published work, we cannot simulate data that perfectly reproduce those results.
However, the overall mean difference and the direction of the result will be the same as the
published study. In most cases, this results in a smaller effect size for the simulated data in the
online resources than for the actual published study. The published results are not in doubt,
but we cannot perfectly reproduce them in our simulated data.
2 As a reminder, this value will not match the published study exactly. As a second note about
this value, the author reports a positive t-test value, so may have dropped the sign and
reported the absolute value, or set the comparison up as post- vs. pre-test instead of pre-test
vs. post-test. In addition, because this is a directional hypothesis (one-tailed test), we would
divide the probability values in half to get the one-tailed probabilities.
3 As a reminder, this value will not match the published study exactly. As a second note about
this value, the author reports a positive t-test value, so may have dropped the sign and
reported the absolute value, or set the comparison up as post- vs. pre-test instead of pre-test
vs. post-test. This would not affect the actual test values.
14
Comparing more than two points
from within the same sample
In Chapter 12, we learned how to use the paired samples t-test, and in Chapter 13, we
saw case studies of published work using that analysis. However, the paired samples
t-test was only able to test differences in two within-subjects data points, much like the
independent samples t-test could only compare two groups. When there are more than
two groups to compare, the one-way ANOVA is the appropriate test. But when there are
more than two within-subjects data points, the within-subjects or repeated measures
ANOVA will be the correct analysis.
Like with the previous analyses, in this design we will need to assess the adequacy of the
sampling strategy to determine how much generalization is reasonable from the data.
How far we can expect these results to translate beyond the sample is dependent on how
robust the sampling strategy was.
Sphericity
This design is the first time we are encountering the assumption of sphericity. It is a
related idea to homogeneity of variance, but one that works differently in a within-sub-
jects design. Recall that, in between-subjects designs, the assumption of homogeneity of
variance was that the variance of each group is equal. That assumption cannot apply to
within-subjects designs, as there are no groups. In place of that assumption, we find the
assumption of sphericity. The assumption of sphericity is that all pairwise error variances
are equal. In other words, the error variance of each pair of levels on the within-subjects
variable is equal. For example, the error variance of pre-test vs. post-test is equal to pre-
test vs. six-month follow-up is equal to the error variance of post-test vs. six-month fol-
low-up. Because this assumption deals with pairwise error variance, it was not applicable
in the paired samples t-test (with only two levels of the within-subjects variable, there is
only one pair, so no comparison is possible).
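To make the idea concrete, here is a small Python sketch using the pre-test, post-test, and six-month follow-up scores from the worked example later in this chapter. Sphericity can equivalently be stated as equal variances of the pairwise difference scores:

import numpy as np

pre = np.array([8, 6, 9, 4, 7])
post = np.array([17, 18, 20, 17, 19])
followup = np.array([14, 16, 15, 14, 18])

# Sphericity holds when these three variances are roughly equal
print(np.var(pre - post, ddof=1))       # 2.3
print(np.var(pre - followup, ddof=1))   # 5.8
print(np.var(post - followup, ddof=1))  # 2.2

Eyeballing the variances is only an informal check; Mauchly's test, which jamovi produces, is the formal test of this assumption, and we will see it in the output later in the chapter.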
There are some differences in how we illustrate the calculations and how they work out in jamovi. We
will highlight those differences in the section on using jamovi for the analysis.
Partitioning variance
The analysis works similarly, in terms of the calculations, to the one-way ANOVA. We will
have a source table with four sources of variance: Between, Subjects, Within, and Total.
Each source of variance will have a sum of squares, degrees of freedom, and mean square.
In our illustration of hand calculations, we will have two F ratios—Between and Subjects.
The jamovi software will produce the Between F ratio (marked by a “RM Factor” label).
Source | SS | df | MS | F
Between | Σ(X̄_k − X̄)² | k − 1 | SS_between / df_between | MS_between / MS_within
Subjects | Σ(X̄_subject − X̄)² | n_subjects − 1 | SS_subjects / df_subjects | MS_subjects / MS_within
Within | SS_total − SS_between − SS_subjects | (df_between)(df_subjects) | SS_within / df_within |
Total | Σ(X − X̄)² | n_total − 1 | |
Participant | Pre | Post | Six months | Subject means
1 | 8 | 17 | 14 | 39/3 = 13.00
2 | 6 | 18 | 16 | 40/3 = 13.33
3 | 9 | 20 | 15 | 44/3 = 14.67
4 | 4 | 17 | 14 | 35/3 = 11.67
5 | 7 | 19 | 18 | 44/3 = 14.67
Test means | 34/5 = 6.80 | 91/5 = 18.20 | 77/5 = 15.40 | 202/15 = 13.47 (grand mean)
Notice that we have calculated the mean for each “test” (or level of the within-subjects
variable) across the bottom row, the mean for each subject across the right-most column,
and the grand mean in the bottom right cell. We will use those means to calculate the
sources of variance. We’ll begin by calculating total variation:
[Table: each score X, its deviation from the grand mean (X − X̄), and the squared deviation (X − X̄)², for every participant and test]
To get the sum of squares total, we add up the squared deviations for all scores, which
gives a sum of 385.72. So the total sum of squares is 385.72.
Next, we’ll calculate the between sum of squares, which will be the mean of each
observation minus the grand mean. For this calculation, we’ll be using the means for Pre
(6.80), Post (18.20), and Six months (15.40):
[Table: each score with the deviation of its test mean from the grand mean (X̄_k − X̄) and the squared deviation (X̄_k − X̄)²]
To get the sum of squares between, we add up all of these squared deviation scores,
which sum to 352.90. So the sum of squares between is 352.90.
Next, we will calculate the subjects sum of squares. For this calculation, we’ll use the
mean of each participant minus the grand mean. We’ll use the means for participant 1
(13.00), participant 2 (13.33), participant 3 (14.67), participant 4 (11.67), and participant
5 (14.67):
[Table: each score with the deviation of its subject mean from the grand mean (X̄_subject − X̄) and the squared deviation (X̄_subject − X̄)²]
To get the subjects sum of squares, we add together all the squared deviation scores,
which sum to 19.08. So the subjects sum of squares is 19.08.
Finally, we can use the formula to determine the within sum of squares:
Source | SS | df | MS | F
Between | 352.90 | 2 | 176.45 | 102.74
Subjects | 19.08 | 4 | 4.77 | 2.77
Within | 13.74 | 8 | 1.72 |
Total | 385.72 | 14 | |
For the Subjects F ratio, looking at the critical value
table, we find that the critical value is 3.84. Our calculated value of 2.77 is less than the
critical value, so we fail to reject the null hypothesis and conclude there was no signifi-
cant difference between participants in culturally responsive teaching.
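The omega squared estimate discussed next can be reconstructed by plugging the source table values into the formula from earlier chapters (our arithmetic, using the rounded values above):

$$\omega^2 = \frac{SS_{between} - (df_{between})(MS_{within})}{SS_{total} + MS_{within}} = \frac{352.90 - (2)(1.72)}{385.72 + 1.72} = \frac{349.46}{387.44} = .902$$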
This would be interpreted in the same way as previous omega squared estimates.
About 90% of the variance in culturally responsive teaching practices was explained by
the difference between pre-test, post-test, and the six-month follow-up.
Eta squared
The problem with the formula for omega squared in this case is that jamovi will not pro-
duce the subjects or total sources of variance in the output. The denominator in omega
squared calls for the total sum of squares. However, jamovi does not print that sum of
squares in the output, and we cannot calculate it indirectly because it also does not pro-
duce the subjects sum of squares. As a result, we must use a different effect size estimate.
The estimate we will use in those cases is partial eta squared.
Eta squared has several advantages and disadvantages compared with omega squared.
Perhaps the biggest advantage is that jamovi will calculate partial eta squared for us in
this design. We will illustrate the process for calculating partial eta squared, but it is
not necessary to hand calculate this statistic. Another advantage of using eta squared in
this design is that it is interpreted essentially identically to omega squared. We will still
interpret this statistic as a proportion of variance explained. However, eta squared also
has disadvantages. These are very concisely summarized in Keppel and Wickens (2004).
One major disadvantage of eta squared is that it almost always overestimates the true
effect size. It does so because it does not account for sample size or subject variation in its
formula. As a result, eta squared is also an estimate for the sample, and makes no attempt
to estimate population effect size, unlike omega squared.
All of that being said, the formula for eta squared in this design is:
$$\eta_p^2 = \frac{SS_{between}}{SS_{between} + SS_{within}}$$
Notice this formula does not adjust the numerator for sample size or error, and the
denominator does not consider subjects or total variation. As a result, this will usually
produce a larger effect size estimate than would omega squared. In fact, the only cases
in which eta squared will not overestimate the effect size is in those cases where it equals
omega squared. It is important, when interpreting this statistic, to keep in mind that it is
usually an overestimate. In our example:
$$\eta_p^2 = \frac{SS_{between}}{SS_{between} + SS_{within}} = \frac{352.90}{352.90 + 13.74} = \frac{352.90}{366.64} = 0.963$$
So, this would be interpreted as indicating that about 96% of the variance in culturally
responsive teaching practices was explained by the difference between pre-test, post-
test, and six-month follow-up. This is a larger effect size estimate than we obtained with
omega squared. It has overestimated the proportion of variance explained by about 6%,
and partial eta squared will almost always overestimate in this way. Still, it is a usable
effect size estimate that most researchers will default to in within-subjects designs.
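As a check on the hand calculations, this brief Python sketch partitions the variance for the chapter's example data and computes partial eta squared (small rounding differences from the values above are expected):

import numpy as np

# Rows are participants; columns are pre-test, post-test, six-month follow-up
scores = np.array([[8, 17, 14],
                   [6, 18, 16],
                   [9, 20, 15],
                   [4, 17, 14],
                   [7, 19, 18]], dtype=float)
n_subjects, k = scores.shape
grand = scores.mean()

ss_total = ((scores - grand) ** 2).sum()                              # 385.73
ss_between = (n_subjects * (scores.mean(axis=0) - grand) ** 2).sum()  # 352.93
ss_subjects = (k * (scores.mean(axis=1) - grand) ** 2).sum()          # 19.07
ss_within = ss_total - ss_between - ss_subjects                       # 13.73

df_between, df_subjects = k - 1, n_subjects - 1
f_between = (ss_between / df_between) / (ss_within / (df_between * df_subjects))
print(f_between)                              # 102.8 (jamovi reports 102.796)
print(ss_between / (ss_between + ss_within))  # partial eta squared = 0.963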
The pairwise comparisons in this design work much like post-hoc tests in a one-way ANOVA. They will compare each pair of levels on the within-subjects
variable. In our example, that would involve comparing pre-test to post-test, pre-test
to six-month follow-up, and post-test to six-month follow-up. Much like we did with
post-hoc tests, we will examine the pairwise comparisons to determine which pairs are
significantly different. We are not demonstrating the calculations for these comparisons,
and will rely on the jamovi software to produce them.
To assess for normality, we will use the same procedure as we did in the paired samples
t-test. We will copy all of the data across the three levels of the within-subjects variable
into a new variable, so that we can test the entire set of dependent variable data for nor-
mality together.
Then, under Analyses, we’ll click Exploration → Descriptives, and then select the new
variable and move it into the Variables column. We then select skewness and kurtosis
under Statistics, and uncheck the other options we won’t need.
We can then evaluate the output to determine if the data were normally distributed.
The data can be considered normally distributed if the absolute value of skewness is less
than two times the standard error of skewness, and if the absolute value of kurtosis is less
than two times the standard error of kurtosis. In this case, .595 is less than two times .580
(which would be 1.16), and 1.143 is less than two times 1.121 (which would be 2.242), so
the data were normally distributed.
Next, to produce the within-subjects ANOVA, we will go to Analyses → ANOVA →
Repeated Measures ANOVA.
This menu works a bit differently from those we’ve seen before. At the top, “RM Factor
1” is the default name jamovi gives to our independent variable. By clicking on that label,
we can type in a more descriptive name, perhaps in this case something like, “Time.” By
default, it has created two levels, but we can add as many as we need. We can leave them
as "Level 1," "Level 2," and so on, or we can rename by clicking and typing a new label,
like "Pre-test," "Post-test," and "6-month follow-up." Notice that to label the third "level,"
simply click on the grey “Level 3.” If we had more than three levels to the independent
variable, we could continue adding them in this way. Next, click on the variables on the
left, and use the arrow key to move them to the appropriate “Repeated Measures Cells”
on the right (matching up the labels to the correct variables).
Moving down the menu, we can then check "partial η²" to produce the effect size esti-
mate. Under Assumption Checks, we can click “Sphericity tests.” Note that if we were to
fail the assumption of sphericity, the checkbox to add the Greenhouse-Geisser correc-
tion is here as well. You may also notice a box that can be checked to produce Levene’s
test. This menu can be used for multiple different analyses, including those that have
a between-subjects variable as part of the design. Levene’s test only works when there
is a between-subjects independent variable, so that option is not useful for the current
design.
Next, under Post Hoc Tests, we can produce the pairwise comparisons. We’ll select our
within-subjects independent variable, here labelled “Time” and click the arrow button
to select it for analysis (moving it to the box on the right). We can then check the box
for which correction we’d like to use below. For now, we’ll leave Tukey selected.
Because this program is more general, meaning it can run many different repeated
measures analyses, it produces some output that we do not need and have not yet learned
how to interpret. We’ll mention these as we move through the output. We’ll start about
halfway down the output under Assumptions to check the assumption of sphericity.
We see that W = .525, p = .380. Because p > .050, we fail to reject the null. Remember that
with Mauchly’s test, the null hypothesis is that the pairwise error variances are equal. In
other words, the null is that sphericity was met. So here, we fail to reject the null, mean-
ing the data met the assumption of sphericity. As a result, there is no need for adding
the Greenhouse-Geisser correction. Next, we’ll look at the Repeated Measures ANOVA
output.
Here, the Between term is labelled “Time” and the Within or Error term is labelled
“Residual.” We see that F at 2 and 8 degrees of freedom is 102.796, and p is < .001. So,
there was a significant difference based on time (pre- versus post- versus 6-month fol-
low-up). We also see that partial eta squared is .963, which would mean that about 96%
of the variance in scores was explained by time. Note that this is an absurdly high effect
size estimate—they would typically be much smaller, but these are made up data to illus-
trate the analysis.
The next piece of output we’ll look at is “Post Hoc Tests.” Here are the results of the
Tukey pairwise comparisons.
We here see a significant difference between pre-test and post-test (p < .001), a signifi-
cant difference between pre-test and the six-month follow-up (p < .001), and a significant
difference between post-test and the six-month follow-up (p = .023). The jamovi output
also includes t and degrees of freedom for these comparisons. It would be appropriate to
include those in the results section, though it is less typical to see them there. One of the
reasons they may be less commonly reported for the follow-up analysis (and frequently,
only p is reported) is that other popular software packages like IBM SPSS produce only
the probability values for the pairwise comparisons. Looking at the descriptive statistics
(which we can produce as we've done in prior chapters in the Analyses → Exploration → Descriptives
menu), we can see that scores were higher at post-test and at the six-month follow-up
than at pre-test. We also see that scores were higher at post-test than at the six-month
follow-up. So the pattern is that teachers scored higher in culturally responsive teaching
after the workshop and had a decline from post-test to the six-month follow-up. How-
ever, scores were still higher at six months than they were at pre-test.
Results
In the next chapter, we will explore some examples from published research using the
within-subjects ANOVA.
15
Within-subjects ANOVA case studies
In the previous chapter, we explored the within-subjects ANOVA using a made-up exam-
ple and some fabricated data. In this chapter, we will present several examples of published
research that used the within-subjects ANOVA. For each sample, we encourage you to:
1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the within-subjects ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis.
The study included a follow-up eight weeks after the post-test. The follow-up test is important to this design
because it allowed the researchers to assess whether any differences found at post-test
were maintained after the program ended.
Research questions
The authors asked several research questions; in this case study, however, we will focus on
one: Was psychological distress, as measured by the global severity index, significantly
different at post-test and the follow-up than it was before the mindfulness program?
Hypotheses
The authors hypothesized the following related to the global severity index:
H0: There was no significant difference in global severity index scores between the
pre-test, post-test, and follow-up. (Mpre = Mpost = Mfollow-up)
H1: There was a significant difference in global severity index scores between the pre-
test, post-test, and follow-up. (Mpre ≠ Mpost ≠ Mfollow-up)
Notice that, although the authors theorized that scores would improve at post-test and
at the follow-up, the formal hypotheses do not specify a direction. The ANOVA design
doesn’t allow any specification of directionality in the omnibus test.
Write-up
Results
(η² = .390). There was no significant difference between pre-test and post-test
scores (p = .368), but there was a significant difference between pre-test and
follow-up scores (p = .011) and between post-test and follow-up scores
(p < .001). Global severity index scores were significantly lower at follow-up
(M = 8.500, SD = 5.910) than they were at either pre-test (M = 11.930, SD =
7.790) or post-test. This may suggest that mindful-
ness is associated with longer-term reductions in psychological distress.
Research questions
In the larger study, the researchers have several research questions. In this case study,
we focus on one: Were there differences in performance across the four exams in these
introductory psychology courses?
Hypotheses
The authors hypothesized the following related to exam scores:
H0: There was no significant difference in exam scores across the four exams.
(M1 = M2 = M3 = M4)
H1: There was a significant difference in exam scores across the four exams.
(M1 ≠ M2 ≠ M3 ≠ M4)
In the full article, they conduct additional analyses of exam scores, but here we focus on
their use of the within-subjects ANOVA.
Write-up
Results
Scores were significantly higher on the fourth exam compared to any of the other three exams. Scores were also
higher on the third exam compared to the first exam.
Table 15.1
Descriptive Statistics for Exam Scores
Exam M SD
To add the table, we would start on a new page after the references page, and insert one
table per page.
Finally, we encourage you to compare these example results sections to the published
papers. The statistics will not match exactly due to the way we simulate data for the online
course practice data sets. But how did the authors present the results? Why might they
have presented results differently than our standard layout would have presented them?
In the next chapter, we will introduce the mixed ANOVA. This design will add another
layer to our analysis, allowing us to include one between-subjects variable and one with-
in-subjects variable.
For additional case studies, including example data sets, please visit the textbook
website for an eResource package, including specific case studies on race and racism in
education.
Notes
1 This value will not match the published results exactly. Because of the process used to simu-
late data for the online course page, it will not exactly match for within-subjects designs. The
authors’ published results are not in question here, but our simulated outcomes are slightly
different.
2 Again note that this value will not match published values exactly. That is an artifact of the
way that we have simulated data for the online course resources to allow students to practice
the analysis, not a commentary on the published results.
3 This, again, will not match the published results exactly due to the simulation process for the
practice data in the online resources. The authors report a “marginally significant” differ-
ence from pre-test to post-test. That is a way that some researchers will describe differences
between p = .05 and .10. However, we discourage the use of this criterion as significance is
already a fairly low bar for determining differences, and raising it to allow discussion of nonsig-
nificant differences will create unacceptable inflation of Type I error.
4 Remember that our calculated values will differ from the published values as an artifact of the
way we simulated data for the online course practice datasets.
16
Mixed between- and within-subjects
designs using the mixed ANOVA
In this book so far, we have explored between-subjects designs, including the independ-
ent samples t-test, the one-way ANOVA, and the factorial ANOVA. We also discussed
within-subjects designs, including the paired samples t-test and the within-subjects
ANOVA. Now we will put these together in a design that has one between-subjects and
one within-subjects variable.
For example, the within-subjects variable might be time (a pre-test and a post-test). The mixed ANOVA will allow us to test for an interaction of
the between-subjects and the within-subjects variable. In simple terms, we might want
to know if the change from pre-test to post-test is different between the two groups. The
mixed ANOVA will provide a means to test such a question.
[Figure: two example interaction plots, one labelled Ordinal and one labelled Disordinal, each showing Group 1 and Group 2 lines across Time 1 and Time 2; in the ordinal plot the lines do not cross, while in the disordinal plot they do]
As with the other designs in this text, the adequacy of the sampling strategy matters because the
more biased the sample was, the less generalizable the results will be. The design also assumes
random assignment to groups on the between-subjects variable.
That assumption matters if the goal is causal inference. Experimental design (random
assignment to groups) is not sufficient for causal inference, but it does satisfy one of the
requirements of causal inference. Namely, experimental design helps to isolate the effect
of the independent variable from other potential causal factors. To establish a causal
claim, we would also need to establish a rationale for the causal claim (why would this
independent variable cause this dependent variable?), demonstrate a reliable relation-
ship between the variables (which usually means multiple samples and studies showing
a consistent result), and the causal factor would need to precede the outcome in time
(e.g., the cause has to come before the effect, which usually means longitudinal design).
If the goal is not causal inference, the assumption of random assignment to groups is less
important. As we have discussed in prior chapters, many variables we are interested in
researching cannot practically, legally, or ethically be randomly assigned.
Homogeneity of variance
This design involves a between-subjects variable, so the assumption of homogeneity of
variance applies. It works a bit differently, though, because of the presence of the with-
in-subjects variable. We will produce a Levene’s test for each level of the within-subjects
variable. So, if there are a pre-test and a post-test, we will test for equality of variance
between the groups among pre-test data, and then a second test for equality of variance
between the groups among post-test data. Because of this, jamovi will print multiple Lev-
ene’s test results in this design. For the assumption to be considered met, all of the Levene’s
tests should be passed. Remember that, for Levene’s test, the null hypothesis is that the
variances are equal (homogeneity of variance), so when p > .05, the assumption is met.
Sphericity
Because there is a within-subjects variable, the assumption of sphericity may also apply.
Remember that the assumption did not apply1 in paired samples t-test because that
design involves only two levels on the within-subjects variable. Sphericity is the assump-
tion that the pairwise error variances are equal, so when there are only two levels to the
within-subjects variable, there is only one pair, so the pairwise error variance cannot be
compared. This is an important thing to remember in this design, because jamovi pro-
duces Mauchly’s test only if we ask it to (by checking the box under Assumption Checks).
If there are only two levels of the within-subjects independent variable, the assumption
does not apply, so there is no need for Mauchly’s test. But if there are more than two lev-
els of the within-subjects variable, then we must produce Mauchly’s test, and the result
of Mauchly’s test needs to be nonsignificant (because the null hypothesis for Mauchly’s
test is sphericity or the equality of the pairwise error variances).
Part of the issue is that jamovi does not have all of the sources of variance we would hand
calculate. However, there will be three effects of interest:
1. The interaction effect. The mixed ANOVA will produce an interaction term,
which, in this case, is the interaction of the between-subjects and within-subjects
independent variables. If there is a significant interaction, the entire focus of our
interpretation will be on the interaction. In other words, if there is a significant
interaction, we would ignore the other two effects in most cases.
2. The between-subjects effect. The analysis will also produce a test for between-subjects
effects. That is the main effect of the between-subjects independent variable. In other
words, this tests whether there was a difference between groups, disregarding the
within-subjects variable. If the interaction is not significant, but the between-sub-
jects effect is significant, then we would interpret any group differences. If there
are only two groups, then the interpretation will be based on the means of each
group. When there are only two groups, this effect becomes conceptually the same
as an independent samples t-test—no follow-up analysis is needed. But if there is no
interaction, a significant between-subjects difference, and more than two groups,
then it is appropriate to use a post-hoc test. In that situation, the between-subjects
effect is conceptually the same as a one-way ANOVA, and any post-hoc tests will be
interpreted just like they would be in a one-way ANOVA design.
3. The within-subjects effect. The mixed ANOVA also produces a test of within-sub-
jects differences, disregarding the between-subjects variable. If the interaction is not
significant, we would interpret this effect, in addition to the between-subjects effect.
If there is a significant within-subjects difference, and there are only two groups,
this is interpreted like a paired samples t-test. No follow-up analysis would be nec-
essary—simply interpret the mean difference. However, if it is significant and there
are more than two groups, then the test is conceptually the same as a within-sub-
jects ANOVA. So, the appropriate follow-up would be to use pairwise comparisons.
Group Pre-Test Post-Test
LGBTQ 6 5
LGBTQ 5 4
LGBTQ 7 6
LGBTQ 5 3
LGBTQ 6 4
Cisgender/Heterosexual 3 3
Cisgender/Heterosexual 4 3
Cisgender/Heterosexual 3 4
Cisgender/Heterosexual 2 3
Cisgender/Heterosexual 2 1
In reality, for this design, like the others covered in this text, the ideal sample size is at
least 30 per group. This design is a 2x2 mixed ANOVA because there are two within-sub-
jects levels (pre- and post-) and two groups (LGBTQ and cisgender/heterosexual), and
we would want at least 60 participants.
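For readers who want to check this example outside of jamovi, one option (an assumption on our part, not something used elsewhere in this book) is the third-party Python package pingouin, whose mixed_anova function takes the data in long format:

import pandas as pd
import pingouin as pg  # pip install pingouin

# The example data above, reshaped to long format
df = pd.DataFrame({
    "subject": list(range(10)) * 2,
    "group": (["LGBTQ"] * 5 + ["Cis/Het"] * 5) * 2,
    "time": ["Pre"] * 10 + ["Post"] * 10,
    "bullying": [6, 5, 7, 5, 6, 3, 4, 3, 2, 2,    # pre-test scores
                 5, 4, 6, 3, 4, 3, 3, 4, 3, 1],   # post-test scores
})

# The Interaction row should match the jamovi output discussed below
print(pg.mixed_anova(data=df, dv="bullying", within="time",
                     subject="subject", between="group"))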
To begin, we’ll set up the data file. In the Data tab, using the Setup button, we can
specify variable names. We’ll need three variables—one for pre-test data, one for post-
test data, and one for group membership. We can enter our data, and then using the
Setup button for our grouping variable, label the two groups.
Before running the primary analysis, we would need to evaluate the assumption of nor-
mality, using the same process we described in Chapter 14. Next, to run the analysis, we
will go to the Analyses tab, then ANOVA, then Repeated Measures ANOVA. Notice this
is the same menu we used to produce the within-subjects ANOVA in Chapter 14. The
resulting menu is the same as it was in Chapter 14. We can name the repeated measures
factor, which by default is labelled RM Factor 1. Here we have changed that to Time.
Next, we can name the levels of the within-subjects variable, which here are Pre-test and
Post-test. Then we can specify in the “Repeated Measures Cells” which variables corre-
spond to those levels (here, Pre- and Post-). Finally, in the only step that differs from what
we did in Chapter 14, we move Group to the “Between Subject Factors” box.
Next, under Effect Size, we will select partial eta squared. Under “Assumption Checks”
we will select the “Equality of variances test (Levene’s)”. In this case, we do not need to
select the sphericity test (which would produce Mauchly’s test) because there are only
two levels of the within-subjects variable. But the option to produce it is here, as are the
corrections if we failed that assumption.
At this point, we can pause to look at the output before deciding on follow-up procedures.
If the interaction was not significant, we would use the follow-up procedures described in
Chapter 14 if the main effect of the within-subjects variable was significant, or those
from Chapter 6 or 8 if the main effect of the between-subjects variable was significant
(depending on how many groups that variable had). First, though, we should evaluate
the assumption of homogeneity of variance.
Here, we see that the assumption is met for both pre-test and post-test data because p
> .050 in both cases. As we mentioned in previous chapters, the odd notation for the
F ratio associated with Levene’s test for the pre-test data is scientific notation, which is
equivalent to 1.285(10-29) or .00000000000000000000000000001285. As we previously
noted, we would likely report this as F1, 8 < .001, p > .999. However, in most cases, because
the assumption is met, we would not need to report this value in the manuscript. For
post-test, p = .713 which is also > .050, so the assumption was met. So, we next turn to
the ANOVA results.
We first interpret the interaction, here shown as Time * Group. There was a significant
interaction (F(1, 8) = 7.538, p = .025). About 49% of the variance in experiences of bullying
was explained by the combination of time (pre-test vs. post-test) and group (LGBTQ vs.
cisgender/heterosexual; η²p = .485). Because the interaction was significant, we would not
interpret either of the main effects, except for in exceptional circumstances where the
research questions require us to do so. Just like with the factorial ANOVA, in the mixed
ANOVA if there is a significant interaction, we focus all of our interpretation on the
interaction. The presence of an interaction means that neither independent variable can
be adequately understood in isolation.
However, to make sure it is clear how to find and interpret the two main effects, we
will briefly describe them here. Again, this would not really be done in this case because
the interaction was significant. However, the main effect of the within-subjects varia-
ble is, in this case, on the line marked Time. So, there was also a significant difference
between pre- and post-test scores (F(1, 8) = 7.538, p = .025, η²p = .485).² The main effect
for the between-subjects variable is in the next table down, marked Between Subjects
Effects, on the row labelled Group. From that, we can determine that there was a sig-
nificant difference between LGBTQ and cisgender/heterosexual students' scores (F(1, 8) =
16.277, p = .004, η²p = .485). Again, because the interaction was significant, we would
not interpret the main effects in this example but wanted to briefly demonstrate where
to find them. The follow-up procedures for a significant main effect when there was no
significant interaction are discussed earlier in this chapter.
Going back to the analyses options (which you can return to simply by clicking any-
where in the output for the analysis you want to change), there is an additional option
we will use to produce a plot of the group means, which helps us determine the type of
interaction. To do so, under Estimated Marginal Means, we will drag Time and Group
into the space for Term 1. By default, the box for Marginal means plots will already
be checked. Also, by default, under Plot, the option for Error bars is set to Confidence
Interval. That may work well, or we might want to change it to None for a somewhat
cleaner-looking plot.
[Plot: Time * Group estimated marginal means, with separate lines for the LGBTQ and Cisgender/Heterosexual groups across Pre-test and Post-test]
From this plot we can determine that the interaction was ordinal, as discussed in Chap-
ter 10, because the lines do not cross. Next, we will need a follow-up analysis to deter-
mine how the cells differ from one another.
This output includes a test of each cell versus every other cell. This differs from the simple
effects analysis, which would have produced a comparison of LGBTQ vs. cisgender/het-
erosexual participants at pre-test and again at post-test (see Strunk & Mwavita, 2020 for
further discussion of that analysis and how it is produced in other software packages).
But, if we want to do the same thing, all the information is here. Pre-test LGBTQ versus
pre-test cisgender/heterosexual is the first line of the output, and we see a significant
difference (p = .003). Post-test LGBTQ versus post-test cisgender/heterosexual is the last
line, and that comparison was not significant (p = .104). From those two test statistics
combined with the plot, we can conclude that LGBTQ students experienced significantly
more bullying than cisgender/heterosexual students at pre-test; after the intervention,
however, there was no significant difference between the two groups. That would tend to
suggest the intervention was effective.
Using jamovi’s output lets us make other comparisons as well. For example, we could
also test whether bullying experiences changed from pre- to post-test for LGBTQ students,
which is found on the second line of output. We see a significant difference (p = .019).
We could also ask if there were changes in bullying experiences from pre- to post-test for
cisgender/heterosexual students. That is found on the fifth line of output (second to last)
and we see no significant difference (p > .999). This has produced a total of six compari-
sons, and so far we have interpreted four of them. The two we have not interpreted are
the cross-group, cross-time comparisons (e.g., pre-test scores for LGBTQ students versus
post-test scores for cisgender/heterosexual students). Those comparisons do not make a
lot of sense to interpret in this case. In fact, we would encourage only interpreting the
comparisons necessary to answer the research question. As we have discussed in previous
chapters, conducting and interpreting many different comparisons has the potential to
increase Type I error rates.
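For readers who want to see what a correction for that inflation does to a family of comparisons, here is a minimal sketch of the Bonferroni adjustment in Python. The six p values are stand-ins patterned loosely on our example, and we assume they are unadjusted:

```python
# Bonferroni adjustment for a family of pairwise comparisons: multiply each
# p value by the number of comparisons in the family (capping at 1.0).
p_values = [0.003, 0.019, 0.240, 0.450, 0.999, 0.104]  # hypothetical, unadjusted
k = len(p_values)
adjusted = [min(p * k, 1.0) for p in p_values]
for p, adj in zip(p_values, adjusted):
    print(f"unadjusted p = {p:.3f} -> Bonferroni-adjusted p = {adj:.3f}")
```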
For our above example, we’ll provide a sample Results section using the comparison by
groups.
1. What test did you use, and why?
We used a mixed ANOVA to determine if student reports of bullying signifi-
cantly differed across the interaction of LGBTQ students versus cisgender/het-
erosexual students, and from before (pre-test) to after (post-test) an anti-bullying
intervention.
2. Note any issues with the statistical assumptions, including any corrections that
were needed.
No issues with the statistical assumptions were found. Homogeneity of variance was
met, and sphericity did not apply. Note that, in this example, to test normality, we
would need to combine the scores from pre- and post-test data, just like we did
with the within-subjects ANOVA in Chapter 14.
3. What is the result of the omnibus test?
There was a significant difference based on the interaction (F(1, 8) = 7.538, p = .025).
4. If significant, report and interpret the effect size. (If nonsignificant, report effect
size alongside F and p in #3.)
About 49% of the variance in bullying was explained by the combination of student
group (LGBTQ versus cisgender/heterosexual) and time (pre-test versus post-test;
η² = .485).

Table 16.1
Descriptive Statistics by Group

                         Pre-Test            Post-Test
Group                    M        SD         M        SD
Notes
1 Technically, the assumption applied but was automatically met. We often describe the
assumption of sphericity as only applying when there are more than two levels of the within-
subjects independent variable, but actually it is simply always met when there are only two
levels. Because sphericity is the assumption that the pairwise error variances are equal, and
with only two levels on the within-subjects independent variable there is only one possible
‘pair’, the assumption is automatically met (there is no other pair to which to compare). So, it’s
not technically true that the assumption does not apply, but it is an easy way to think about
why we don’t need Mauchly’s test if there are only two levels to the within-subjects indepen-
dent variable.
2 In this case, the test statistics for the within-subjects variable and the interaction are
identical. This is unusual and is caused by the way that we 'made up' data for use in the
example. We note it here in case it might initially seem like an error or appear curious;
it is just an artifact of how the data were created for this example.
17
Mixed ANOVA case studies
In the previous chapter, we explored the mixed ANOVA using a made-up example and
some fabricated data. In this chapter, we will present several examples of published
research that used the mixed ANOVA. For each example, we encourage you to:
1. Use your library resources to find the original, published article. Read that article
and look for how they use and talk about the mixed ANOVA.
2. Visit this book’s online resources and download the datasets that accompany this
chapter. Each dataset is simulated to reproduce the outcomes of the published
research. (Note: The online datasets are not real human subjects data but have been
simulated to match the characteristics of the published work.)
3. Follow along with each step of the analysis, comparing your own results with what
we provide in this chapter. This will help cement your understanding of how to use
the analysis. (For readers who also want to check results outside of jamovi, a worked
sketch follows this list.)
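As a supplement to the steps above, the following Python sketch computes a 2 (group) by 2 (time) mixed ANOVA by hand with numpy and scipy. The scores in it are made up purely for illustration; substitute a dataset from the online resources and compare the resulting source table to jamovi's output:

```python
# A by-hand 2 (group) x 2 (time) mixed ANOVA, using numpy and scipy only.
import numpy as np
from scipy import stats

# rows = subjects, columns = [pre-test, post-test]
scores = np.array([
    [5, 4], [6, 4], [5, 3], [7, 5], [6, 4],   # group A (made-up values)
    [4, 4], [3, 3], [4, 4], [3, 4], [4, 3],   # group B (made-up values)
], dtype=float)
group = np.array([0] * 5 + [1] * 5)            # between-subjects factor

n_subj, k_time = scores.shape
levels = np.unique(group)
g = len(levels)
grand = scores.mean()

ss_total = ((scores - grand) ** 2).sum()

# Between-subjects partition: group effect plus subjects-within-groups error
subj_means = scores.mean(axis=1)
ss_subjects_total = k_time * ((subj_means - grand) ** 2).sum()
ss_group = sum(k_time * (group == lv).sum() * (scores[group == lv].mean() - grand) ** 2
               for lv in levels)
ss_subj_within = ss_subjects_total - ss_group

# Within-subjects partition: time, interaction, and the within-subjects error
time_means = scores.mean(axis=0)
ss_time = n_subj * ((time_means - grand) ** 2).sum()
ss_cells = sum((group == lv).sum() * (scores[group == lv, t].mean() - grand) ** 2
               for lv in levels for t in range(k_time))
ss_interaction = ss_cells - ss_group - ss_time
ss_error = ss_total - ss_subjects_total - ss_time - ss_interaction

df_group, df_sw = g - 1, n_subj - g
df_time = k_time - 1
df_int = df_group * df_time
df_err = df_sw * df_time

f_group = (ss_group / df_group) / (ss_subj_within / df_sw)
f_time = (ss_time / df_time) / (ss_error / df_err)
f_int = (ss_interaction / df_int) / (ss_error / df_err)

for name, f, d1, d2 in [("Group", f_group, df_group, df_sw),
                        ("Time", f_time, df_time, df_err),
                        ("Time * Group", f_int, df_int, df_err)]:
    print(f"{name}: F({d1}, {d2}) = {f:.3f}, p = {stats.f.sf(f, d1, d2):.3f}")
```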
In this article (Kanamori, Harrell-Williams, Xu, & Ovrebo, 2019), the researchers report on a method of measuring implicit bias against
transgender people. To do so, they use a technique called the affect misattribution pro-
cedure. Participants received scores for neutral primes (where they reacted to neutral
words like “relevant,” “green,” and “cable”), and also reacted to a set of transgender
primes (where they reacted to words like “transgender” and “transman”). The procedure
is relatively involved, but in brief, participants were presented with a word from one of the prime sets,
then ambiguous stimuli (in this case, Chinese language symbols, which no participants
were familiar with), after which they rated the “pleasantness” of the ambiguous stimuli.
Their question was whether there would be a difference in ratings based on the priming
words (neutral versus transgender), with all participants completing both sets of ratings,
and on whether the participant had regular contact with transgender people (yes or no)
as a between-subjects variable.
Research questions
The authors present several questions in the full paper, but here we focus on one research
question: Was there a significant difference in ratings based on the interaction of prime
type (transgender versus neutral primes) and whether the participant had regular con-
tact with transgender people?
Hypotheses
The authors hypothesized the following related to ratings:
H0: There was no difference in ratings based on the interaction of prime type (transgender
versus neutral) and contact with transgender people.
(MTrans×Contact = MTrans×NoContact = MNeutral×Contact = MNeutral×NoContact)
H1: There was a difference in ratings based on the interaction of prime type (transgender
versus neutral) and contact with transgender people.
(MTrans×Contact ≠ MTrans×NoContact ≠ MNeutral×Contact ≠ MNeutral×NoContact)
Write-up
Results
Table 17.1
Descriptive Statistics for Ratings
The table would go on a new page after the references page, with one table per page. In
this case, a figure is probably unnecessary because there is no interaction to visualize.
The second case study draws on Shannonhouse, Lin, Shaw, Wanna, and Porter (2017),
who evaluated a suicide intervention training program for college staff, comparing an
experimental (trained) group with a control group before and after the training.
Research questions
Related to attitudes about suicide, the research question was: Was there a difference in
attitudes about suicide based on the interaction of pre- versus post-test and placement in
the experimental versus control groups?
Hypotheses
The authors hypothesized the following related to attitudes about suicide:
H0: There was no difference in attitudes about suicide based on the interaction of pre-
versus post-test and placement in the experimental versus control groups.
(MPre×Control = MPre×Intervention = MPost×Control = MPost×Intervention)
H1: There was a difference in attitudes about suicide based on the interaction of pre-
versus post-test and placement in the experimental versus control groups.
(MPre×Control ≠ MPre×Intervention ≠ MPost×Control ≠ MPost×Intervention)
Write-up
Results
Table 17.2
Descriptive Statistics for Attitudes about Suicide
Test Group M SD N
[Figure: interaction plot of attitudes about suicide (y-axis, approximately 15 to 25) across Time (Pre-test, Post-test) for the Control and Experimental groups.]
The table would be placed after the references page, on a new page, with one table per
page.
In this case, it is probably also appropriate to include a figure to allow readers to visualize
the interaction. The figure would go on a new page after any tables.
In concluding this final chapter of case studies, we want to restate our advice to read
the studies on which these cases are based. Pay attention to how different authors in
different fields and different publications put an emphasis on varying aspects of the anal-
ysis, use different terminology, or write about the analyses differently. Also notice that
many of the authors have written about their work in a more aesthetically pleasing or
creative manner than we have. Our Results sections follow closely the outlines we’ve sug-
gested in the analysis chapters, but it is clear from reading the published work that there
are many ways to write, some of which may be easier to read.
Note
1 As a reminder, these values from the online course dataset will not match the published work
exactly because our method of simulating the data cannot precisely replicate the published
study for within-subjects designs. However, the pattern of differences, means, and standard
deviations will be the same, and this should not be read as casting doubt on the published
results.
Part V
Considering equity in quantitative
research
18
Quantitative methods for
social justice and equity
Even a superficial review of the research cited in policy briefs, produced by and for U.S.
federal agencies, and referred to in public discourse would reveal that the vast major-
ity of that research is quantitative. In fact, some federal agencies have gone so far as to
specify that quantitative methods, and especially experimental methods, are the gold
standard in social and educational research (Institute for Education Sciences, 2003). In
other words—those with power in policy, funding, and large-scale education initiatives
have made explicit their belief that quantitative methods are better, more objective, more
trustworthy, and more meritorious than other methodologies.
Visible in the national and public discourses around educational research is the nat-
uralization of quantitative methods, with other methods rendered as exotic or unusual.
In this system, quantitative methods take on the tone of objectivity, as if the statistical
tests and theories are some sort of natural law or absolute truth. This is in spite of the
fact that quantitative methods have at least as much subjectivity and rocky history as
other methodologies. But because they are treated as if they were objective and without
history, quantitative methods have a normalizing power, especially in policy discourse.
In part because of that normalization, quantitative methods are also promising for use
in research for social justice and equity. The assumption that these methods are superior,
more objective, or more trustworthy than qualitative and other methodologies can be
a leverage point for those working to move educational systems toward equity. Several
authors (e.g., Strunk & Locke, 2019) have written about specific approaches to and appli-
cations of quantitative methods for social justice and equity, but our purpose in this
chapter is to more broadly review the practical and theoretical considerations in using
quantitative methods for equitable purposes. We begin by exploring the ways in which
quantitative methods are not, in fact, neutral given their history and contemporary uses.
We then describe the ways that quantitative methods operate in hegemonic ways in
schools and broader research contexts. Next, we examine the potential for dehumani-
zation in quantitative methods and how researchers can avoid those patterns. We then
offer practical considerations for doing equitable quantitative research and highlight the
promise of quantitative work in social justice and equity research.
Reducing the richness of experiences to numbers, quantities, and scales distances researchers
from participants and from the inferences drawn from their experiences. Moreover, researchers must make
difficult decisions about the creation of categorical variables. While many students and
established scholars alike default to federally defined categories (like the five federally
defined racial categories of White, non-Hispanic; Black, non-Hispanic; Hispanic; Asian;
or Native American), those categories are rarely sufficient or appropriate. Researchers,
such as Teranishi (2007), have pointed out the problems created by these overly simplis-
tic categories and of the practice of collapsing small categories together. When catego-
ries are not expansive enough, or when they are combined into more generic categories
for data analysis, much of the variation is lost. Moreover, asking participants to select
identity categories with which they do not identify can, in and of itself, be oppressive.
Thinking carefully about the identities of research participants and how to present sur-
vey options is an important step in humanizing quantitative research.
Many times, researchers simply throw demographic items on the end of a survey with-
out much consideration for how those items might be perceived or even how they might
use the data. We suggest that researchers only ask for demographic data when those
data are central to their analysis. In other words, if the research questions and planned
analyses will not make use of demographic items, consider leaving them out completely.
If those items are necessary, researchers should carefully consider the wording of those
items. One promising practice is to simply leave response options open, allowing partic-
ipants to type in the identity category of their choice. For example, rather than providing
stock options for gender, researchers can simply ask participants their gender and allow
them to type in a freeform response. One issue with that approach is that it requires more
labor from researchers to code those responses into categories. However, that labor is
worthwhile in an effort to present more humanizing work. Researchers might also find
categories they did not consider are important to participants, enriching the analysis.
In some cases, it is impractical to hand-code responses. This is particularly true in
large-scale data collection, where there might be thousands of participants. It might also
be difficult when the study is required to align with institutional, sponsor, or governmen-
tal data. For example, it is common for commissioned studies to be asked to determine
“representativeness” by comparing sample demographics to institutional or regional sta-
tistics. In such cases, a strategy that might be useful is to allow the open-response demo-
graphic item, followed by a forced choice item with the narrower options. In our work,
we have used the phrasing, “If you had to choose one of the following options, which one
most closely matches your [identity]?” Doing so allows for meeting the requirements of
the study, while also allowing more expansive options for use in subsequent analyses.
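To make the open-response-plus-forced-choice strategy concrete, here is a brief sketch using Python's pandas. The column names, item wording, and category labels are ours, for illustration only:

```python
# Sketch of pairing an open-ended gender item with a forced-choice item.
# Column names and categories are hypothetical, for illustration only.
import pandas as pd

df = pd.DataFrame({
    # open-ended responses, kept verbatim
    "gender_open": ["nonbinary", "woman", "trans man", "genderfluid", "man"],
    # forced-choice item: "If you had to choose one of the following options..."
    "gender_forced": ["Another gender", "Woman", "Man", "Another gender", "Man"],
})

# Use the forced-choice column when institutional reporting requires it...
print(df["gender_forced"].value_counts())
# ...but keep the open-ended responses for the main analysis, coding them
# only as far as the research question requires.
print(df["gender_open"].value_counts())
```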
As one example, we provide below a sample of decisions researchers might make
around collecting data on gender and sexual identities. Similar thinking could inform
data collection on a number of demographic factors, as we illustrate in the Appendix
found at the end of this chapter.
Quantitative methods also tend to carry post-positivist assumptions about the nature of
research and the data. A practical struggle for researchers using those methods, then, is
to work against those post-positivist impulses. One way that research-
ers can do this is by openly writing about their assumptions, their epistemology, their
theoretical framework, and how they approach the tests. That type of writing is atypical
in quantitative methods but useful.
One important step in using quantitative methods for social justice and equity is to reject
the notion that these tests are somehow objective. All research is informed by researcher
and participant subjectivities. As others have suggested, the very selection of research
questions, hypotheses, measurement approaches, and statistical tests are all ideological
and subjective choices. While quantitative work is often presented as if it were devoid
values, political action, and subjectivity, such work is inherently political, unquestionably
ideological, and always subjective. A small but important step is acknowledging that sub-
jectivity, the researcher’s positionality, and the theoretical and ideological stakes. It is also
important for researchers to acknowledge when their subjectivities diverge from the com-
munities they study. As Zuberi and Bonilla-Silva (2008) convincingly argue, these meth-
ods were created through the logics of whiteness and, unless researchers work against that
tendency, will center whiteness at the expense of all other perspectives and knowledges.
Another practical strategy is to approach the data and the statistical tests more reflex-
ively. One of the problems with quantitative work is that by quantifying individuals
researchers inherently dehumanize their participants. Researchers using quantitative
methods must actively work to be more reflexive and to engage with the communities
from which their data are drawn in more continuous and purposeful ways. There are
statistics that are more person-centered than variable-centered (like cluster analysis,
multidimensional scaling, etc.), but even in those approaches, people are still reduced to
numbers. As a result, writing up those results requires work to rehumanize the partici-
pants and their experiences.
One way in which this plays out is in how researchers conceptualize error. Most quan-
titative models evidence an obsession with error. In fact, advances in quantitative meth-
ods over the past half-century have almost entirely centered around the reduction of and
accounting for error, sometimes to the point of ridiculousness. Lost in the quest to reduce
error is the fact that what we calculate as error or noise is often deeply meaningful. For
example, when statistical tests of curriculum models treat variation among teachers as
error or even noncompliance, they obscure the real work that teachers do of modifying
curricula to be more culturally responsive and appropriate for their individual students.
When randomly assigning students to receive different kinds of treatments or instruc-
tion, researchers treat within-group variation as error when it might actually be attrib-
utable to differences in subject positioning and intersubjectivity. Quantitative methods
might not ever be capable of fully accounting for the richness of human experiences that
get categorized as error, but researchers can work to conceptualize error differently,
and write about it in ways that open possibilities rather than dismiss that variation.
Quantitative evidence can also help persuade those in power (e.g., policymakers and
legislators) that the injustices marginalized communities voice are real and demand their
attention. While it is a sad commentary that the voices of marginalized communities are
not sufficient to move policymakers to action, the naturalized sense of quantitative meth-
ods as objective or neutral can be useful in shifting those policy conversations.
Others have attempted to integrate various critical theoretical frameworks with quan-
titative methods. One such approach is QuantCrit, which attempts to merge critical race
theory (CRT) and quantitative methods. Much has been written elsewhere about this
approach, but it has been used in research on higher education to challenge whiteness
in college environments (Teranishi, 2007). Similarly, experimental methods have been
used to document the presence of things like implicit bias, the collective toll of microag-
gressions, and to attempt to map the psychological processes of bias and discrimination
(Koonce, 2018; Strunk & Bailey, 2015).
Quantitative methods, such as those described in this text, can be used in equitable
and socially just ways. However, researchers must carefully think about the implications
of their work, how that work is intertwined with issues of inequity and oppression, and
how they can reimagine their approaches to work toward equity. Throughout the case
studies and examples in this text, we have been intentional to include examples that
speak to a commitment to social justice and equity. We have also included more “tradi-
tional” quantitative research examples to illustrate the broad array of approaches availa-
ble. But our hope is that researchers and students using this book will opt to move their
approaches toward more equitable, inclusive, and just methodologies.
Note
1 This chapter originally appeared in Strunk and Locke (2019) as a chapter in an edited vol-
ume. It has been modified and reproduced here by permission from Palgrave Macmillan, a
division of Springer. The original chapter appeared as: Strunk, K. K., & Hoover, P. D. (2019).
Quantitative methods for social justice and equity: Theoretical and practical considerations.
In K. K. Strunk & L. A. Locke (Eds.), Research methods for social justice and equity in educa-
tion (pp. 191–201). New York, NY: Palgrave.
Appendices

Areas Under the Normal Curve
In each block of three columns: z, the percentage of the distribution falling below that z, and the one-tailed p value (the area beyond z). The three-column blocks repeat across the page.
−3.00 0.13 .001 −2.72 0.33 .003 −2.44 0.73 .007 −2.16 1.54 .015
−2.99 0.14 .001 −2.71 0.34 .003 −2.43 0.75 .007 −2.15 1.58 .016
−2.98 0.14 .001 −2.70 0.35 .003 −2.42 0.78 .008 −2.14 1.62 .016
−2.97 0.15 .002 −2.69 0.36 .004 −2.41 0.80 .008 −2.13 1.66 .017
−2.96 0.15 .002 −2.68 0.37 .004 −2.40 0.82 .008 −2.12 1.70 .017
−2.95 0.16 .002 −2.67 0.38 .004 −2.39 0.84 .008 −2.11 1.74 .017
−2.94 0.16 .002 −2.66 0.39 .004 −2.38 0.87 .009 −2.10 1.79 .018
−2.93 0.17 .002 −2.65 0.40 .004 −2.37 0.89 .009 −2.09 1.83 .018
−2.92 0.18 .002 −2.64 0.41 .004 −2.36 0.91 .009 −2.08 1.88 .019
−2.91 0.18 .002 −2.63 0.43 .004 −2.35 0.94 .009 −2.07 1.92 .019
−2.90 0.19 .002 −2.62 0.44 .004 −2.34 0.96 .010 −2.06 1.97 .020
−2.89 0.19 .002 −2.61 0.45 .005 −2.33 0.99 .010 −2.05 2.02 .020
−2.88 0.20 .002 −2.60 0.47 .005 −2.32 1.02 .010 −2.04 2.07 .021
−2.87 0.21 .002 −2.59 0.48 .005 −2.31 1.04 .010 −2.03 2.12 .021
−2.86 0.21 .002 −2.58 0.49 .005 −2.30 1.07 .011 −2.02 2.17 .022
−2.85 0.22 .002 −2.57 0.51 .005 −2.29 1.10 .011 −2.01 2.22 .022
−2.84 0.23 .002 −2.56 0.52 .005 −2.28 1.13 .011 −2.00 2.28 .023
−2.83 0.23 .002 −2.55 0.54 .005 −2.27 1.16 .012 −1.99 2.33 .023
−2.82 0.24 .002 −2.54 0.55 .005 −2.26 1.19 .012 −1.98 2.39 .024
−2.81 0.25 .002 −2.53 0.57 .006 −2.25 1.22 .012 −1.97 2.44 .024
−2.80 0.26 .003 −2.52 0.59 .006 −2.24 1.25 .013 −1.96 2.50 .025
−2.79 0.26 .003 −2.51 0.60 .006 −2.23 1.29 .013 −1.95 2.56 .026
−2.78 0.27 .003 −2.50 0.62 .006 −2.22 1.32 .013 −1.94 2.62 .026
−2.77 0.28 .003 −2.49 0.64 .006 −2.21 1.36 .014 −1.93 2.68 .027
−2.76 0.29 .003 −2.48 0.66 .007 −2.20 1.39 .014 −1.92 2.74 .027
−2.75 0.30 .003 −2.47 0.68 .007 −2.19 1.43 .014 −1.91 2.81 .028
−2.74 0.31 .003 −2.46 0.69 .007 −2.18 1.46 .015 −1.90 2.87 .029
−2.73 0.32 .003 −2.45 0.71 .007 −2.17 1.50 .015 −1.89 2.94 .029
−1.88 3.01 .030 −1.47 7.08 .071 −1.06 14.46 .145 −0.65 25.78 .258
−1.87 3.07 .031 −1.46 7.21 .072 −1.05 14.69 .147 −0.64 26.11 .261
−1.86 3.14 .031 −1.45 7.35 .073 −1.04 14.92 .149 −0.63 26.44 .264
−1.85 3.22 .032 −1.44 7.49 .075 −1.03 15.15 .152 −0.62 26.76 .268
−1.84 3.29 .033 −1.43 7.64 .076 −1.02 15.39 .154 −0.61 27.09 .271
−1.83 3.36 .034 −1.42 7.78 .078 −1.01 15.62 .156 −0.60 27.43 .274
−1.82 3.44 .034 −1.41 7.93 .079 −1.00 15.87 .159 −0.59 27.76 .278
−1.81 3.52 .035 −1.40 8.08 .081 −0.99 16.11 .161 −0.58 28.10 .281
−1.80 3.59 .036 −1.39 8.23 .082 −0.98 16.35 .164 −0.57 28.43 .284
−1.79 3.67 .037 −1.38 8.38 .084 −0.97 16.60 .166 −0.56 28.77 .288
−1.78 3.75 .038 −1.37 8.53 .085 −0.96 16.85 .169 −0.55 29.12 .291
−1.77 3.84 .038 −1.36 8.69 .087 −0.95 17.11 .171 −0.54 29.46 .295
−1.76 3.92 .039 −1.35 8.85 .088 −0.94 17.36 .174 −0.53 29.81 .298
−1.75 4.01 .040 −1.34 9.01 .090 −0.93 17.62 .176 −0.52 30.15 .302
−1.74 4.09 .041 −1.33 9.18 .092 −0.92 17.88 .179 −0.51 30.50 .305
−1.73 4.18 .042 −1.32 9.34 .093 −0.91 18.14 .181 −0.50 30.86 .309
−1.72 4.27 .043 −1.31 9.51 .095 −0.90 18.41 .184 −0.49 31.21 .312
−1.71 4.36 .044 −1.30 9.68 .097 −0.89 18.67 .187 −0.48 31.56 .316
−1.70 4.46 .045 −1.29 9.85 .098 −0.88 18.94 .189 −0.47 31.92 .319
−1.69 4.55 .046 −1.28 10.03 .100 −0.87 19.22 .192 −0.46 32.28 .323
−1.68 4.65 .047 −1.27 10.20 .102 −0.86 19.49 .195 −0.45 32.64 .326
−1.67 4.75 .048 −1.26 10.38 .104 −0.85 19.77 .198 −0.44 33.00 .330
−1.66 4.85 .049 −1.25 10.56 .106 −0.84 20.05 .201 −0.43 33.36 .334
−1.65 4.95 .050 −1.24 10.75 .108 −0.83 20.33 .203 −0.42 33.72 .337
−1.64 5.05 .051 −1.23 10.93 .109 −0.82 20.61 .206 −0.41 34.09 .341
−1.63 5.16 .052 −1.22 11.12 .111 −0.81 20.90 .209 −0.40 34.46 .345
−1.62 5.26 .053 −1.21 11.31 .113 −0.80 21.19 .212 −0.39 34.83 .348
−1.61 5.37 .054 −1.20 11.51 .115 −0.79 21.48 .215 −0.38 35.20 .352
−1.60 5.48 .055 −1.19 11.70 .117 −0.78 21.77 .218 −0.37 35.57 .356
−1.59 5.59 .056 −1.18 11.90 .119 −0.77 22.06 .221 −0.36 35.94 .359
−1.58 5.71 .057 −1.17 12.10 .121 −0.76 22.36 .224 −0.35 36.32 .363
−1.57 5.82 .058 −1.16 12.30 .123 −0.75 22.66 .227 −0.34 36.69 .367
−1.56 5.94 .059 −1.15 12.51 .125 −0.74 22.96 .230 −0.33 37.07 .371
−1.55 6.06 .061 −1.14 12.71 .127 −0.73 23.27 .233 −0.32 37.45 .375
−1.54 6.18 .062 −1.13 12.92 .129 −0.72 23.58 .236 −0.31 37.83 .378
−1.53 6.30 .063 −1.12 13.14 .131 −0.71 23.89 .239 −0.30 38.21 .382
−1.52 6.43 .064 −1.11 13.35 .134 −0.70 24.20 .242 −0.29 38.59 .386
−1.51 6.55 .066 −1.10 13.56 .136 −0.69 24.51 .245 −0.28 38.97 .390
−1.50 6.68 .067 −1.09 13.79 .138 −0.68 24.83 .248 −0.27 39.36 .394
−1.49 6.81 .068 −1.08 14.01 .140 −0.67 25.14 .251 −0.26 39.74 .397
−1.48 6.94 .069 −1.07 14.23 .142 −0.66 25.46 .255 −0.25 40.13 .401
−0.24 40.52 .405 0.18 57.14 .429 0.60 72.57 .274 1.02 84.61 .154
−0.23 40.90 .409 0.19 57.53 .425 0.61 72.91 .271 1.03 84.85 .152
−0.22 41.29 .413 0.20 57.93 .421 0.62 73.24 .268 1.04 85.08 .149
−0.21 41.68 .417 0.21 58.32 .417 0.63 73.56 .264 1.05 85.31 .147
−0.20 42.07 .421 0.22 58.71 .413 0.64 73.89 .261 1.06 85.54 .145
−0.19 42.47 .425 0.23 59.10 .409 0.65 74.22 .258 1.07 85.77 .142
−0.18 42.86 .429 0.24 59.48 .405 0.66 74.54 .255 1.08 85.99 .140
−0.17 43.25 .433 0.25 59.87 .401 0.67 74.86 .251 1.09 86.21 .138
−0.16 43.64 .436 0.26 60.26 .397 0.68 75.17 .248 1.10 86.43 .136
−0.15 44.04 .440 0.27 60.64 .394 0.69 75.49 .245 1.11 86.65 .134
−0.14 44.43 .444 0.28 61.03 .390 0.70 75.80 .242 1.12 86.86 .131
−0.13 44.83 .448 0.29 61.41 .386 0.71 76.11 .239 1.13 87.08 .129
−0.12 45.22 .452 0.30 61.79 .382 0.72 76.42 .236 1.14 87.29 .127
−0.11 45.62 .456 0.31 62.17 .378 0.73 76.73 .233 1.15 87.49 .125
−0.10 46.02 .460 0.32 62.55 .375 0.74 77.04 .230 1.16 87.70 .123
−0.09 46.41 .464 0.33 62.93 .371 0.75 77.34 .227 1.17 87.90 .121
−0.08 46.81 .468 0.34 63.31 .367 0.76 77.64 .224 1.18 88.10 .119
−0.07 47.21 .472 0.35 63.68 .363 0.77 77.94 .221 1.19 88.30 .117
−0.06 47.60 .476 0.36 64.06 .359 0.78 78.23 .218 1.20 88.49 .115
−0.05 48.01 .480 0.37 64.43 .356 0.79 78.52 .215 1.21 88.69 .113
−0.04 48.40 .484 0.38 64.80 .352 0.80 78.81 .212 1.22 88.88 .111
−0.03 48.80 .488 0.39 65.17 .348 0.81 79.10 .209 1.23 89.07 .109
−0.02 49.20 .492 0.40 65.54 .345 0.82 79.39 .206 1.24 89.25 .108
−0.01 49.60 .496 0.41 65.91 .341 0.83 79.67 .203 1.25 89.44 .106
0.00 50.00 .500 0.42 66.28 .337 0.84 79.95 .201 1.26 89.62 .104
0.01 50.40 .496 0.43 66.64 .334 0.85 80.23 .198 1.27 89.80 .102
0.02 50.80 .492 0.44 67.00 .330 0.86 80.51 .195 1.28 89.97 .100
0.03 51.20 .488 0.45 67.36 .326 0.87 80.78 .192 1.29 90.15 .098
0.04 51.60 .484 0.46 67.72 .323 0.88 81.06 .189 1.30 90.32 .097
0.05 51.99 .480 0.47 68.08 .319 0.89 81.33 .187 1.31 90.49 .095
0.06 52.40 .476 0.48 68.44 .316 0.90 81.59 .184 1.32 90.66 .093
0.07 52.79 .472 0.49 68.79 .312 0.91 81.86 .181 1.33 90.82 .092
0.08 53.19 .468 0.50 69.14 .309 0.92 82.12 .179 1.34 90.99 .090
0.09 53.59 .464 0.51 69.50 .305 0.93 82.38 .176 1.35 91.15 .088
0.10 53.98 .460 0.52 69.85 .302 0.94 82.64 .174 1.36 91.31 .087
0.11 54.38 .456 0.53 70.19 .298 0.95 82.89 .171 1.37 91.47 .085
0.12 54.78 .452 0.54 70.54 .295 0.96 83.15 .169 1.38 91.62 .084
0.13 55.17 .448 0.55 70.88 .291 0.97 83.40 .166 1.39 91.77 .082
0.14 55.57 .444 0.56 71.23 .288 0.98 83.65 .164 1.40 91.92 .081
0.15 55.96 .440 0.57 71.57 .284 0.99 83.89 .161 1.41 92.07 .079
0.16 56.36 .436 0.58 71.90 .281 1.00 84.13 .159 1.42 92.22 .078
0.17 56.75 .433 0.59 72.24 .278 1.01 84.38 .156 1.43 92.36 .076
1.44 92.51 .075 1.85 96.78 .032 2.26 98.81 .012 2.67 99.62 .004
1.45 92.65 .073 1.86 96.86 .031 2.27 98.84 .012 2.68 99.63 .004
1.46 92.79 .072 1.87 96.93 .031 2.28 98.87 .011 2.69 99.64 .004
1.47 92.92 .071 1.88 96.99 .030 2.29 98.90 .011 2.70 99.65 .003
1.48 93.06 .069 1.89 97.06 .029 2.30 98.93 .011 2.71 99.66 .003
1.49 93.19 .068 1.90 97.13 .029 2.31 98.96 .010 2.72 99.67 .003
1.50 93.32 .067 1.91 97.19 .028 2.32 98.98 .010 2.73 99.68 .003
1.51 93.45 .066 1.92 97.26 .027 2.33 99.01 .010 2.74 99.69 .003
1.52 93.57 .064 1.93 97.32 .027 2.34 99.04 .010 2.75 99.70 .003
1.53 93.70 .063 1.94 97.38 .026 2.35 99.06 .009 2.76 99.71 .003
1.54 93.82 .062 1.95 97.44 .026 2.36 99.09 .009 2.77 99.72 .003
1.55 93.94 .061 1.96 97.50 .025 2.37 99.11 .009 2.78 99.73 .003
1.56 94.06 .059 1.97 97.56 .024 2.38 99.13 .009 2.79 99.74 .003
1.57 94.18 .058 1.98 97.61 .024 2.39 99.16 .008 2.80 99.74 .003
1.58 94.29 .057 1.99 97.67 .023 2.40 99.18 .008 2.81 99.75 .002
1.59 94.41 .056 2.00 97.72 .023 2.41 99.20 .008 2.82 99.76 .002
1.60 94.52 .055 2.01 97.78 .022 2.42 99.22 .008 2.83 99.77 .002
1.61 94.63 .054 2.02 97.83 .022 2.43 99.25 .007 2.84 99.77 .002
1.62 94.74 .053 2.03 97.88 .021 2.44 99.27 .007 2.85 99.78 .002
1.63 94.84 .052 2.04 97.93 .021 2.45 99.29 .007 2.86 99.79 .002
1.64 94.95 .051 2.05 97.98 .020 2.46 99.31 .007 2.87 99.79 .002
1.65 95.05 .050 2.06 98.03 .020 2.47 99.32 .007 2.88 99.80 .002
1.66 95.15 .049 2.07 98.08 .019 2.48 99.34 .007 2.89 99.81 .002
1.67 95.25 .048 2.08 98.12 .019 2.49 99.36 .006 2.90 99.81 .002
1.68 95.35 .047 2.09 98.17 .018 2.50 99.38 .006 2.91 99.82 .002
1.69 95.45 .046 2.10 98.21 .018 2.51 99.40 .006 2.92 99.82 .002
1.70 95.54 .045 2.11 98.26 .017 2.52 99.41 .006 2.93 99.83 .002
1.71 95.64 .044 2.12 98.30 .017 2.53 99.43 .006 2.94 99.84 .002
1.72 95.73 .043 2.13 98.34 .017 2.54 99.45 .005 2.95 99.84 .002
1.73 95.82 .042 2.14 98.38 .016 2.55 99.46 .005 2.96 99.85 .002
1.74 95.91 .041 2.15 98.42 .016 2.56 99.48 .005 2.97 99.85 .002
1.75 95.99 .040 2.16 98.46 .015 2.57 99.49 .005 2.98 99.86 .001
1.76 96.08 .039 2.17 98.50 .015 2.58 99.51 .005 2.99 99.86 .001
1.77 96.16 .038 2.18 98.54 .015 2.59 99.52 .005 3.00 99.87 .001
1.78 96.25 .038 2.19 98.57 .014 2.60 99.53 .005
1.79 96.33 .037 2.20 98.61 .014 2.61 99.55 .005
1.80 96.41 .036 2.21 98.64 .014 2.62 99.56 .004
1.81 96.48 .035 2.22 98.68 .013 2.63 99.57 .004
1.82 96.56 .034 2.23 98.71 .013 2.64 99.59 .004
1.83 96.64 .034 2.24 98.75 .013 2.65 99.60 .004
1.84 96.71 .033 2.25 98.78 .012 2.66 99.61 .004
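Values in this table can be reproduced (or extended to any z) with standard software. For example, a brief check in Python's scipy:

```python
# Reproduce normal-curve table entries: percentage below z and one-tailed p.
from scipy import stats

for z in (-1.77, 0.00, 1.96):
    below = stats.norm.cdf(z) * 100       # percentage of the curve below z
    p_one_tailed = stats.norm.sf(abs(z))  # area beyond |z|
    print(f"z = {z:+.2f}: {below:5.2f}% below, one-tailed p = {p_one_tailed:.3f}")
```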
Critical Values of t (α = .05)
df One-tailed Two-tailed
1 6.31 12.71
2 2.92 4.30
3 2.35 3.18
4 2.13 2.78
5 2.01 2.57
6 1.94 2.45
7 1.89 2.36
8 1.86 2.31
9 1.83 2.26
10 1.81 2.23
11 1.80 2.20
12 1.78 2.18
13 1.77 2.16
14 1.76 2.14
15 1.75 2.13
16 1.75 2.12
17 1.74 2.11
18 1.73 2.10
19 1.73 2.09
20 1.72 2.09
21 1.72 2.08
22 1.72 2.07
23 1.71 2.07
24 1.71 2.06
25 1.71 2.06
26 1.71 2.06
27 1.70 2.05
28 1.70 2.05
29 1.70 2.04
30 1.70 2.04
Critical Values of F (α = .05)
Rows give the denominator df (dfW); columns give the numerator df (dfB).

dfW   1       2       3       4       5
1     161.45  199.50  215.71  224.58  230.16
2     18.51   19.00   19.16   19.25   19.30
3     10.13   9.55    9.28    9.12    9.01
4     7.71    6.94    6.59    6.39    6.26
5     6.61    5.79    5.41    5.19    5.05
6     5.99    5.14    4.76    4.54    4.39
7     5.59    4.74    4.35    4.12    3.97
8     5.32    4.46    4.07    3.84    3.69
9     5.12    4.26    3.86    3.63    3.48
10    4.96    4.10    3.71    3.48    3.33
11    4.84    3.98    3.59    3.36    3.20
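The critical values in both tables above come from the inverse survival function of the corresponding distribution. For example, in Python's scipy:

```python
# Reproduce critical values from the t and F tables above (alpha = .05).
from scipy import stats

print(stats.t.isf(0.05, 10))    # one-tailed t, df = 10 -> ~1.81
print(stats.t.isf(0.025, 10))   # two-tailed t (alpha/2 per tail), df = 10 -> ~2.23
print(stats.f.isf(0.05, 2, 10)) # F with dfB = 2, dfW = 10 -> ~4.10
```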
Symbols
X: Score
X̄: Group mean, also written as M
X̿: Grand mean
s: Standard deviation
SD: Standard deviation
s²: Variance
Σ: Sum of
N: Sample size
H0: Null hypothesis
H1: Alternative hypothesis
$$s^2 = \frac{\sum(X - \bar{X})^2}{N - 1}$$

$$s = \sqrt{s^2} = \sqrt{\frac{\sum(X - \bar{X})^2}{N - 1}}$$

$$z = \frac{X - \bar{X}}{s}$$
Probabilities
$$p(A) = \frac{A}{N} \qquad p(AB) = p(A)\,p(B) \qquad p(A \text{ or } B) = p(A) + p(B)$$
One-sample tests
$$Z = \frac{M - \mu}{\sigma / \sqrt{N}} \qquad d = \frac{M - \mu}{\sigma}$$

$$t = \frac{M - \mu}{s / \sqrt{N}} \qquad d = \frac{M - \mu}{s}$$
Independent-samples t test

$$t = \frac{\bar{X} - \bar{Y}}{s_{diff}} \qquad s_{diff} = \sqrt{s^2_{diff}} \qquad s^2_{diff} = s^2_{M_X} + s^2_{M_Y}$$

$$s^2_{M_X} = \frac{s^2_{pooled}}{N_X} \qquad s^2_{M_Y} = \frac{s^2_{pooled}}{N_Y} \qquad s^2_{pooled} = \frac{df_X}{df_{total}}\,s^2_X + \frac{df_Y}{df_{total}}\,s^2_Y$$

$$d = \frac{\bar{X} - \bar{Y}}{s_{pooled}} \qquad \omega^2 = \frac{t^2 - 1}{t^2 + N_X + N_Y - 1}$$
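A brief Python sketch of these formulas, with two small made-up groups, may help make them concrete:

```python
# Independent-samples t test computed directly from the formulas above.
import numpy as np
from scipy import stats

x = np.array([5.0, 6, 4, 7, 6])   # made-up scores, group X
y = np.array([3.0, 4, 4, 5, 3])   # made-up scores, group Y

df_x, df_y = len(x) - 1, len(y) - 1
df_total = df_x + df_y
s2_pooled = (df_x / df_total) * x.var(ddof=1) + (df_y / df_total) * y.var(ddof=1)
s_diff = np.sqrt(s2_pooled / len(x) + s2_pooled / len(y))

t = (x.mean() - y.mean()) / s_diff
d = (x.mean() - y.mean()) / np.sqrt(s2_pooled)
omega2 = (t**2 - 1) / (t**2 + len(x) + len(y) - 1)
p = 2 * stats.t.sf(abs(t), df_total)  # two-tailed p
print(f"t({df_total}) = {t:.3f}, p = {p:.3f}, d = {d:.2f}, omega^2 = {omega2:.3f}")
```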
One-way ANOVA
Source    SS                                       df              MS                      F
Between   $SS_B = \sum(\bar{X} - \bar{\bar{X}})^2$    $df_B = k - 1$    $MS_B = SS_B / df_B$    $F = MS_B / MS_W$
Within    $SS_W = \sum(X - \bar{X})^2$             $df_W = n - k$    $MS_W = SS_W / df_W$
Total     $SS_T = \sum(X - \bar{\bar{X}})^2$       $df_T = n - 1$

$$\omega^2 = \frac{SS_B - (df_B)(MS_W)}{SS_T + MS_W}$$
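The same source table can be computed directly from raw data; a minimal numpy sketch with three made-up groups:

```python
# One-way ANOVA source table computed from the formulas above.
import numpy as np
from scipy import stats

samples = [np.array([4.0, 5, 6, 5]),
           np.array([6.0, 7, 8, 7]),
           np.array([5.0, 5, 6, 4])]          # made-up scores
all_scores = np.concatenate(samples)
grand = all_scores.mean()

ss_b = sum(len(s) * (s.mean() - grand) ** 2 for s in samples)
ss_w = sum(((s - s.mean()) ** 2).sum() for s in samples)
df_b = len(samples) - 1
df_w = len(all_scores) - len(samples)

ms_b, ms_w = ss_b / df_b, ss_w / df_w
F = ms_b / ms_w
omega2 = (ss_b - df_b * ms_w) / (ss_b + ss_w + ms_w)  # SS_T = SS_B + SS_W
print(f"F({df_b}, {df_w}) = {F:.3f}, p = {stats.f.sf(F, df_b, df_w):.3f}, "
      f"omega^2 = {omega2:.3f}")
```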
Factorial ANOVA
Source     SS                                                    df                              MS                                F
IV1        $SS_{IV1} = \sum(\bar{X}_{IV1} - \bar{\bar{X}})^2$       $k_{IV1} - 1$                     $SS_{IV1} / df_{IV1}$               $MS_{IV1} / MS_{within}$
IV2        $SS_{IV2} = \sum(\bar{X}_{IV2} - \bar{\bar{X}})^2$       $k_{IV2} - 1$                     $SS_{IV2} / df_{IV2}$               $MS_{IV2} / MS_{within}$
IV1*IV2    $SS_{total} - SS_{IV1} - SS_{IV2} - SS_{within}$      $(df_{IV1})(df_{IV2})$            $SS_{IV1*IV2} / df_{IV1*IV2}$       $MS_{IV1*IV2} / MS_{within}$
Within     $SS_{within} = \sum(X - \bar{X}_{cell})^2$            $N_{total} - (k_{IV1})(k_{IV2})$    $SS_{within} / df_{within}$
Total      $SS_{total} = \sum(X - \bar{\bar{X}})^2$              $N_{total} - 1$

$$\omega^2 = \frac{SS_E - (df_E)(MS_W)}{SS_T + MS_W}$$

(where E indexes the effect of interest: IV1, IV2, or the interaction)
Dependent (paired-samples) t test

$$t = \frac{\bar{D}}{SE_D} \qquad \bar{D} = \frac{\sum D}{N} \qquad SE_D = \sqrt{\frac{SS_D}{N(N - 1)}}$$

$$SS_D = \sum D^2 - \frac{(\sum D)^2}{N} \qquad \omega^2 = \frac{t^2 - 1}{t^2 + n - 1}$$
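And a matching sketch for the paired case, working directly from made-up pre/post difference scores:

```python
# Dependent (paired) t test from the formulas above, via difference scores.
import numpy as np
from scipy import stats

pre = np.array([5.0, 6, 5, 7, 6])    # made-up pre-test scores
post = np.array([4.0, 4, 3, 5, 4])   # made-up post-test scores
D = pre - post
N = len(D)

ss_d = (D**2).sum() - D.sum()**2 / N
se_d = np.sqrt(ss_d / (N * (N - 1)))
t = D.mean() / se_d
omega2 = (t**2 - 1) / (t**2 + N - 1)
p = 2 * stats.t.sf(abs(t), N - 1)    # two-tailed p
print(f"t({N - 1}) = {t:.3f}, p = {p:.3f}, omega^2 = {omega2:.3f}")
```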
Within-subjects ANOVA
Source           SS                                                         df                         MS                                  F
Between          $SS_{between} = \sum(\bar{X}_k - \bar{\bar{X}})^2$            $k - 1$                      $SS_{between} / df_{between}$         $MS_{between} / MS_{within}$
Subjects         $SS_{subjects} = \sum(\bar{X}_{subject} - \bar{\bar{X}})^2$   $n_{subjects} - 1$           $SS_{subjects} / df_{subjects}$       $MS_{subjects} / MS_{within}$
Within (error)   $SS_{within} = SS_{total} - SS_{between} - SS_{subjects}$   $(k - 1)(n_{subjects} - 1)$    $SS_{within} / df_{within}$

$$\eta^2 = \frac{SS_{between}}{SS_{between} + SS_{within}}$$
References
Fischer, C., Fishman, B., Levy, A., Dede, C., Lawrenze, F., Jia, Y., Kook, K., & McCoy, A. (2016).
When do students in low-SES schools perform better-than-expected on high-stakes tests? Analyzing
school, teacher, teaching, and professional development. Urban Education. Advance online publi-
cation. https://doi.org/10.1177/0042085916668953
Giroux, H. A. (2011). On critical pedagogy. New York, NY: Bloomsbury.
Glantz, S. A., Slinker, B. K., & Neilands, T. B. (2016). Primer on regression and analysis of variance
(3rd ed.). New York, NY: McGraw-Hill.
Goodman-Scott, E., Sink, C. A., Cholewa, B. E., & Burgess, M. (2018). An ecological view of school
counselor ratios and student academic outcomes: A national investigation. Journal of Counseling &
Development, 96(4), 388–398. https://doi.org/10.1002/jcad.12221
Guba, E. G., & Lincoln, Y. S. (1994). Competing paradigms in qualitative research. In N. Denzin & Y.
Lincoln (Eds.), Handbook of qualitative research (1st ed.). Thousand Oaks, CA: SAGE.
Hagen, K. S. (2005). Bad blood: The Tuskegee syphilis study and legacy recruitment for experimental
AIDS vaccines. New Directions for Adult & Continuing Education, 2005(105), 31–41. https://doi.
org/10.1002/ace.167
Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure in American
life. New York, NY: Free Press.
Institute for Education Sciences. (2003, December). Identifying and implementing educational
practices supported by rigorous evidence: A user friendly guide. National Center for Education
Evaluation and Regional Assistance. https://ies.ed.gov/ncee/pubs/evidence_based/randomized.asp
Kanamori, Y., Harrell-Williams, L. M., Xu, Y. J., & Ovrebo, E. (2019). Transgender affect misattribu-
tion procedure (transgender AMP): Development and initial evaluation of performance of a measure
of implicit prejudice. Psychology of Sexual Orientation and Gender Diversity. Online first publica-
tion. https://doi.org/10.1037/sgd0000343
Kennedy, B. R., Mathis, C. C., & Woods, A. K. (2007). African Americans and their distrust of the
health care system: Healthcare for diverse populations. Journal of Cultural Diversity, 14(2), 56–60.
Keppel, G., & Wickens, T. D. (2004). Design and analysis: A researcher’s handbook (4th ed.). New
York, NY: Pearson.
Kim, A. S., Choi, S., & Park, S. (2018). Heterogeneity in first-generation college students influencing
academic success and adjustment to higher education. Social Science Journal. Advance online pub-
lication. https://doi.org/10.1016/j.soscij.2018.12.002
Kincheloe, J. L., Steinberg, S. R., & Gresson, A. D. (1997). Measured lies: The bell curve examined.
New York, NY: St. Martins.
Koonce, J. B. (2018). Critical race theory and caring as channels for transcending borders between
an African American professor and her Latina/o students. International Journal of Multicultural
Education, 20(2), 101–116.
Lachner, A., Ly, K., & Nückles, M. (2018). Providing written or oral explanations? Differential effects
of the modality of explaining on students’ conceptual learning and transfer. Journal of Experimental
Education, 86(3), 344–361. https://doi.org/10.1080/00220973.2017.1363691
Lather, P. (2006). Paradigm proliferation as a good thing to think with: Teaching research in education
as a wild profusion. International Journal of Qualitative Studies in Education, 19(1), 35–37. https://
doi.org/10.1080/09518390500450144
Leonardo, Z., & Grubb, W. N. (2018). Education and racism: A primer on issues and dilemmas. New
York, NY: Routledge.
Mills, G. E. & Gay, L. R. (2016). Educational research: Competencies for analysis and applications
(12th ed.). Upper Saddle River, NJ: Prentice Hall.
National Center for Education Statistics. (2018). Digest of education statistics. https://nces.ed.gov/
programs/digest/d18/tables/dt18_105.30.asp
Nolan, K. (2014). Neoliberal common sense and race-neutral discourses: A critique of “evidence-based”
policy-making in school policing. Discourse: Studies in the Cultural Politics of Education, 36(6),
894–907. https://doi.org/10.1080/01596306.2014.905457
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York, NY: Wadsworth.
Perez, E. R., Schanding, G. T., & Dao, T. K. (2013). Educators’ perceptions in addressing bullying of
LGBTQ/gender nonconforming youth. Journal of School Violence, 12(1), 64–79. https://doi.org/10
.1080/15388220.2012.731663
Reed, S. J., & Miller, R. L. (2016). Thriving and adapting: Resilience, sense of community, and syn-
demics among young black gay and bisexual men. American Journal of Community Psychology,
57(1–2), 129–143. https://doi.org/10.1002/ajcp.12028
Richardson, T. Q. (1995). The window dressing behind The Bell Curve. School Psychology Review,
24(1), 42–44.
Shannonhouse, L., Lin, Y. D., Shaw, K., Wanna, R., & Porter, M. (2017). Suicide intervention train-
ing for college staff: Program evaluation and intervention skill measurement. Journal of American
College Health, 65(7), 450–456. https://doi.org/10.1080/07448481.2017.1341893
Shultz, K. S., Whitney, D. J., & Zickar, M. J. (2013). Measurement theory in action (2nd ed.). New
York, NY: Routledge.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African
Americans. Journal of Personality and Social Psychology, 69(5), 797–811.
Strunk, K. K. (in press). A critical theory approach to LGBTQ studies in quantitative methods courses.
In N. M. Rodriguez (Ed.), Teaching LGBTQ+ studies: Theoretical perspectives. New York, NY:
Palgrave.
Strunk, K. K., & Bailey, L. E. (2015). The difference one word makes: Imagining sexual orientation in
graduate school application essays. Psychology of Sexual Orientation and Gender Diversity, 2(4),
456–462. https://doi.org/10.1037/sgd0000136
Strunk, K. K., & Hoover, P. D. (2019). Quantitative methods for social justice and equity: Theoretical
and practical considerations. In K. K. Strunk & L. A. Locke (Eds.), Research methods for social
justice and equity in education (pp. 191–201). New York, NY: Palgrave.
Strunk, K. K., & Locke, L. A. (Eds.) (2019). Research methods for social justice and equity in educa-
tion. New York, NY: Palgrave.
Strunk, K. K., & Mwavita, M. (2020). Design and analysis in educational research: ANOVA designs
in SPSS. New York, NY: Routledge.
Teranishi, R. T. (2007). Race, ethnicity, and higher education policy: The use of critical quantitative
research. New Directions for Institutional Research, 2007(133), 37–49. https://doi.org/10.1002/
ir.203
Thompson, B. (Ed.). (2002). Score reliability: Contemporary thinking on reliability issues. Thousand
Oaks, CA: SAGE.
Thorndike, R. M. & Thorndike-Christ, T. (2010). Measurement and evaluation in psychology and
education (8th ed.). Boston, MA: Pearson.
U.S. Department of Health and Human Services. (n.d.). The Belmont report. Office for Human
Research Protections. https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html
Ungar, M., & Liebenberg, L. (2011). Assessing resilience across cultures using mixed methods:
Construction of the Child and Youth Resilience Measure. Journal of Mixed Methods Research, 5(2),
126–149.
Usher, E. L. (2018). Acknowledging the whiteness of motivation research: Seeking cultural relevance.
Educational Psychologist, 53(2), 131–144. https://doi.org/10.1080/00461520.2018.1442220
Valencia, R. R., & Suzuki, L. A. (2001). Intelligence testing and minority students: Foundations, per-
formance factors, and assessment issues. Thousand Oaks, CA: Sage.
Vishnumolakala, V. R., Southam, D. C., Treagust, D. F., Mocerino, M., & Qureshi, S. (2017). Students’
attitudes, self-efficacy, and experiences in a modified process-oriented guided inquiry learning
undergraduate chemistry classroom. Chemistry Education Research and Practice, 18(2), 340–352.
https://doi.org/10.1039/C6RP00233A
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and
purposes. American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
Zuberi, T., & Bonilla-Silva, E. (Eds.). (2008). White logic, white methods: Racism and methodology.
Lanham, MD: Rowman & Littlefield.