METHODS IN SECOND
LANGUAGE RESEARCH
Birdsong
Second Language Acquisition and the Critical Period Hypotheses (1999)
Ohta
Second Language Acquisition Processes in the Classroom: Learning Japanese (2001)
Major
Foreign Accent: Ontogeny and Phylogeny of Second Language Phonology (2001)
VanPatten
Processing Instruction: Theory, Research, and Commentary (2003)
VanPatten/Williams/Rott/Overstreet
Form-Meaning Connections in Second Language Acquisition (2004)
Bardovi-Harlig/Hartford
Interlanguage Pragmatics: Exploring Institutional Talk (2005)
Dörnyei
The Psychology of the Language Learner: Individual Differences in Second
Language Acquisition (2005)
Long
Problems in SLA (2007)
VanPatten/Williams
Theories in Second Language Acquisition (2007)
Ortega/Byrnes
The Longitudinal Study of Advanced L2 Capacities (2008)
Liceras/Zobl/Goodluck
The Role of Formal Features in Second Language Acquisition (2008)
Philp/Adams/Iwashita
Peer Interaction and Second Language Learning (2013)
VanPatten/Williams
Theories in Second Language Acquisition, Second Edition (2014)
Leow
Explicit Learning in the L2 Classroom (2015)
Dörnyei/Ryan
The Psychology of the Language Learner—Revisited (2015)
Yule
Referential Communication Tasks (1997)
Gass/Mackey
Stimulated Recall Methodology in Second Language Research (2000)
Markee
Conversation Analysis (2000)
Gass/Mackey
Data Elicitation for Second and Foreign Language Research (2007)
Duff
Case Study Research in Applied Linguistics (2007)
McDonough/Trofimovich
Using Priming Methods in Second Language Research (2008)
Dörnyei/Taguchi
Questionnaires in Second Language Research: Construction, Administration, and
Processing, Second Edition (2009)
Bowles
The Think-Aloud Controversy in Second Language Research (2010)
Jiang
Conducting Reaction Time Research for Second Language Studies (2011)
Barkhuizen/Benson/Chik
Narrative Inquiry in Language Teaching and Learning Research (2013)
Jegerski/VanPatten
Research Methods in Second Language Psycholinguistics (2013)
Larson-Hall
A Guide to Doing Statistics in Second Language Research Using SPSS and R,
Second Edition (2015)
Plonsky
Advancing Quantitative Methods in Second Language Research (2015)
Of Related Interest:
Gass
Input, Interaction, and the Second Language Learner (1997)
Gass/Sorace/Selinker
Second Language Learning Data Analysis, Second Edition (1998)
Mackey/Gass
Second Language Research: Methodology and Design (2005)
Edited by
Luke Plonsky
NORTHERN ARIZONA UNIVERSITY
First published 2015
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2015 Taylor & Francis
The right of Luke Plonsky to be identified as the author of the editorial
material, and of the authors for their individual chapters, has been asserted in
accordance with sections 77 and 78 of the Copyright, Designs and Patents
Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or
utilised in any form or by any electronic, mechanical, or other means, now
known or hereafter invented, including photocopying and recording, or in
any information storage or retrieval system, without permission in writing
from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered
trademarks, and are used only for identification and explanation without
intent to infringe.
Library of Congress Cataloging-in-Publication Data
Plonsky, Luke.
Advancing quantitative methods in second language research / Luke
Plonsky, Northern Arizona University.
pages cm. — (Second Language Acquisition Research Series)
Includes bibliographical references and index.
1. Second language acquisition—Research. 2. Second language
acquisition—Data processing. 3. Language and languages—Study and
teaching—Research. 4. Language acquisition—Research. 5. Language
acquisition—Data processing. 6. Quantitative research. 7. Multilingual
computing. 8. Computational linguistics. I. Title.
P118.2.P65 2015
401'.93—dc23
2014048744
ISBN: 978-0-415-71833-2 (hbk)
ISBN: 978-0-415-71834-9 (pbk)
ISBN: 978-1-315-87090-8 (ebk)
Typeset in Bembo
by Apex CoVantage, LLC
For Pamela
CONTENTS
List of Illustrations xi
Acknowledgments xvii
List of Contributors xix
PART I
Introduction 1
1 Introduction 3
Luke Plonsky
PART II
Enhancing Existing Quantitative Methods 21
PART III
Advanced and Multivariate Methods 129
Index 347
ILLUSTRATIONS
FIGURES
TABLES
Introduction
1
INTRODUCTION
Luke Plonsky
The field has seen a rapid increase in its awareness of methodological issues in the last
decade. Evidence of this movement, which holds that methodological rigor and
transparency are critical to advancing our knowledge of L2 learning and teaching,
is found in meta-analyses (e.g., Norris & Ortega, 2000), methodological syntheses
(e.g., Hashemi & Babaii, 2013; Plonsky & Gass, 2011), methodologically oriented
conferences and symposia (e.g., the Language Learning Currents conference in
2013), and a number of article- and book-length treatments raising method-
ological issues (e.g., Norris, Ross, & Schoonen, in press; Plonsky & Oswald, 2014;
Porte, 2012). This book aims to both contribute to and benefit from the momentum in this area, serving as a catalyst for much additional work seeking to advance
the means by which L2 research is conducted.
Themes
In addition to the general aim of moving forward quantitative L2 research, three
major themes present themselves across the volume. The first and most prevalent
theme is the role of researcher judgment in conducting each of the analyses pre-
sented here. Results based on statistical analyses can obscure the decisions made
throughout the research process that led to those results. As Huff (1954) states in
the now-classic How to Lie with Statistics, “despite its mathematical base, statistics
is as much an art as it is a science” (p. 120). As noted throughout this book, decision points abound in more advanced and multivariate statistics. These procedures
involve multiple steps and are particularly subject to the judgment of individual
researchers. Consequently, researchers must develop and combine not only sub-
stantive but also methodological/statistical expertise in order for the results of
such analyses to maximally inform L2 theory, practice, and future research.
The second theme, transparency, builds naturally on the first. Appropriate deci-
sion making is a necessary but insufficient requisite for the theoretical and/or
practical potential of a study to be realized. Choices made throughout the process
must also be justified in the written report, giving proper consideration to the
strengths and weaknesses resulting from each decision relative to other avail-
able options. Consumers of research can then more adequately and confidently
interpret study results. Of course, the need for transparency applies not only to
methodological procedures but also to the reporting of data (see Larson-Hall &
Plonsky, in press).
The third major theme found throughout this volume is the interrelatedness of
the procedures presented. Statistical techniques are often presented and discussed
in isolation despite great conceptual and statistical commonalities. ANOVA and
multiple regression, for example, are usually considered—and taught—as distinct
statistical techniques. However, ANOVA can be considered a type of regression
with a single, categorical predictor variable; see Cohen’s (1968) introduction to
the general linear model (GLM). The relationship between these procedures
can also be demonstrated statistically: The eta-squared effect size yielded by an
ANOVA will be equal to the R² from a multiple regression based on the same
independent/predictor and dependent/criterion variables. Both indices express
the amount of variance the independent variable accounts for in the dependent
variable. Whenever applicable, the chapters in this volume have drawn attention
to such similarities and shared utility among procedures.
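This equivalence is easy to verify by hand. The sketch below uses Python purely for illustration (its standard library suffices; the scores are invented, not data from this volume). It computes eta-squared from the one-way ANOVA sums of squares and R² from a regression on a dummy-coded group predictor:

```python
# Invented scores for two instructional groups (hypothetical data).
groups = {"traditional": [12, 15, 14, 10, 13], "experimental": [16, 18, 15, 19, 17]}

scores = [s for g in groups.values() for s in g]
grand_mean = sum(scores) / len(scores)

# Eta-squared from ANOVA: between-groups sum of squares over total sum of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
ss_total = sum((s - grand_mean) ** 2 for s in scores)
eta_squared = ss_between / ss_total

# R-squared from a regression of score on a dummy-coded (0/1) group predictor.
x = [code for code, g in enumerate(groups.values()) for _ in g]
mean_x = sum(x) / len(x)
cov = sum((xi - mean_x) * (s - grand_mean) for xi, s in zip(x, scores))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
r_squared = cov ** 2 / (ss_x * ss_total)

print(round(eta_squared, 4), round(r_squared, 4))  # identical values
```

With more than two groups, the same identity holds once the grouping variable is coded into k − 1 dummy predictors.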
Software
One of the challenges in preparing and using a book like this one is choosing the statistical software. Such a decision involves considering accessibility, cost, user friendliness, and consistency across chapters, among other issues. Furthermore, there are numerous options available, each of which possesses a unique set of strengths and weaknesses. IBM’s SPSS, for example, is very user friendly but can be costly. The default settings in SPSS can also lead to users not understanding the choices that the program makes for them (e.g., Mizumoto & Plonsky, in review; Plonsky & Gonulal, in press).

TABLE 1.1 Software used and available for procedures in this book
As shown in Table 1.1, most analyses in this book have been demonstrated using SPSS. To a much lesser extent, Microsoft Excel and R (R Development Core Team, 2014) have also been used along with, in a small number of cases, more specialized packages.
References
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70, 426–443.
Gass, S. (2009). A survey of SLA research. In W. Ritchie & T. Bhatia (Eds.), Handbook of
second language acquisition (pp. 3–28). Bingley, UK: Emerald.
Hashemi, M. R., & Babaii, E. (2013). Mixed methods research: Toward new research designs in applied linguistics. Modern Language Journal, 97, 828–852.
Huff, D. (1954). How to lie with statistics. New York: Norton & Company.
Larson-Hall, J. (2015). A guide to doing statistics in second language research using SPSS and R.
New York: Routledge.
Larson-Hall, J., & Plonsky, L. (in press). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning,
65, Supp. 1, 125–157.
Loewen, S., Lavolette, B., Spino, L. A., Papi, M., Schmidtke, J., Sterling, S., et al. (2014).
Statistical literacy among applied linguists and second language acquisition researchers.
TESOL Quarterly, 48, 360–388.
Mizumoto, A., & Plonsky, L. (in review). R as a lingua franca: Advantages of using R for
quantitative research in applied linguistics. Manuscript under review.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50, 417–528.
Norris, J. M., Ross, S., & Schoonen, R. (Eds.) (in press). Improving and extending quantitative
reasoning in second language research. Malden, MA: Wiley.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodologi-
cal synthesis and call for reform. Modern Language Journal, 98, 450–470.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Gonulal, T. (2015). Methodological reviews of quantitative L2 research: A
review of reviews and a case study of exploratory factor analysis. Language Learning,
65, Supp. 1, 9–35.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Porte, G. (Ed.) (2012). Replication research in applied linguistics. New York: Cambridge Uni-
versity Press.
R Development Core Team. (2014). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychol-
ogy journals: Guidelines and explanations. American Psychologist, 54, 594–604.
2
WHY BOTHER LEARNING
ADVANCED QUANTITATIVE
METHODS IN L2 RESEARCH?
James Dean Brown
There is so much more to be learned from using follow-up analyses and more still from
thinking about all of your results as one comprehensive picture of what is going
on in your data.
the picture at the same time, and might therefore see relationships between and
among variables (all at once) that you might otherwise have missed or failed to
understand.
Indeed, you will gain an even more comprehensive view of the data and results
for a particular area of research by learning about and applying an advanced
technique called meta-analysis. As Plonsky and Oswald explain (Chapter 6 in this
volume), meta-analysis can be defined narrowly as “a statistical method for cal-
culating the mean and the variance of a collection of effect sizes across studies,
usually correlations (r) or standardized mean differences (d )” or broadly as “not
only these narrower statistical computations, but also the conceptual integration
of the literature and the findings that gives the meta-analysis its substantive meaning” (p. 106). Truly, this advanced form of analysis will give you the much broader
perspective of comparing the results from a number of (sometimes contradictory)
studies in the same area of research.
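In the narrow sense, then, the core of a meta-analysis is a weighted mean and variance of the collected effect sizes. The sketch below uses invented d values and simple sample-size weighting (published meta-analyses more often use inverse-variance weights, but the logic is the same; Python is used only because its standard library suffices):

```python
# Invented (d, n) pairs from five hypothetical primary studies.
studies = [(0.42, 30), (0.61, 24), (0.15, 80), (0.58, 45), (0.33, 60)]

total_n = sum(n for _, n in studies)
mean_d = sum(d * n for d, n in studies) / total_n                 # weighted mean effect
var_d = sum(n * (d - mean_d) ** 2 for d, n in studies) / total_n  # weighted variance

print(round(mean_d, 3), round(var_d, 4))
```

The variance term is what licenses the broader, conceptual step: a large between-study variance signals that the effect is moderated by study features worth integrating substantively.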
between two continuous scales (i.e., interval or ratio) or to predict one of those
scales from the other. However, more advanced statistical analyses offer consider-
ably more flexibility. For instance, multiple regression (see Jeon, Chapter 7 in this
volume) allows you the possibility of predicting one dependent variable from
multiple continuous and/or categorical independent variables. Discriminant func-
tion analysis (see Norris, Chapter 13 in this volume) makes it possible to predict
a categorical variable from multiple continuous variables (or more accurately, to
determine the degree to which the continuous variables correctly classify mem-
bership in the categories). Logistic regression makes it possible to predict a categori-
cal variable such as group membership from categorical or continuous variables,
or both. Loglinear modeling can be applied to purely categorical data to test the
fit of a regression-like equation to the data. For excellent coverage of all of these
forms of analysis, see Tabachnick and Fidell (2013).
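As a concrete illustration of the last point about logistic regression, a bare-bones model can be fit with nothing more than gradient ascent on the log-likelihood. Everything below is invented for illustration and uses only the Python standard library; in practice one would of course use SPSS, R, or a comparable package:

```python
import math

# Invented data: predict a pass/fail outcome (1/0) from a proficiency score.
scores = [35, 42, 48, 50, 55, 60, 64, 70, 78, 85]
passed = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Standardize the predictor so a single learning rate behaves well.
m = sum(scores) / len(scores)
sd = (sum((x - m) ** 2 for x in scores) / len(scores)) ** 0.5
z = [(x - m) / sd for x in scores]

# Gradient ascent on the logistic log-likelihood for intercept b0 and slope b1.
b0 = b1 = 0.0
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(z, passed):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += y - p          # gradient with respect to b0
        g1 += (y - p) * x    # gradient with respect to b1
    b0 += 0.05 * g0
    b1 += 0.05 * g1

def predict(score):
    """Model-predicted probability of a 'pass' for a raw score."""
    return 1 / (1 + math.exp(-(b0 + b1 * (score - m) / sd)))

print(round(predict(65), 2))
```

The fitted slope is positive, so predicted pass probabilities rise with proficiency, which is exactly the kind of categorical-from-continuous prediction described above.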
Other advanced statistical procedures provide the flexibility to look beyond
simple relationships to patterns in relationships. For example, instead of look-
ing at a correlation coefficient or a matrix of simple correlation coefficients, it
is possible to examine patterns in those correlation coefficients by performing
factor analysis, which can reveal subsets of variables in a larger set of variables that
are related within subsets, yet are fairly independent between subsets. The three
types of factor analysis (principal components analysis, factor analysis, and con-
firmatory factor analysis; see Chapter 9 in this volume for Loewen and Gonulal’s
explanation of the differences) can help you understand the underlying pattern
of relationships among your variables, and thereby help you to: (a) determine
which variables are redundant and therefore should be eliminated (as described
earlier); (b) decide which variables or combination of variables to use in sub-
sequent analyses; and (c) item-analyze, improve, and/or validate your measures.
In contrast, cluster analysis is a “multivariate exploratory procedure that is used
to group cases (e.g., participants or texts). Cluster analysis is useful in studies
where there is extensive variation among the individual cases within predefined
categories” (Staples & Biber, Chapter 11 in this volume, p. 243). Also useful is
multiway analysis, which can help you study the associations among three or
more categorical variables (see Tabachnick & Fidell, 2013 for more on multiway
analysis).
Another form of analysis that provides you with considerable flexibility is structural equation modeling (SEM). SEM combines ideas that underlie many of the other forms of analysis discussed
here, but can additionally be used to model theories (a) to investigate if your data
fit them, (b) to compare that fit for several data sets (e.g., for boys and girls), or
(c) to examine changes in fit longitudinally.
With regard to means comparisons, mixed effects models (see Cunnings &
Finlayson, Chapter 8 in this volume), which by definition are models that include
both fixed and random effects, are flexible enough to be used with data that are
normally distributed or that are categorical (i.e., nonnumeric). In addition, mixed
effects models are especially useful when designs are unbalanced (i.e., groups
have different numbers of participants in each) or have missing data. Importantly,
if you are studying learning over time, these models can accommodate repeated
measures in longitudinal studies.
The conceptual difference between null hypothesis testing and the Bayes-
ian alternative is that predictions about mean differences are stated a priori
in a hierarchy of differences as motivated by theory-driven claims. . . . In this
approach, the null hypothesis is typically superfluous, as the researchers aim
to confirm that the predicted order of mean differences are instantiated in
the data. Support for the hierarchically ordered means hypothesis is evident
only if the predicted order of mean differences is observed. The predicted
and plausible alternative hypotheses thus must be expressed in advance of
the data analysis—thus making the subsequent ANOVA confirmatory.
(Mackey & Ross, Chapter 14 in this volume, p. 334)
Clearly, this advanced alternative form of analysis not only provides a means for
examining data hierarchically and with consideration to previous findings and/
or theoretical predictions, but in fact, it also demands that the data be examined
in that way from the outset.
Additional Assumptions
Another disadvantage of the more advanced statistical procedures is that they
tend to require that additional assumptions be met. Where a simple correlation
coefficient will have three assumptions, a multiple regression analysis will have
at least five assumptions, two of which will require the data screening discussed
in the next paragraph. In addition, whereas for univariate statistics a good deal
is known about the robustness of violating assumptions (e.g., it is known that
ANOVA is fairly robust to violations of the assumption of equal variances if
the cell sizes are fairly similar), less is known about such robustness in the more
complex designs of advanced statistical procedures. For a summary of assumptions
underlying univariate and some multivariate statistics, see Brown (1992), or for
multivariate statistics, see the early sections of each of the chapters in Tabachnick
and Fidell (2013).
results, a higher probability of finding significant results if they exist, more power-
ful results, and ultimately to more credible results in your own mind as well as in
the minds of your readers.
Additional Assumptions
Checking the more elaborate assumptions of advanced statistical tests forces you
to slow down at the beginning of your analyses and think about the descriptive
statistics, the shapes of the distributions involved, the reliability of various mea-
surements, the amounts of variance involved and accounted for, the degrees of
redundancy among variables, any univariate or multivariate outliers, and so forth.
Ultimately, all of this taken together with the results of the study can and should
lead to greater understanding of your data and results.
analyses is not all bad. Indeed, it can lead you to exciting places you never thought
you would go.
Conclusion
In writing this chapter, I wrestled with using the word advantages. Perhaps it is bet-
ter to think about the advanced procedures described here as opening up options
rather than as having advantages—but then it occurred to me that people with
those options will have distinct advantages, so I stayed with the idea of advantages.
That is not to say that using advanced statistics, especially multivariate analyses,
for every study will be the best way to go. For example, I once had a student who
hated statistics so much that he set out to write a paper that used only descriptive
statistics and a single t-test, and he did it, writing an elegant, straightforward, and
interesting paper. Simple as it was, he was using exactly the right tools for that
research project.
However, learning new, advanced statistical techniques can help you to stay
interested and up-to-date in your research. Having multiple options can also help
you avoid getting stuck in a statistical rut. For instance, I know of one researcher
in our field who clearly learned multiple regression (probably for her disserta-
tion) and has used that form of analysis repeatedly and almost exclusively across
a number of studies. She is clearly stuck in a statistical rut. She is holding a ham-
mer, so she uses it for everything, including screws. I just wish she would extend
her knowledge to include some other advanced statistical procedures, especially
extensions of regression like factor analysis or SEM.
The bottom line here is that advanced statistics like those covered in this book
can be useful and even exciting to learn, but the harsh reality is that these forms
of analysis will mean nothing without good ideas, solid research designs, reliable
measurement, sound data collection, adequate data screening, careful checking of
assumptions, and comprehensive interpretations that include all facets of the data,
their distributions, and all of the statistics in the study.
Fortunately, you have this book in your hands. I say fortunately because this col-
lection of chapters is a particularly good place for L2 researchers to start expanding
their knowledge of advanced statistical procedures: It covers advanced statistical
techniques; it was written by L2 researchers; it was written for L2 researchers; and
it contains examples drawn from L2 research.
Good researching!
References
Brown, J. D. (1990). The use of multiple t tests in language research. TESOL Quarterly,
24(4), 770–773.
Brown, J. D. (1992). Statistics as a foreign language—Part 2: More things to look for in read-
ing statistical language studies. TESOL Quarterly, 26(4), 629–664.
Brown, J. D. (2007). Statistics Corner. Questions and answers about language testing sta-
tistics: Sample size and power. Shiken: JALT Testing & Evaluation SIG Newsletter, 11(1),
31–35. Also retrieved from http://www.jalt.org/test/bro_25.htm
Brown, J. D. (2008a). Statistics Corner. Questions and answers about language testing statis-
tics: Effect size and eta squared. Shiken: JALT Testing & Evaluation SIG Newsletter, 12(2),
36–41. Also retrieved from http://www.jalt.org/test/bro_28.htm
Brown, J. D. (2008b). Statistics Corner. Questions and answers about language testing statistics: The Bonferroni adjustment. Shiken: JALT Testing & Evaluation SIG Newsletter, 12(1),
23–28. Also retrieved from http://www.jalt.org/test/bro_27.htm
Brown, J. D. (2011a). Statistics Corner. Questions and answers about language testing sta-
tistics: Likert items and scales of measurement. Shiken: JALT Testing & Evaluation SIG
Newsletter, 15(1), 10–14. Also retrieved from http://www.jalt.org/test/bro_34.htm
Brown, J. D. (2011b). Statistics Corner. Questions and answers about language testing sta-
tistics: Confidence intervals, limits, and levels? Shiken: JALT Testing & Evaluation SIG
Newsletter, 15(2), 23–27. Also retrieved from http://www.jalt.org/test/bro_35.htm
Mueller, R. O., & Hancock, G. R. (2008). Best practices in structural equation modeling.
In J. Osborne (Ed.). Best practices in quantitative methods (pp. 488–508). Thousand Oaks,
CA: Sage.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Boston: Pearson.
Ullman, J. B. (2006). Structural Equation Modeling: Reviewing the basics and moving
forward. Journal of Personality Assessment, 87(1), 35–50.
PART II
Enhancing Existing
Quantitative Methods
3
STATISTICAL POWER, P VALUES,
DESCRIPTIVE STATISTICS, AND
EFFECT SIZES
A “BACK-TO-BASICS” APPROACH
TO ADVANCING QUANTITATIVE
METHODS IN L2 RESEARCH
Luke Plonsky
Introduction
Methodologically speaking, a great deal of quantitative L2 research has been mis-
guided. All too often we have been asking the wrong questions of our data. Con-
sequently, many of the answers we have derived have been, at best, weak in their
ability to inform theory and practice and, at worst, wrong or misleading. This
chapter seeks to reorient the field toward more appropriate kinds of questions
and analytical approaches. More specifically, I argue here against the field’s flawed
use and interpretation of statistical significance and, instead, in favor of more
thorough consideration of descriptive statistics including effect sizes and confi-
dence intervals (CIs). The approach I advocate in this chapter is not only more
basic, statistically speaking, and more computationally straightforward, but it is also
inherently more informative and more accurate when compared to the most fun-
damental and commonly used analyses such as t tests, ANOVAs, and correlations.
I begin the chapter with a model that describes quantitative L2 research as cur-
rently practiced, pointing out major flaws in our approach. I then review major weak-
nesses of relying on statistical significance ( p values), particularly in the case of tests
comparing means (t tests, ANOVAs) and correlations. I follow this discussion with a
brief introduction to the notion of statistical power, followed by guides to calculating
and using effect sizes and other descriptive statistics including CIs. I conclude with a
revised/proposed model of what quantitative L2 research might look like if we were
to embrace this approach. Points made throughout the discussion are illustrated with
data-based examples, many of which can be replicated using the practice data set that
accompanies this chapter (http://oak.ucc.nau.edu/ldp3/AQMSLR.html). Unlike
much of the remainder of this book, the statistical issues in this chapter are very
simple. Nevertheless, these ideas largely go against what is often taught in introduc-
tory research methods courses and certainly what is found in most L2 journals.
Before beginning the main discussion, I also want to emphasize that the con-
cepts and procedures in this chapter, though far from mainstream L2 research
practice, are central to a set of methodological reforms currently gaining traction
in the field. Among other issues, this movement has sought to (a) encourage rep-
lication research (Porte, 2012), (b) promote a synthetic ethic in primary as well as
secondary research (e.g., Norris & Ortega, 2000, 2006; Oswald & Plonsky, 2010;
Plonsky & Oswald, Chapter 6 in this volume), (c) critically reflect on and exam-
ine methodological practices and self-efficacy (e.g., Larson-Hall & Plonsky, 2015;
Loewen et al., 2014; Plonsky, 2013, 2014), and (d) introduce novel analytical tech-
niques (e.g., Cunnings, 2012; Larson-Hall & Herrington, 2010; LaFlair, Egbert, &
Plonsky, Chapter 4 in this volume; Plonsky, Egbert, & LaFlair, in press). Taking
yet another step back, it is also worth noting that, although many of the concepts
and techniques embodied by this movement and discussed in this chapter may be
unfamiliar to L2 researchers, they have been recognized for decades as the pre-
ferred means to conducting basic quantitative research among methodologists in
other social sciences such as psychology and education.
FIGURE 3.1 [flowchart; first step: “Conduct a study (e.g., the effects of A on B)”]
Once the data are collected and analyzed using, for example, a t-test or Pearson
correlation, most researchers will take special note of the p value associated with
the results of those tests. As depicted in Figure 3.1, if on one hand the p value is
larger than .05, the difference between groups or the correlation is often consid-
ered uninteresting and is discarded, and another study might be run to attempt to
achieve a statistically significant result. On the other hand, if the t-test or correlation yields a statistically significant result (i.e., p < .05), it is considered important
and is much more likely to get published and to consequently have an impact on
L2 theory, future research, and practice.
In this model, which is, again, the dominant approach in quantitative L2
research, researcher perception and dissemination of study results both hinge
critically on our adherence to null hypothesis significance testing (NHST). As
I describe in the remainder of this section, this approach is deeply flawed on many
accounts, both conceptually and statistically. I focus here, though, on three main
arguments: (a) NHST is unreliable, (b) NHST is crude and uninformative, and
(c) NHST is arbitrary. Among the many other, more comprehensive accounts
of the inherent flaws in NHST, I recommend Kline (2013, Chapter 3), Norris
(in press), and Cumming (2012, Chapter 2).
NHST Is Unreliable
The first major flaw of NHST is that it is unreliable. More specifically, because
p values vary as a function of sample size, any correlation or difference in mean
scores can reach statistical significance, given a large enough sample. Consider
the (fabricated) data from three studies in Tables 3.1–3.3, each of which, let’s say,
is interested in comparing the effects of traditional (Group 1) with experimen-
tal (Group 2) approaches to teaching vocabulary. A t-test comparing the means
in Study 1 found no difference between the two groups, which each have five
participants.
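The sample-size dependence can be seen in the t statistic itself: with the group means and standard deviations held constant, t grows with the square root of the group size, so the same descriptive result drifts toward “significance” as n increases. A brief sketch with fabricated descriptives (these are not the values in Tables 3.1–3.3):

```python
import math

def t_stat(m1, m2, sd, n):
    """Independent-samples t for two groups of equal size n and equal SDs."""
    return (m1 - m2) / math.sqrt(2 * sd ** 2 / n)

# Identical means (12 vs. 14) and SDs (3); only the group size changes.
for n in (5, 15, 45):
    print(n, round(t_stat(12, 14, 3, n), 2))
```

From n = 5 to n = 45 the absolute value of t exactly triples (√9 = 3), while Cohen’s d, which does not depend on n, stays at 0.67 throughout.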
Study 2 collected data from 15 participants in each condition. Although their
means and standard deviations were identical to those in Study 1, the p value in
results to a yes/no dichotomy, often overlooking or even ignoring the rich infor-
mation provided by our descriptive statistics. By doing so, we waste our data and
we fail to accurately or informatively advance L2 theory, research, and practice.
To be sure, p values tell us nothing about (a) replicability, (b) theoretical or
practical importance, or, perhaps most importantly, (c) magnitude of effects. A p
value of greater than .05 does not necessarily indicate that there is no difference
between two group means or even that there is a small difference between two
group means. Nevertheless, many researchers interpret it that way, falling prey to
what Cumming (2012) calls the “slippery slope of nonsignificance” (p. 31). Like-
wise, very small p values can certainly correspond to small effects.
To illustrate the lack of informational value provided by p values, consider
the following examples from published L2 studies. In one study published
recently the authors present the results of a t-test comparing the “ideal L2 self ”
ratings for high- (M = 4.65, SD = 1) and low-motivation (M = 4.56, SD = 1.1)
learners. The t-test yielded a nonstatistically significant p value, indicating no
difference between the two groups. This result, to be expected given the very
similar descriptives, was confirmed by a very small eta-squared effect size of
.002, which we can understand to mean that group membership (i.e., high vs.
low motivation) explains less than 1% of the variance in ideal self ratings. In the
same table, the authors present the results of another t-test comparing the same
two groups on ought-to self ratings. The mean score was 3.74 (SD = 1.1) for
the high-motivation group and 3.96 (SD = 1) for the low-motivation group. In
this case, however, the t-test revealed a statistically significant difference between
the groups. Are we then to interpret the difference between groups here to be
large or important? The eta-squared value for this contrast was just .01, indicat-
ing that group membership could explain 1% of the variance in group means.
From a dichotomous NHST perspective, one of these tests reveals an important
difference in group means and the other does not. From the perspective of
practical significance based on the effect size and other descriptive statistics, it
is clear that the two groups are nearly identical. (See results related to Table 4
in Mackey & Sachs, 2012, for a counterexample wherein the authors correctly
interpret substantial correlations despite the nonstatistical p values associated
with them.)
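The relationship between these descriptives and the reported eta-squared values can be checked directly. As a rough sketch (assuming roughly equal group sizes, for which eta-squared ≈ d²/(d² + 4)), the following recomputes both effects from the means and standard deviations reported above:

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Standardized mean difference using the pooled SD (equal-n case)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

def eta_squared_from_d(d):
    """Approximate eta-squared for two equal-sized groups: d^2 / (d^2 + 4)."""
    return d ** 2 / (d ** 2 + 4)

# Ideal L2 self: high- (M = 4.65, SD = 1) vs. low-motivation (M = 4.56, SD = 1.1)
d_ideal = cohens_d(4.65, 1.0, 4.56, 1.1)
print(round(eta_squared_from_d(d_ideal), 3))  # 0.002, matching the reported value

# Ought-to self: low- (M = 3.96, SD = 1) vs. high-motivation (M = 3.74, SD = 1.1)
d_ought = cohens_d(3.96, 1.0, 3.74, 1.1)
print(round(eta_squared_from_d(d_ought), 2))  # 0.01, matching the reported value
```

Either way, group membership accounts for about 1% or less of the variance, regardless of which p value happened to cross the .05 threshold.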
Consider as well the results in Table 3.4 which were extracted from nine
primary studies in Taylor, Stevens, and Asher’s (2006) meta-analysis of the effects
of reading strategy instruction. Three distinct patterns of results can be observed
in this sample, each of which reveals the crudeness of p. First, although the
means being compared in studies A–E were not found to be statistically signifi-
cant, their effect sizes (Hedges’ g, which expresses mean difference in standard
deviation units, similar to Cohen’s d ) were substantial—certainly more than
the null effect we might interpret based on a nonstatistical p value. These effect sizes were, in fact, almost identical to, if slightly larger than, those in studies F and G, whose p values did fall below .05.
28 Luke Plonsky
TABLE 3.4 Sample sizes, effect sizes, and p values from nine studies of reading strategy instruction*

Study    n1    n2    Hedges' g     p
A        12    15     −.555      .152
B         8     8      .556      .259
C        30    29      .492      .060
D        24    21      .553      .066
E        21    22      .472      .123
F        78    80      .481      .003
G       183    61      .530      .000
H        29    14     −.251      .436
I        12    14     −.292      .450

*Results from Taylor et al. (2006)
NHST Is Arbitrary
Students in introductory research methods courses often ask what is so special
about the .05 level of statistical significance. The answer, of course, is nothing—a
sentiment Rosnow and Rosenthal (1989) had in mind when they quipped, “surely, God loves the .06 nearly as much as the .05” (p. 1277). Nevertheless, much of the field lives (or at least publishes) according to an arbitrary standard for importance.
To summarize the discussion thus far, quantitative L2 research relies very
strongly on an analytical approach that is unreliable and arbitrary. Even if
NHST-based findings were stable and principled, results based on this approach
would still fail to provide us any indication of the kinds of information we are
most interested in or that can guide L2 theory and practice. Consequently, unless
we are content to attempt to advance our field in this fashion (i.e., based on arbi-
trary, unreliable, yes/no-only results), we must change our approach (see Norris,
in press).
A “Back-to-Basics” Approach 29
Statistical Power
A closely related notion, statistical power, is the probability of observing a statistically significant relationship given that the null hypothesis is false (e.g., d ≠ 0; r ≠ 0). The more powerful the study, the lower the likelihood of false negatives. An
understanding of power can also be used to answer the very practical and frequent
question of “How many participants do I need (to detect statistical significance)?”
(That is, assuming we are still interested in statistical significance.)
The conventionally desired level of statistical power in the social sciences is .80
which, when achieved, provides the researcher with an 80% chance of detecting
a statistical relationship if present (Cohen, 1992). (Note that the .80 convention
for avoiding false negatives is much more liberal than the typical safeguard for
avoiding false positives of .05. In the former, we implicitly accept an error rate of
20%; in the latter the accepted error rate is theoretically only 5%.) But how can
we determine if .80 power is possible? As with statistical significance, power varies
as a function of the effect size and sample size such that, given a larger anticipated
effect (e.g., d ≈ 1), a smaller sample will be able to detect a statistical relationship
80% of the time (N ≈ 35). Likewise, when a small effect (e.g., d ≈ .2) is expected
based on theoretical predictions and/or previous research, a larger sample
(N ≈ 400) is needed in order to have an 80% chance of finding the effect at the
.05 level.1
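These required sample sizes can be sketched with a normal-approximation shortcut (an assumption on my part; dedicated tools such as G*Power use exact noncentral distributions and will return slightly larger values). For a two-tailed, two-sample comparison of means:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-tailed, two-sample comparison of means,
    via the normal approximation: n = 2 * ((z_{1-alpha/2} + z_power) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-tailed
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(1.0))  # large effect: 16 per group under this approximation
print(n_per_group(0.2))  # small effect: 393 per group under this approximation
```

Note that these are per-group figures; exact power routines typically add a participant or two per group.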
A related exercise and consideration might be to estimate the statistical power
in previous L2 research, much of which relies necessarily on small samples. Plon-
sky and Gass (2011) examined this issue by means of a post hoc power analysis for
174 studies in the interactionist tradition of L2 research. Their results show that
this subdomain has had, on average, just a 56% chance of obtaining statistically
significant results. Likewise, looking at 606 primary studies across many different
subdomains of L2 research, Plonsky (2013) found average post hoc power at just
.57. These results can be interpreted as indicating that the likelihood of observing
expected relationships is, on average, comparable to tossing a coin and hoping for
heads.
Evidence of what I refer to as the “power problem” (Plonsky, 2013, p. 678) in
L2 research does not stop there. Additional indications include (a) extremely rare
use of power analyses in order to inform sampling decisions, (b) generally small
samples / high sampling error, (c) heavy reliance on NHST, (d) presence of non-
normal distributions and a lack of checking for statistical assumptions, and (e) rel-
atively infrequent use of multivariate statistics that can preserve experiment-wise
power (Plonsky, 2013).
One step toward addressing this problem is to determine sample sizes based on
a priori power analyses, rather than simply based on convenience or convention.
Using free software such as G*Power (Faul, Erdfelder, Lang, & Buchner, 2007) or
any number of freely available online calculators designed for this purpose, you
can calculate the sample size needed for a given level of statistical power such as
.80. The only information you need to bring to the equation is the anticipated
effect size. One source for obtaining this value would be a meta-analysis on a
topic closely related to that of the study. In the absence of a relevant meta-analytic
effect size, you could also plug into the equation the effect size from one or more
studies on a closely related topic.
At this point I should recognize that in some instances it is not possible to col-
lect data from a sample large enough to obtain an ideal level of statistical power.
For example, researchers who study learners of less commonly taught languages
may find it difficult to obtain large samples. Similarly, funding may not be available to pay as many participants as are needed for adequate power. These problems
are further compounded in cases where the anticipated effect size is small, thus
necessitating a larger sample. In such cases, I recommend taking one or more of
the three following courses of action. First, when you know that a study lacks
statistical power, you should avoid the use of statistical testing. Focus instead on
the descriptives, including effect sizes and CIs (see discussion below). Second, in
addition to avoiding tests of statistical significance, underpowered studies should
also address fewer contrasts between or among groups. For example, if you expect to be able to recruit only 35 participants, divide them into two groups/conditions rather than four. The remaining two conditions can then be compared to each other, and to the first two, in a subsequent study. Third, you
could bootstrap the analyses or statistics of interest based on the available data/
sample (see Larson-Hall & Herrington, 2010; Plonsky et al., in press; LaFlair et al.,
Chapter 4 in this volume).
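Bootstrapping is taken up in detail in Chapter 4; as a minimal sketch (the scores below are made up purely for illustration), a percentile bootstrap CI for a mean difference can be obtained by resampling each group with replacement:

```python
import random

def bootstrap_mean_diff_ci(group1, group2, reps=5000, conf=0.95, seed=42):
    """Percentile bootstrap CI for the difference between two group means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        resample1 = rng.choices(group1, k=len(group1))  # sample with replacement
        resample2 = rng.choices(group2, k=len(group2))
        diffs.append(sum(resample1) / len(resample1)
                     - sum(resample2) / len(resample2))
    diffs.sort()
    lo = diffs[int(reps * (1 - conf) / 2)]
    hi = diffs[int(reps * (1 + conf) / 2) - 1]
    return lo, hi

# Hypothetical scores for two small groups (observed mean difference = 3.75)
treatment = [14, 18, 16, 22, 19, 15, 21, 17]
control = [12, 15, 13, 16, 14, 12, 17, 13]
low, high = bootstrap_mean_diff_ci(treatment, control)
print(round(low, 2), round(high, 2))
```

The width of the resulting interval, rather than a yes/no verdict, then communicates how precisely the small sample has pinned down the difference.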
However, even if we were able to adequately address the multifaceted “power
problem” in L2 research, we would still be relying on the flawed notion of statis-
tical significance. More specifically, a proper understanding and use of statistical
power can help the field overcome, at least in part, the unreliability of NHST.
The other problems, however, remain. Consider Cumming’s (2012) comments
on this issue:
I’m ambivalent about statistical power. On the one hand, if we’re using
NHST, power is a vital part of research planning . . . On the other hand,
power is defined in terms of NHST, so if we don’t use NHST we can
ignore power and instead use precision for research planning . . . However,
I feel it’s still necessary to understand power . . . partly to understand NHST
and its weaknesses. . . . although I hope that, sometime in the future, power
will need only a small historical mention.
(p. 321)
To be clear, I am not suggesting that sample size does not matter. Larger sam-
ples will yield less sampling error and, thus, greater precision in our results. The
point here, though, is that the notion of statistical power as a means to reliably
detect small p values is only relevant within the (flawed) NHST framework. As an
alternative, I argue in the next section that thorough use of descriptive statistics,
including effect sizes and CIs, can and should replace much of the statistical test-
ing in L2 research.
Effect Sizes
The focus up to this point in the chapter has been somewhat negative. I have
essentially been describing problematic trends and practices in the field. In this
section I describe a way forward that helps us to address and improve on these
practices by relying on effect sizes in place of NHST. In doing so, I want to
address three fundamental questions: (a) What are effect sizes, and how do we
calculate them? (b) Why should we use effect sizes? (That is, how is this approach
an improvement on current quantitative data practice?) (c) How can we interpret
effect sizes?
d = (M1 − M2) / SD
The difference between means (the numerator) is divided by the pooled stan-
dard deviation or that of a control or baseline group, depending on whether the
groups have equal variance (see Cumming, 2012). This calculation can be done by
hand, but there are also numerous online calculators and Microsoft Excel macros
developed for this purpose. (Unfortunately and inexplicably, SPSS does not cur-
rently provide Cohen’s d in the output from tests comparing mean scores.) I often
use the calculator developed by David B. Wilson that can be downloaded freely
here: http://mason.gmu.edu/~dwilsonb/downloads/ES_Calculator.xls. Figure 3.2
shows how user-friendly macros such as this one are. The user simply enters the
groups’ means, standard deviations, and sample sizes. The effect size here is d = .85,
which is based on the sample data I used earlier to show the unreliability of
p values. A similar calculator freely available through the Centre for Evaluation and
Monitoring is also available here: http://www.cem.org/effect-size-calculator. This
calculator has the added advantage of providing CIs around the d value. We can
see in Figure 3.3, for example, that the standardized mean difference, which we
observed at .85, is likely between .41 and 1.27 in the population. Finally, Hedges’ g,
a variant of Cohen’s d, also expresses mean differences and is useful in that it applies
a correction for biased effects due to small samples, which are often found in L2
research (Plonsky, 2013).
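The same arithmetic these calculators carry out is easy to reproduce directly. The sketch below uses the standard size-weighted pooled-SD formula, the usual small-sample correction for Hedges' g, and the common large-sample approximation of the standard error of d for the CI (the descriptives are hypothetical):

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d with a pooled standard deviation weighted by group sizes."""
    pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                       / (n1 + n2 - 2))
    return (m1 - m2) / pooled

def hedges_g(d, n1, n2):
    """Hedges' g: Cohen's d corrected for small-sample bias."""
    return d * (1 - 3 / (4 * (n1 + n2) - 9))

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for d via its large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical descriptives for two groups of 25 learners each
d = cohens_d(m1=80.0, sd1=10.0, n1=25, m2=72.0, sd2=10.0, n2=25)
print(round(d, 2))                    # 0.8
print(round(hedges_g(d, 25, 25), 2))  # 0.79 -- slightly smaller than d
print([round(x, 2) for x in d_confidence_interval(d, 25, 25)])  # [0.22, 1.38]
```

With groups this small, the CI around d is wide, which is precisely the information a bare p value hides.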
Though not often viewed this way, correlations such as Pearson’s r are another
type of effect size. This index, which ranges from −1 to +1, is likely very familiar
FIGURE 3.3 Screenshot of effect size calculator for Cohen’s d with CIs
FIGURE 3.6 Output for linear regression with CIs for correlation
you need to run the ANOVA through the General Linear Model drop-down
menu: Analyze > General Linear Model > Univariate. This procedure will
produce an ANOVA. To request an eta-squared value as part of the output, click
the Options button and check the box for Estimates of effect size. An eta-squared
value will then be provided in the column labeled as such. Note also that this
value for the overall result (“Corrected model”) will be identical to the R2 value
provided as a footnote underneath the output (another remnant of the fact that
ANOVA is actually a type of regression, falling under the larger family of general
linear models; see Cohen, 1968).
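That equivalence reflects how eta-squared is defined: the between-groups sum of squares divided by the total sum of squares. A minimal sketch with made-up data:

```python
def eta_squared(groups):
    """Eta-squared for a one-way design: SS_between / SS_total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    # Between-groups SS: each group's squared deviation from the grand mean,
    # weighted by group size
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2
                     for g in groups)
    return ss_between / ss_total

# Three hypothetical groups of scores
groups = [[3, 4, 5, 4], [5, 6, 7, 6], [4, 5, 6, 5]]
print(round(eta_squared(groups), 3))  # 0.571
```

Because SS_between / SS_total is exactly R² for the corresponding regression, this hand computation will match the value SPSS reports for the "Corrected model" row.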
There are several additional types of effect size indices for different types of data
and analyses. For categorical or frequency data, researchers may turn to phi and Cramér's V. Another option for categorical data is a simple percentage. Though
not traditionally regarded as an effect size, percentages certainly comply with our
earlier definition and, more importantly, they are very easy both to calculate and to
interpret. A final effect size commonly used with categorical data is the odds ratio.
This index, which compares the odds of a (binary) outcome across two conditions, is particularly useful in conjunction with logistic regression.
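As a quick sketch with a hypothetical 2 × 2 table (say, pass/fail counts under two instructional conditions), the odds ratio is simply the ratio of the two conditions' odds:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:
                 outcome+  outcome-
    condition 1     a         b
    condition 2     c         d
    """
    return (a / b) / (c / d)

# Hypothetical counts: 30 of 40 pass under condition 1; 15 of 40 under condition 2
print(round(odds_ratio(30, 10, 15, 25), 2))  # 5.0
```

Here the odds of passing are five times greater under the first condition, an interpretation that maps directly onto the exponentiated coefficients of a logistic regression.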
TABLE 3.5 General benchmarks for interpreting d and r effect sizes in L2 research
Indeed, there are a number of additional factors that merit consideration when
interpreting effect sizes. Most critically, researchers must provide an explanation
of what the particular numerical effects they observe mean in the context of their
domain. Others factors, discussed at length in Plonsky and Oswald (2014), include
(a) effects found in previous studies in the same subdomain; (b) mathematical
readings of effect sizes (see Plonsky & Oswald, 2014, pp. 893–894); (c) theoreti-
cal and methodological maturity of the domain in question; (d) research setting
(e.g., lab vs. classroom); (e) practical significance; (f ) publication bias in previous
research; (g) psychometric properties and artifacts; and (h) other methodological
features.
1. Calculate the mean score by typing in the following in the first empty cell at
the bottom of the column of data you are interested in: =AVERAGE(X:Y),
where X and Y refer to the top and bottom cells of data (be sure to exclude
any header rows).
2. In the cell immediately below the mean score, calculate the standard deviation for the set of scores: =STDEV(X:Y), where X and Y are the same as in step 1.
3. In the cell immediately below the standard deviation, calculate the interval
that will be added and subtracted from the mean score to construct the
CI: =CONFIDENCE.NORM(alpha,SD,N). The alpha field here is usually
.05, corresponding to a 95% CI, but could easily be adjusted; for a 90% CI,
for example, this value would be .1. In the SD field of this formula, simply
type in the name of the cell where that value was calculated in step 2 (e.g.,
U55). And the N field refers to the number of data points/cases/observations
in the sample.
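The half-width Excel returns in step 3 is z × SD / √N, where z is the critical value for the chosen alpha. A sketch of the equivalent computation in Python (the SD used here is assumed for illustration; it is not reported in the conference-abstract example below):

```python
import math
from statistics import NormalDist

def ci_half_width(alpha, sd, n):
    """Equivalent of Excel's =CONFIDENCE.NORM(alpha, SD, N):
    the margin added to and subtracted from the mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g., 1.96 for a 95% CI
    return z * sd / math.sqrt(n)

# Hypothetical values: mean of 3.64 on a 1-5 scale, assumed SD of .65, N = 287
m, sd, n = 3.64, 0.65, 287
half = ci_half_width(0.05, sd, n)
print(round(m - half, 2), round(m + half, 2))  # 3.56 3.72
```

As with the Excel route, the interval narrows as N grows, since the half-width shrinks with √N.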
There are many ways to interpret CIs (see Cumming, 2012), but their primary
purpose is to help us situate mean scores in the context of the many other possible
values that might represent the true population score (as opposed to that of the
sample). As Carl Sagan (1996) put it, CIs are “a quiet but insistent reminder that
no knowledge is complete or perfect” (pp. 27–28). As with standard deviations,
considering the CIs around our mean scores, numerically and/or visually, helps
us avoid the temptation to view our samples and their mean scores as absolute.
In the case of abstract ratings for this particular L2 research conference, we can
see in Figure 3.7 that the mean score is 3.64 (on a scale of 1–5) with 95% CIs of
[3.56, 3.71]. (CIs are typically reported in brackets.) The width of the interval is
quite narrow, which is likely due to the relatively large sample (N = 287). Con-
sequently, assuming these data are based on a valid and reliable instrument, we
can be fairly confident that our point estimate of 3.64 is very close to the true
population mean for scores at this conference.
CIs can also be used to indicate whether the difference between a pair of mean
scores is statistically significant and whether that difference is stable. This infor-
mation is also quite easy to access: We simply check to see whether the mean of
one group falls within or outside the CI for the other group’s mean. We can try
this out using the abstract data set. Let overall score here be the dependent vari-
able and let the presence of one or more errors be a dichotomous independent
FIGURE 3.7 Output for descriptive statistics produced through Explore in SPSS
variable. The menu sequence using SPSS is, again, Analyze > Descriptive Sta-
tistics > Explore. This time, however, we will move the “Errors” variable into
the “Factor list” box. As we can see in Figure 3.8, the mean score for the “no
errors” group (3.68) does not fall within the CI for the “error(s) present” group
[3.23, 3.60] and vice versa, thus indicating that the difference between these two
means is statistically different. We can also calculate the effect size for the differ-
ence between these groups using one of the tools described earlier: d = .40.
Though it is not strictly necessary, we could confirm this result by running
an independent samples t test, which would produce a t value of 2.62 with an
associated p value of .009. An advantage to following up our analysis based on
CIs with a t test is that the SPSS output will also provide a CI around the mean
difference, which can help us better understand how stable it is. In this particular
case, the mean difference between the two groups is .26, and the CI associated
FIGURE 3.8 Descriptive statistics and CIs for abstracts with vs. without errors
with that difference is [.07, .46]. Yet another confirmation of the statistical difference between these mean scores is that the CI around the mean difference does not cross 0. What is perhaps more interesting is to note that the CI
is somewhat narrow, indicating that our point estimate for the difference (.26) is
rather stable and reliable. If the CI had been much larger relative to the five-point
scale, say [.20, 3.9], we would have less certainty—that is, confidence—in our
observed mean difference. For a number of worked examples and practice inter-
preting CIs, see Cumming (2012) and, in the context of L2 research, Larson-Hall
and Plonsky (2015, p. 135).
Finally, it is not sufficient to simply calculate and examine a full set of descrip-
tive statistics when analyzing quantitative data. Such results also need to be made
available in published reports and/or appendices to justify interpretations and to
enable consumers of L2 research to draw their own conclusions as well. More
complete reporting of data also assists in meta-analyses and other synthetic efforts.
For these reasons and in line with the APA (2010), all mean-based analyses should
be reported, at a minimum, with their associated means, standard deviations, CIs,
and effect sizes (again, see Larson-Hall & Plonsky, 2015).
Looking Forward
The impetus behind this chapter—the entire volume, really—is to improve and
advance L2 research practices. Toward that end, I’d like to propose a revised model
of L2 research (Figure 3.9) both as a point of contrast with the descriptive model
in Figure 3.1 and as a suggestion for how our individual and collective research
efforts ought to proceed.
FIGURE 3.9 Revised model of L2 research (first step: “Conduct a study (e.g., the effects of A on B)”)
As with the model currently in place, the process begins when a researcher
conducts a study. Unlike the current model, however, assuming the study is well
designed, the importance of the study’s findings and its likelihood of getting pub-
lished do not hinge on the flawed notion of statistical significance. Rather, both
statistical and practical significance are considered and interpreted, and the results
of the study and others in the domain are brought together via research synthesis
and meta-analysis. By embracing a synthetic research ethic both at the primary
and secondary levels, the domain in question is able to arrive at a view of the rela-
tionships or effects in question that is more reliable, thereby enabling L2 theory
and practice to be more accurately informed by empirical efforts.
Further Reading
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and
meta-analysis. New York: Routledge.
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd
ed.). Washington, DC: American Psychological Association.
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS.
Chapter 4. New York: Routledge.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psy-
chology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Discussion Questions
1. Summarize, in your own words, the main arguments against the use of p
values and, conversely, in favor of “estimation thinking” and effect sizes.
Can you think of any counterarguments or situations in which the NHST
approach might be preferable or even justifiable?
2. Considering the current place of NHST and effect sizes in quantitative L2
research, what changes would you suggest to the field?
3. The subtitle of this chapter (“A back-to-basics approach to advancing quantitative methods in L2 research”) implies that power and statistical vs. practical significance have been around for a while. If this is the case, why have we as a field been so slow to embrace these notions in our research practices?
Notes
1. These values also assume a normal distribution; variance must also be considered in
calculating power and effect sizes.
2. However, the width of CIs for effect sizes is influenced by sample size.
References
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70, 426–443.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J. (1994). The earth is round (p &lt; .05). American Psychologist, 49, 997–1003.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and
meta-analysis. New York: Routledge.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language
researchers. Second Language Research, 28, 369–382.
Egbert, J., & Plonsky, L. (in press). Success in the abstract: Exploring linguistic and stylistic
predictors of conference abstract ratings. Corpora.
Ellis, N. C. (2000). Editorial statement. Language Learning, 50, xi–xiii.
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical
power analysis program for the social, behavioral, and biomedical sciences. Behavior
Research Methods, 39, 175–191.
Granena, G. (2013). Individual differences in sequence learning ability and second lan-
guage acquisition in early childhood and adulthood. Language Learning, 63, 665–705.
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd
ed.). Washington, DC: American Psychological Association.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning,
65, Supp. 1, 125–157.
Loewen, S., Lavolette, B., Spino, L. A., Papi, M., Schmidtke, J., Sterling, S., et al. (2014).
Statistical literacy among applied linguists and second language acquisition researchers. TESOL Quarterly, 48, 360–388.
Mackey, A., & Sachs, R. (2012). Older learners in SLA research: A first look at working
memory, feedback, and L2 development. Language Learning, 62, 704–740.
Norris, J. M. (in press). Statistical significance testing in second language research: Basic
problems and suggestions for reform. Language Learning, 65, Supp. 1.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50(3), 417–528.
Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language
learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on lan-
guage learning and teaching (pp. 3–50). Amsterdam: John Benjamins.
Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and
challenges. Annual Review of Applied Linguistics, 30, 85–110.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodologi-
cal synthesis and call for reform. Modern Language Journal, 98, 450–470.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., Egbert, J., & LaFlair, G. T. (in press). Bootstrapping in applied linguistics: Assess-
ing its potential using shared data. Applied Linguistics.
Plonsky, L., & Oswald, F. L. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Porte, G. (2010). Appraising research in second language learning: A practical approach to critical
analysis of quantitative research (2nd ed.). Philadelphia/Amsterdam: John Benjamins.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of
knowledge in psychological science. American Psychologist, 44, 1276–1284.
Sagan, C. (1996). The demon-haunted world. New York: Random House.
Stukas, A. A., & Cumming, G. (in press). Interpreting effect sizes: Towards a quantitative
cumulative social psychology. European Journal of Social Psychology.
Taylor, A.M., Stevens, J. R., & Asher, J. W. (2006). The effects of explicit reading strategy
training on L2 reading comprehension: A meta-analysis. In J. M. Norris & L. Ortega
(Eds.), Synthesizing research on language learning and teaching (pp. 213–244). Amsterdam:
John Benjamins.
Thompson, B. (1992). Two and one-half decades of leadership in measurement and evalu-
ation. Journal of Counseling and Development, 70, 434–438.
4
A PRACTICAL GUIDE TO
BOOTSTRAPPING DESCRIPTIVE
STATISTICS, CORRELATIONS,
T TESTS, AND ANOVAS
Geoffrey T. LaFlair, Jesse Egbert, and Luke Plonsky
audience possible, we explain this process for both SPSS and R. The chapter concludes with suggestions for further reading and a set of discussion questions, both
meant to build and extend on the chapter. For readers who are interested in a
more thorough overview of conducting statistical analyses in R, we would direct
them to Larson-Hall (2015).
Conceptual Motivation
A number of reviews of quantitative L2 research have found that means-based
analyses such as t tests and ANOVAs dominate the analytical landscape (Gass,
2009; Lazaraton, 2005; Plonsky, 2013; Plonsky & Gass, 2011). This practice is not
necessarily problematic. However, such analyses are useful and meaningful only
given (a) the data conform to a set of statistical assumptions and (b) sufficient
statistical power (i.e., the ability to detect a statistically significant effect, when
present), both of which are often lacking (Phakiti, 2010; Plonsky, 2013; Plonsky &
Gass, 2011). The bootstrapped equivalents of these tests provide nonparametric
alternatives that do not make such strong assumptions about the distributions of
the data (Davison & Hinkley, 1997).
Before going on, we recognize of course that other procedures have been
designed to provide nonparametric equivalents to t tests and ANOVAs, such as
the Kruskal-Wallis and Mann-Whitney U tests. However, simulation research
carried out in the field of applied statistics has revealed that bootstrapped analyses
nearly equal their parametric equivalents in power and accuracy, when statistical
assumptions such as normality are met; when the data are not normally distributed, bootstrapped analyses provide greater statistical power (Lansing, 1999; Lee & Rogers, 1998; Tukey, 1960; Wilcox, 2001), meaning that bootstrapping can provide researchers with a method for accurately estimating their parameters of interest (e.g., differences in means and accompanying test statistics).
Whether or not the data conform to the requirements of parametric tests, the
sample sizes typical of L2 research provide perhaps the most compelling reason to
employ bootstrapping in place of or in addition to traditional tests. More specifi-
cally, quantitative analyses in L2 research are severely limited by the small samples
typically employed. Methodological reviews of quantitative research in the inter-
actionist tradition (Plonsky & Gass, 2011; K = 174) and the L2 domain more
generally (Plonsky, 2013; K = 606), for example, found average group/sample n
sizes of just 22 and 19, respectively. Furthermore, post hoc power calculated based
on these data and their corresponding effect sizes was only .56 and .57—that is,
slightly better than a coin toss. By resampling from the observed data, bootstrap-
ping enables researchers to obtain a data set that simulates a sample much larger
than what is typically found, simulating Ns in the thousands. Put another way,
bootstrapping provides researchers with the opportunity to overcome the lack of
statistical power and Type II error (failing to reject the null hypothesis when the
alternative is true) resulting from analyses based on small samples.
48 Geoffrey T. LaFlair et al.
reported statistically significant results in the original reports were not replicated
according to the bootstrapped analyses (i.e., a Type I error misfit five times higher
than an alpha of .05). Interestingly, all four misfits achieved a post hoc power of
.99, suggesting that traditional hypothesis testing coupled with very large samples
may overestimate the importance of an effect. Put another way, if the sample is
large enough, p values of less than .05 can always be obtained, regardless of the
actual difference between group means. Based on the results, we argue in favor
of the use of bootstrapping, not as a replacement for but in conjunction with
parametric statistics, particularly when (a) samples are especially small (in order
to increase power), (b) samples are especially large (in order to offset statistically
significant results that are due to large samples rather than strong effects), (c) the
data violate one or more assumptions such as normality, and (d) when any one or
more of these situations occurs in analysis of pilot data that will be used as a basis
for collecting more data. Echoing our colleagues (e.g., Norris & Ortega, 2000,
2006; Larson-Hall, 2015; Nassaji, 2012; Plonsky, 2011, 2013), we also argued for
a diminished role of the flawed and unreliable practice of statistical significance
testing and instead for a greater emphasis on descriptive statistics—namely means,
standard deviations, CIs, and effect sizes.
By now we hope to have made clear the potential of bootstrapping as a tool
for overcoming some of the challenges facing quantitative data and data analysis
in L2 research. However, it is important to note that this does not replace the
need for good design, large samples, or replicating our experiments. In the section
that follows, we describe the steps involved in running bootstrapped equivalents
of some of the most common analyses found in the field: descriptive statistics, t
tests, ANOVAs, and correlations.
Bootstrapping in Practice
This section of the chapter presents the step-by-step processes for conducting
simple bootstrapping with descriptive statistics, correlations, t-tests, and ANOVAs
in both SPSS and R. It is organized first by software program and then by statistic.
The reason that this part of the chapter is separated by software program is the difference in flexibility between the two programs. The bootstrapping options that
are available in the SPSS interface are somewhat limited. As you will see in the
one-way ANOVA example, SPSS bootstraps the CIs for all pairwise comparisons
(much like a Tukey post hoc analysis of an ANOVA). However, R offers the ability
to bootstrap any statistic of interest. In the one-way ANOVA section in R, you
will learn how to bootstrap the pairwise comparisons (as in SPSS) in addition to
the omnibus F-statistic and its corresponding effect size (eta-squared).
R can require some effort to learn because to utilize it to its full capabilities
it is necessary to learn the R programming language. Many researchers may
not need its full capabilities or may not be able to commit to learning how to
program in R. However, the effort put into learning how to use it will be rewarded.
50 Geoffrey T. LaFlair et al.
Considerations in Bootstrapping
Before we begin the step-by-step procedures we need to discuss four decision
points when conducting a bootstrap analysis:
assigned to one of two groups: a treatment group or a control group. Because the
data are homogeneous and have been randomly sampled, simple resampling would
be most similar to how the data were collected (Davison & Hinkley, 1997). If you
are working with a set of data that is drawn from two considerably different
subpopulations, you should use a stratified resampling procedure. An example of this
would be in the comparison of treatment effects on two different subpopulations
such as native speakers of a language and nonnative speakers of the same language.
In this method, simple case resampling is applied within each stratum. A third
resampling procedure is resampling the residuals (or errors) of a fitted model. This
is considered a semiparametric approach to resampling because the data are fit to
a parametric model (e.g., regression or ANOVA); however, the resampling is still
conducted using nonparametric procedures (Carpenter & Bithell, 2000). Resampling
residuals adjusts the value of each observation with a randomly sampled
residual—or the distance between an observation and the estimated parameter
value such as the sample mean. This method assumes homogeneity of variance.
Other resampling methods exist for other situations (e.g., non-homogeneous
variance; see Davison & Hinkley, 1997).
In the SPSS examples, all bootstrapped analyses have been performed using
the simple resampling method. In R, bootstrapped analyses of descriptive statis-
tics, correlations, and t-tests were performed using the simple resampling method.
To illustrate how residual resampling is conducted, this method was used for the
bootstrapped analyses for the ANOVA parameters (so we are assuming that the
residuals are homoscedastic).
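As a rough illustration of how these first choices surface in the boot package (the toy data and variable names here are our own inventions, not from the analyses reported below), simple case resampling passes resampled row indices to the statistic function on every replicate, while stratified resampling is requested through boot's strata argument:

```r
library(boot)
set.seed(42)

# Toy data standing in for two subpopulations (e.g., NS vs. NNS)
scores <- data.frame(attitude = rnorm(30, mean = 3.5, sd = 1.5),
                     group = factor(rep(c("NS", "NNS"), each = 15)))

# Simple case resampling: boot hands the statistic function the data
# and a vector of resampled row indices (i) on each replicate
mean_stat <- function(d, i) mean(d$attitude[i])
b_simple <- boot(data = scores, statistic = mean_stat, R = 2000)

# Stratified resampling: cases are resampled within each stratum
b_strat <- boot(data = scores, statistic = mean_stat, R = 2000,
                strata = scores$group)
boot.ci(b_strat, conf = 0.95, type = "bca")
```

The only difference between the two calls is the strata argument; the statistic function itself is unchanged.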
The third of these introductory decision points encountered when conducting
bootstrapped analyses involves the calculation of CIs. One of the goals
of bootstrapping is to estimate accurate CIs for the statistic of interest that would
closely match an exact CI for the population. A number of methods are available
and a discussion of their strengths and weaknesses is beyond the scope of this
chapter (see Davison & Hinkley, 1997, and DiCiccio & Efron, 1996, for further
discussion). Generally, the BCa method is more accurate in a wide variety of situ-
ations (Carpenter & Bithell, 2000; Chernick, 1999; Crawley, 2007; DiCiccio &
Efron, 1996). BCa stands for “bias corrected and accelerated,” and this method
adjusts CIs for skewness (bias-corrected) and nonconstant variance (accelerated)
in the bootstrapped data sets. In this chapter we will be reporting BCa intervals
from both SPSS (which offers percentile and BCa intervals) and the boot package
in R (which offers five types of intervals).
The fourth consideration is bootstrap diagnostics. Canty, Davison, Hinkley, and
Ventura (2006) provide a detailed overview of four diagnostic methods to assess
the reliability of the bootstrap calculations. The procedure covered in this chapter
is jackknife-after-boot, which is useful for investigating the effect of outliers on
the bootstrapped calculations. This examines the effects of individual cases on
bootstrap samples by plotting the quantiles of the bootstrap distribution with each
case removed. The jackknife-after-boot plot shows how much an individual case
affects the bootstrap statistic (Chernick, 1999; Davison & Hinkley, 1997).
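In the boot package this diagnostic is provided by the jack.after.boot function; a minimal sketch (with invented toy data, not the belief data analyzed below) might look like this:

```r
library(boot)
set.seed(7)

# A skewed toy sample in which outliers could distort the bootstrap
y <- rexp(40, rate = 0.5)
b <- boot(data = y, statistic = function(d, i) mean(d[i]), R = 2000)

# Jackknife-after-boot: plots quantiles of t* recomputed with each
# case deleted in turn, flagging cases that pull the distribution
jack.after.boot(b)
```

Cases whose points deviate sharply from the quantile lines in the resulting plot are candidates for closer inspection as influential observations.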
To obtain bootstrapped CIs for descriptive statistics in SPSS, select Analyze > Descriptive Statistics > Explore.
• As shown in Figure 4.1, move “Students should aspire to speak like native
speakers” to the Dependent List box.
• Move “Participant L1” to the Factor List box.
• Click Statistics in the upper right corner.
• Click Bootstrap.
As shown in Figure 4.2, select Perform bootstrapping.
Enter “10000” into the Number of Samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the Explore dialogue box.
Bootstrap Specifications
Figure 4.3 shows the SPSS settings used for the bootstrap that was performed.
In this case, the table indicates that (a) we set Sampling Method to Simple rather
than Stratified because we resampled (with replacement) from the entire data set
rather than from within each group separately, (b) we resampled 10,000 times,1
and (c) we used a bias-corrected and accelerated 95% CI.
The Descriptives table (Figure 4.4) shows the results of the bootstrapped
CIs for the mean and standard deviation of responses to “Students should aspire to
speak like native speakers” grouped according to participant L1 background. The
first two columns contain the mean values and their standard errors for a variety
of descriptive statistics. The four columns on the right contain the BCa bootstrap
results, including 95% CIs for each of the statistics as well as their respective biases
and standard errors.
These results show some variation in teacher beliefs across the three L1 groups.
A comparison between the results in “95% CI for Mean” and “BCa 95% CI” for
the mean reveals some small differences between the width and endpoints of the
bootstrap CI and the original CI. These results also show the bias, which is the
difference between the average of the bootstrap statistics and the original statistic
(e.g., the difference between the original estimate of the mean and the mean of
the bootstrapped samples).
Descriptive Statistics in R
Before any analysis, we need to get the data into R. The first step is to make sure
that you have set your working directory for R to the location of your data, or
type in a file path as in the screenshot that follows. (Note that here and through-
out the chapter bolded text in the Courier New font denotes a command, as does
bolded text in the regular body font; nonbolded Courier New font is the output
produced by R.) Setting the working directory can be done from the drop-down
menus in the R interface or through the command line (using the setwd command).
The next step is to read in the data (using a read command such as read.csv), and then
to take a quick look at the data frame. By using the head() command we can
see that we will be using the same data set and data structure for the examples
in R as we are in SPSS. The str command allows us to see the structure of our
variables. In the code sample we can see that the second line of code changed the
Participant L1 variable into a factor with three levels: English, Vietnamese, and
Spanish. We will be using this data frame for each of the analyses and will call on
subsets of the variables depending on the analysis.
FIGURE 4.4 Descriptive statistics table with bootstrapped 95% CIs for various
descriptive statistics (columns: Participant L1, Statistic, Std. Error; Bootstrap: Bias,
Std. Error, BCa 95% CI Lower and Upper). Unless otherwise noted, bootstrap results
are based on 10,000 bootstrap samples.
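A hedged reconstruction of the commands just described (the working directory and file name are assumptions to be adapted to your own setup) would be:

```r
# Point R at the folder holding the data, or use the GUI menu instead
setwd("~/bootstrapping")                 # assumed path
belief <- read.csv("belief.csv")         # assumed file name

# Second line of code: turn Participant L1 into a three-level factor
belief$L1 <- factor(belief$L1,
                    levels = c("English", "Vietnamese", "Spanish"))

head(belief)   # first six rows of the data frame
str(belief)    # structure of the variables
```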
> head(belief)
ID L1 L2_months Attitude
1 1 English 250 1
2 2 English 6 1
3 3 English NA 6
4 4 English 24 3
5 5 English 60 3
6 6 English 3 5
> str(belief)
'data.frame': 90 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 . . .
$ L1 : Factor w/ 3 levels "English","Vietnamese",..:
1 1 1 1 1 1 1 . . .
$ L2_months: int 250 6 NA 24 60 3 120 4 120 12 . . .
$ Attitude : int 1 1 6 3 3 5 3 5 4 3 . . .
To retrieve the CIs for each of the moments we can call them one at a time as
illustrated in MEng.ci, or we can write a short “for loop” that will put them all
in a data frame for us (DESci.s and Dci).
> print(MEng.ci)
CALL:
boot.ci(boot.out = DESboot, conf = 0.95, type = "bca",
t0 = DESboot$t0[1],
t = DESboot$t[, 1])
Intervals :
Level BCa
95% (2.500, 3.625)
Calculations and Intervals on Original Scale
The next code sample illustrates a “for loop” in R that will collect the lower
and upper ends of the CIs from the boot.ci object and then put them in a data
frame with the original statistic.
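One way such a loop might be written (assuming DESboot is the boot object and that its t matrix holds one column per statistic, in the group-by-moment order shown below) is:

```r
# Names for the twelve bootstrapped statistics, in column order
stats <- c(paste("English",    c("mean", "sd", "skew", "kurt"), sep = "."),
           paste("Vietnamese", c("mean", "sd", "skew", "kurt"), sep = "."),
           paste("Spanish",    c("mean", "sd", "skew", "kurt"), sep = "."))

DESboot.CI <- data.frame(t0 = DESboot$t0, lwr = NA, upr = NA,
                         row.names = stats)
for (j in seq_along(stats)) {
  ci <- boot.ci(DESboot, conf = 0.95, type = "bca",
                t0 = DESboot$t0[j], t = DESboot$t[, j])
  # The BCa endpoints sit in the 4th and 5th positions of ci$bca
  DESboot.CI[j, c("lwr", "upr")] <- ci$bca[4:5]
}
print(DESboot.CI)
```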
> print(DESboot.CI)
t0 lwr upr
English.mean 3.06666667 2.5000000 3.6250000
English.sd 1.59597194 1.3615271 1.8825143
English.skew 0.04501668 -0.4550728 0.5788735
English.kurt 1.77394466 1.3456360 2.3605649
Vietnamese.mean 4.13333333 3.4855947 4.6923077
Vietnamese.sd 1.65536397 1.4104068 1.9984294
Vietnamese.skew -0.35456522 -0.9802685 0.1727889
Vietnamese.kurt 1.88111459 1.4385776 2.7561735
Spanish.mean 3.63333333 2.9677419 4.2121212
Spanish.sd 1.73171897 1.4879193 2.0225216
Spanish.skew -0.26918104 -0.8334546 0.3219356
Spanish.kurt 1.70501835 1.3560934 2.5062921
We can see from the output that the results of the bootstrapped analysis in R are
slightly different from the results of the SPSS analysis. However, the general results
are the same. This will be the case for any repeated bootstrapped procedure because
different random samples are drawn each time a bootstrap analysis is conducted.
We can see that the
Vietnamese group differs the most from the other two L1 groups in their mean
beliefs about whether or not students should aspire to speak like native speakers.
To obtain bootstrapped CIs for a Pearson correlation, select Analyze > Correlate > Bivariate.
• Move “L2 months of study” and “Students should aspire to speak like native
speakers” to the Variables box.
• Click Bootstrap.
Select Perform bootstrapping.
Enter “10000” into the Number of samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the Bivariate Correlations dialogue box.
The Correlations table from SPSS contains the Pearson Correlation coefficient,
significance level, and sample size information for the original, non-bootstrapped
data set. Like the Descriptives table, it also contains the bias, standard error, and
BCa 95% CI for the bootstrap correlation coefficients. The original results show
a small, nonsignificant positive correlation between “L2 months of study” and
teachers’ beliefs. The results of the bootstrap CI for the Pearson Correlation
[Figure 4.5: Correlations output table for “L2 months of study” and “Students
should aspire to speak like native speakers,” with bootstrapped 95% CIs]
coefficient show a notably wide interval, ranging from –.05 to .41. The results also
show a slight negative bias. This would indicate a lack of confidence in the accuracy
or stability of the original estimate of the correlation coefficient.
Again, when we use the boot function, we first enter our data (belief ). The
boot function will read the data set and identify the variables that we indicated
in the function earlier (“L2_months” and “Attitude”). Again, the results for the
estimated bias, standard errors, and CIs are similar to the SPSS results. Now that
we are familiar with the numerical output of a bootstrap analysis, we will plot the
10,000 bootstrap correlation coefficients and their normal Q-Q plot.
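The Corstat function referred to in the call below is not printed in full; a sketch consistent with the data frame above (note the NA values in L2_months, which need to be dropped) could be:

```r
# Correlation between months of L2 study and attitude, recomputed
# on each bootstrap resample; incomplete pairs are dropped
Corstat <- function(d, i) {
  cor(d$L2_months[i], d$Attitude[i], use = "complete.obs")
}
CORboot <- boot(data = belief, statistic = Corstat, R = 10000)
boot.ci(CORboot, conf = 0.95, type = "bca")
```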
Call:
boot(data = belief, statistic = Corstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 0.1984088 -0.004074199 0.1120397
CALL :
boot.ci(boot.out = CORboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (-0.0449, 0.3955)
Calculations and Intervals on Original Scale
We can see that the results of bootstrapping the correlation coefficient in SPSS
and in R are very similar. In R, we can also plot the bootstrapped samples and
the Q-Q plot to assess whether or not the sampling distribution follows a normal
distribution. In the plot on the left in Figure 4.6, the value of the original
correlation coefficient is marked with a vertical dashed line. This plot, taken together
with the information from the bootstrapped CI, shows that the sampling correlation
coefficient is very likely going to be small, and possibly 0. The Q-Q plot and
the accompanying histogram show that the samples of our statistic are normally
distributed. Because we are simulating the sampling distribution, this provides an
indication of the shape of the population distribution.
> plot(CORboot)
[Figure 4.6: histogram of the bootstrap correlation coefficients (t*, with density
on the y-axis) and their normal Q-Q plot]
To obtain bootstrapped CIs for the mean difference between two groups, select Analyze > Compare Means > Independent-Samples T Test.
• Move “Students should aspire to speak like native speakers” to the Test
Variable(s) box.
• Move “Participant L1” to the Grouping Variable box.
• Click Bootstrap.
Select Perform bootstrapping.
Enter “10000” into the Number of Samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the Independent-Samples T Test dialogue box.
The Independent Samples Test output table in Figure 4.7 contains the mean
differences for the original data set. It also contains the same bootstrap statistics as
the Descriptives and Correlation tables (figures 4.4 and 4.5): bias, standard error,
and 95% BCa CI around the bootstrapped mean difference values. In addition,
this table also includes significance values for the bootstrapped results. These
significance values can be interpreted as the proportion of the bootstrapped mean
difference values that are more extreme than the original mean difference value.
In this case, we see that about 1.5% of the bootstrapped mean difference values
were more extreme than the mean difference of –1.071 found in the original
analysis.
FIGURE 4.7 Independent-Samples Test output table with bootstrapped 95% CIs
(columns: Mean Difference; Bootstrap: Bias, Std. Error, Sig. (2-tailed), BCa 95% CI
Lower and Upper)
Since we are interested only in the mean difference between two groups in
this scenario, we will pass a subset of the data frame to the boot command that
contains only the two groups.
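A sketch of the Mdiffstat function assumed in the call below (taking rows 1–60 to hold the English and Vietnamese groups, as in the call shown) might be:

```r
# Difference between the two groups' mean attitudes on each resample
Mdiffstat <- function(d, i) {
  d2 <- d[i, ]
  mean(d2$Attitude[d2$L1 == "Vietnamese"]) -
    mean(d2$Attitude[d2$L1 == "English"])
}
Mdboot <- boot(data = belief[1:60, ], statistic = Mdiffstat, R = 10000)
```

With this ordering, a positive statistic means the Vietnamese group's mean is the larger of the two.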
Call:
boot(data = belief[1:60,], statistic = Mdiffstat,
R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 1.066667 0.0008049593 0.4103226
Again, the bootstrapped results for the mean difference, the bias, and the standard
error are similar to those in SPSS. Likewise, R produces a CI suggesting a
significant difference between the two groups’ beliefs about whether or not students
should aspire to speak like native speakers. The signs of the CI produced
by R differ from those in SPSS because the ordering of the groups was
reversed; this makes no difference to the conclusions that can be drawn
about the difference in means.
CALL :
boot.ci(boot.out = Mdboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (0.233, 1.850)
Calculations and Intervals on Original Scale
At this point, we will go one step further and check a diagnostic plot of the
jackknife-after-boot to investigate whether or not there are any individual cases
that have affected the bootstrap sample distribution.
The plots in Figure 4.8 show the original mean difference (dashed line) plot-
ted with the bootstrap mean differences in belief between English and Vietnamese
speakers. This and the CI obtained show that 0 is neither near the original estimate
nor within its 95% CI. The Q-Q plot shows that the bootstrap
values follow a normal distribution. The jackknife-after-boot shows that
there are three possible influential cases (3, 39, 59). When these are removed from
the bootstrap analysis the distance between the quantiles narrows, which creates
a slightly more peaked distribution. Without these values, it is possible that the
bootstrap CI around the original mean difference may be smaller.
FIGURE 4.8 Bootstrap mean differences, Q-Q plot, and jackknife-after-boot plot of
the mean difference between English and Vietnamese
The results of the bootstrap t-test shown next give the original test statistic
(t1* = –2.54), the bias (indicating that the average resampled test statistic was
smaller), and the standard error. The 95% BCa CI shows that 0 is not in the CI, a
standard criterion for evaluating the mean difference between two groups.
A Practical Guide to Bootstrapping 67
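The TTstat function is likewise not printed in full; one consistent sketch would recompute Welch's t on each resample (the function name and details are assumptions based on the output below):

```r
# Bootstrap the two-sample t statistic itself rather than the mean
# difference; the sign follows the factor ordering (English first)
TTstat <- function(d, i) {
  d2 <- droplevels(d[i, ])   # drop the unused Spanish level
  t.test(Attitude ~ L1, data = d2)$statistic
}
TTboot <- boot(data = belief[1:60, ], statistic = TTstat, R = 10000)
```

The droplevels call matters here: t.test's formula interface requires a grouping factor with exactly two levels, and the subset still carries the empty Spanish level.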
Call:
boot(data = belief[1:60,], statistic = TTstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* -2.540798 -0.04585189 1.072339
CALL :
boot.ci(boot.out = TTboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (-4.664, -0.464)
Calculations and Intervals on Original Scale
Figure 4.9 shows the bootstrap t-statistic values (the original estimate is marked
by a dashed line) in a histogram, a normal Q-Q plot (showing a normal
distribution), and a jackknife-after-boot plot. This plot shows how the quantiles of the
distribution change when the case marked on the bottom is removed from the
bootstrap. The purpose of this plot is to identify influential cases in the original
data set that could affect the bootstrap estimation of the sampling distribution if
the influential cases are drawn too often in the bootstrap analysis. Influential cases
would be marked by points showing large deviations from the lines that represent
the quantiles. The plot in Figure 4.9 does not show much variation in the distri-
bution when any of the cases are removed, which indicates a lack of influential
data points.
FIGURE 4.9 Plot of the bootstrap t-statistics, their Q-Q plot, and the jackknife-after-
boot plot
Pairwise Comparisons
SPSS does not currently bootstrap t-statistics or F-statistics. Bootstrapping
for ANOVAs in SPSS is limited to post hoc pairwise comparisons. To obtain
bootstrapped CIs for the post hoc pairwise comparisons of a one-way ANOVA,
select Analyze > Compare Means > One-Way ANOVA.
• Move “Students should aspire to speak like native speakers” to the Dependent
List box.
• Move “Participant L1” to the Factor box.
• Click Bootstrap.
Select Perform bootstrapping.
Enter “10000” into the Number of Samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the One-Way ANOVA dialogue box.
FIGURE 4.10 One-way ANOVA output table with bootstrapped 95% CIs
ANOVA in R
We can run a function in R that will bootstrap the pairwise comparisons, return
CIs for the mean difference, and return a nonparametric significance value (as in
SPSS).
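A hedged sketch of such a function (called Pairstat here, matching the output below) that returns all three pairwise mean differences at once:

```r
# One bootstrap replicate returns three statistics:
# t1* = Vietnamese - English, t2* = Spanish - English,
# t3* = Spanish - Vietnamese
Pairstat <- function(d, i) {
  m <- tapply(d$Attitude[i], d$L1[i], mean)
  c(m["Vietnamese"] - m["English"],
    m["Spanish"] - m["English"],
    m["Spanish"] - m["Vietnamese"])
}
PAIRboot <- boot(data = belief, statistic = Pairstat, R = 10000)
```

Because the statistic function returns a vector, boot tracks all three comparisons across the same 10,000 resamples.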
Again, the results of this analysis are similar to those from SPSS. The largest
mean difference in the beliefs is found between teachers with English as their first
language and teachers with Vietnamese as their first language (t1*). The mean
differences between English and Spanish (t2*) and between Spanish and Vietnamese
(t3*) are similar.
Call:
boot(data = belief, statistic = Pairstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 1.0666667 0.003789025 0.4197999
t2* 0.5666667 0.008612038 0.4276186
t3* -0.5000000 0.004823013 0.4359305
The next code sample creates a data frame of all CIs for the pairwise
comparisons of differences between group means. The results of the bootstrap CI
show one meaningful difference among the groups’ mean beliefs about students’
aspirations to speak like native speakers. The two groups that differed in this belief
were teachers who speak Vietnamese as a native language and teachers who speak
English as a native language (0 is not in the CI).
> print(PAIRboot.ci)
t0 lwr upr
Vietnamese-English 1.0666667 0.2333333 1.8909219
Spanish-English 0.5666667 -0.2940099 1.3590520
Spanish-Vietnamese -0.5000000 -1.3545038 0.3525119
F-Statistic
In the next function, we have fit a linear ANOVA model to the data. Then we
have created a vector of data for the residuals from the ANOVA model and a
vector of data for the predicted values from the model. The function was written
so that it randomly resamples the residuals and attaches them to the randomly
resampled cases.
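A sketch of that function (Fstat), under the residual-resampling scheme just described (here the resampled residuals are added back to the fitted values before refitting, a standard variant of this approach):

```r
# Fit the ANOVA model once on the original data
fit  <- lm(Attitude ~ L1, data = belief)
res  <- resid(fit)    # vector of residuals
pred <- fitted(fit)   # vector of predicted values

# Each replicate adds randomly resampled residuals to the predicted
# values, refits the model, and returns the omnibus F
Fstat <- function(d, i) {
  d$y.star <- pred + res[i]
  anova(lm(y.star ~ L1, data = d))$`F value`[1]
}
Fboot <- boot(data = belief, statistic = Fstat, R = 10000)
```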
For this type of test, we would reject the null hypothesis if the F-value were much
larger than 1. Therefore, because the number 1 falls within the CI shown at the
bottom of the following lines of code and their corresponding output, we would
fail to reject the null hypothesis that there is no difference between the three groups.
Call:
boot(data = belief, statistic = Fstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 3.093494 1.102791 2.823289
CALL :
boot.ci(boot.out = Fboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (0.167, 8.656)
Calculations and Intervals on Original Scale
Our results show that the original R2 value is .0664 and that the bootstrap
estimate has a bias of .0192. This means that the average R2 value of the resampled
distributions is slightly larger than the original estimate and shows that there is
some variability in the bootstrapped estimates of the effect size. This indicates that
our original estimate may not be a very accurate approximation of the effect size
of the differences. This is reflected in the bootstrap CI below. The CI for R2 is
quite wide (.0035, .167), indicating that anywhere between 0.35% and 16.7%
of the variance may be explained by the three groups in this particular ANOVA
model.
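The Rsqstat function behind this output might, under the same residual-resampling scheme, be sketched as follows (the fit is repeated here so the sketch is self-contained):

```r
# Fit the ANOVA model once on the original data
fit  <- lm(Attitude ~ L1, data = belief)
res  <- resid(fit)
pred <- fitted(fit)

# Each replicate refits the model on resampled residuals and
# returns R-squared as the effect size
Rsqstat <- function(d, i) {
  d$y.star <- pred + res[i]
  summary(lm(y.star ~ L1, data = d))$r.squared
}
Rsqboot <- boot(data = belief, statistic = Rsqstat, R = 10000)
```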
Call:
boot(data = belief, statistic = Rsqstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 0.06639327 0.01920224 0.05149469
> print(Rsqboot.ci)
CALL :
boot.ci(boot.out = Rsqboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (0.0035, 0.1673)
Calculations and Intervals on Original Scale
Example Study 1
Calzada, M. E., & Gardner, H. (2011). Confidence intervals for the mean: To bootstrap or
not to bootstrap. Mathematics and Computer Education, 45(1), 28–38.
Background
It is well-known that parametric statistical tests are inappropriate when sample
sizes are small or data is skewed. However, additional research is needed to
document the utility of nonparametric bootstrapping methods for such data with
varying sample sizes and degrees of skewness.
Research Aims
The goal of this study was to investigate whether bootstrap t CIs are superior to
Student’s t CIs in data from a range of sample sizes and with skewed distributions.
Method
Student’s t CIs (95%) and 100,000 bootstrap t CIs (95%) were generated for
simulated samples from a range of distributions (normal, Student’s t, continuous
uniform, Poisson, and gamma) and sample sizes (n = 5, 10, 15, 20, 25, 30, 35, 40,
and 45). The authors recorded the (a) percent of “correct” CIs that contained µ
(i.e., the “true” population mean) and (b) the precision or width of the CI.
Results
The results suggest that Student’s t CIs are appropriate for symmetric (i.e.,
non-skewed) data. However, bootstrap t CIs are better for skewed data with sample
sizes n ≥ 10. The authors also emphasize the effectiveness of bootstrapping for
estimating unknown means for data sets that are skewed or small.
Example Study 2
Guan, N. C., Yusoff, M. S. B., Zainal, N. Z., & Yun, L. W. (2012). Analysis of two
independent samples with non-normality using non-parametric method, data
transformation and bootstrapping method. International Medical Journal, 19(3), 218–220.
Background
Researchers commonly encounter data that is nonnormally distributed. Three
possible avenues for addressing such issues are: (a) nonparametric statistical
techniques (e.g., Mann-Whitney Test), (b) data transformations, and (c) bootstrapping.
Research Aims
The authors aimed to compare the use of nonparametric tests, data
transformation, and bootstrapping in order to measure differences between two
independent samples.
Method
The psychopathology of 202 patients was assessed using a standardized instrument
upon their discharge from a psychiatric hospital. They were then divided
into two groups based on whether or not they were readmitted to the hospital
less than six months later. The original data was found to be nonnormal using
a Kolmogorov-Smirnov Test. The authors then compared the usefulness of a
Mann-Whitney Test, log transformations, and bootstrapping (500 times) in
measuring group differences.
Results
The authors found a significant difference between the two groups using all three
methods. They suggest that all three are useful approaches to samples that fail to
meet assumptions of normality. However, they mention that one advantage of
bootstrapping over the other two methods is that it allows researchers to estimate
CIs for a range of statistics, including effect sizes.
Discussion Questions
1. What are the primary goals of bootstrapping?
2. What is the difference between simple, stratified, and residual resampling?
3. What is one advantage of conducting bootstrap analyses instead of
nonparametric techniques or transformations when data are non-normal?
4. Using the belief data set available on the companion website (http://oak.ucc.
nau.edu/ldp3/AQMSLR.html), answer the research question below. First,
conduct a traditional parametric ANOVA and any necessary post hoc analyses.
Second, bootstrap the ANOVA and check the model fit. Third, compare the
results of the bootstrap ANOVA with the traditional ANOVA. If you are using
R, perform a jackknife-after-boot diagnostic analysis. What conclusions can
you make about your data based on your bootstrapped parameters and CIs?
RQ: Is there a mean difference between beliefs of the effects of motivation on language
learning for instructors from different first-language (L1) backgrounds?
Further Reading
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge:
Cambridge University Press.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence limits. Statistical Science, 11(3),
189–228.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics,
7, 1–26.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York:
Chapman & Hall.
LePage, R., & Billard, L. (Eds.). (1992). Exploring the limits of the bootstrap. New York: John
Wiley & Sons.
Larson-Hall, J. (2012). Our statistical intuitions may be misleading us: Why we need
robust statistics. Language Teaching, 45, 460–474.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Lee, W.-C., & Rogers, J. L. (1998). Bootstrapping correlation coefficients using univariate
and bivariate sampling. Psychological Methods, 3, 91–103.
Plonsky, L., Egbert, J., & LaFlair, G. (in press). Bootstrapping in applied linguistics:
Assessing its potential using shared data. Applied Linguistics.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In I. Olkin,
S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to
probability and statistics: Essays in honour of Harold Hotelling (pp. 448–485). Stanford, CA:
Stanford University Press.
Yung, Y.-F., & Chan, W. (1999). Statistical analyses using bootstrapping: Concepts and
implementation. In R. H. Hoyle (Ed.), Statistical strategies for small sample research
(pp. 82–105). Thousand Oaks, CA: Sage.
Note
1. Chernick (1999) recommends 5,000–10,000 for most cases and reviews other methods
for estimating the number of replications needed in a bootstrap analysis (pp. 112–122).
References
Beasley, W. H., & Rogers, J. L. (2009). Resampling methods. In R. E. Millsap & A.
Maydeu-Olivares (Eds.), The Sage Handbook of quantitative methods in psychology
(pp. 362–386). London: Sage.
Canty, A. J., Davison, A. C., Hinkley, D. V., & Ventura, V. (2006). Bootstrap diagnostics
and remedies. The Canadian Journal of Statistics, 34, 5–27.
Canty, A. J., & Ripley, B. (2013). boot: Bootstrap R (S-Plus) Functions [Computer Soft-
ware]. R package version 1.3–9.
Carpenter, J., & Bithell, J. (2000). Bootstrap confidence intervals: When, which, what?
A practical guide for medical statisticians. Statistics in Medicine, 19, 1141–1164.
Chernick, M. R. (1999). Bootstrap methods: A practitioner's guide. New York: John
Wiley & Sons.
Crawley, M. J. (2007). The R book. West Sussex, England: John Wiley & Sons.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge:
Cambridge University Press.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence limits. Statistical Science, 11,
189–228.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York:
Chapman & Hall.
Gass, S. (2009). A survey of SLA research. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook
of second language acquisition (pp. 3–28). Bingley, UK: Emerald.
Keselman, H. J., Algina, J., Lix, L. M., Wilcox, R. R., & Deering, K. N. (2008). A generally
robust approach for testing hypotheses and setting confidence intervals for effect sizes.
Psychological Methods, 13, 110–129.
Komsta, L., & Novomestky, F. (2012). moments: Moments, cumulants, skewness, kurtosis
and related tests [Computer Software]. R package version 0.13. http://CRAN.R-project.
org/package=moments
Lansing, L. (1999). Bootstrapping versus the Student’s t: The problems of Type I error and power.
Unpublished master’s thesis, Lehigh University, Bethlehem, PA.
Larson-Hall, J. (2015). A guide to doing statistics in second language research using SPSS and R.
New York: Routledge.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Lazaraton, A. (2005). Quantitative research methods. In E. Hinkel (Ed.), Handbook of research
in second language teaching and learning (pp. 109–224). Mahwah, NJ: Erlbaum.
Lee, W.-C., & Rogers, J. L. (1998). Bootstrapping correlation coefficients using univariate
and bivariate sampling. Psychological Methods, 3, 91–103.
Nassaji, H. (2012). Significance tests and generalizability of research results: A case for
replication. In G. Porte (Ed.), Replication research in applied linguistics (pp. 92–115). Cam-
bridge: Cambridge University Press.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50, 417–528.
Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language
learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on
language learning and teaching (pp. 3–50). Philadelphia, PA: Benjamins.
Phakiti, A. (2010). Analysing quantitative data. In B. Paltridge & A. Phakiti (Eds.), Contin-
uum companion to research methods in applied linguistics (pp. 39–49). London: Continuum.
Plonsky, L. (2011). The effectiveness of second language strategy instruction: A meta-analysis.
Language Learning, 61, 993–1038.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., Egbert, J., & LaFlair, G. (in press). Bootstrapping in applied linguistics: Assessing
its potential using shared data. Applied Linguistics.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In I. Olkin,
S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability
and statistics: Essays in honour of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford
University Press.
Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical
Software, 40(1), 1–29.
Wilcox, R. (2001). Fundamentals of modern statistical methods: Substantially improving power and
accuracy. New York: Springer.
Wolfe, E. W., & McGill, M. T. (2011). Comparison of asymptotic and bootstrap item fit indices
in identifying misfit to the Rasch model. Paper presented at the National Conference on
Measurement in Education, New Orleans, LA.
Yung, Y.-F., & Chan, W. (1999). Statistical analyses using bootstrapping: Concepts and imple-
mentation. In R. H. Hoyle (Ed.), Statistical strategies for small sample research (pp. 82–105).
Thousand Oaks, CA: Sage.
5
PRESENTING QUANTITATIVE
DATA VISUALLY
Thom Hudson
As soon as you have collected your data, before you compute any statistics, look at
your data.
(Wilkinson and the Task Force on Statistical
Inference, 1999, p. 597)
Background
Graphical charts and tables afford important avenues for exploring data and pre-
senting vivid, transparent representations of statistical findings. Their form and
use, however, are often given little conscious attention in discussions of reporting
statistical results. This is an unfortunate state of affairs. Just as statistical reporting
requires reporting of both data centrality and dispersion for an in-depth under-
standing of data, it benefits from careful graphic representation that provides clear
visuals of how data behave. As Cleveland (1985) explains, graphic representations
can display a large amount of quantitative information in ways that can be
absorbed thoroughly, perhaps more thoroughly and immediately than through
the presentation of means, standard deviations, effect sizes, and p-values. There are
numerous graphic forms: tables, histograms, line graphs, box plots, scatter plots,
and more. The choice of which graphic type to use depends upon the type of data
(univariate, bivariate, categorical, continuous), the function (comparison,
description, exploration), and the audience.
Too often graphics are seen as an afterthought in the data analysis process.
This is despite the work by Tufte (1983, 1997, 2006), Cleveland (1985, 1993),
Kosslyn (2006), Klass (2008), Few (2009, 2012), Robbins (2013), Larson-Hall and
Herrington (2010), and others who have demonstrated the power of compelling
graphic display in communicating information. Many of these writers note
that part of the problem is an overreliance on computers and the software that
goes with them. Microsoft Excel can produce eye-catching graphics, and SPSS
will produce bar charts with a few mouse clicks. However, accompanying the
ease with which these may be produced is often a lack of close examination
of the data. “Computers can’t make sense of data; only people can” (Few, 2009,
p. 2). Thoughtful use of graphics can help us make sense of and effectively com-
municate the implications of our data. The thoughtful use of graphics involves
establishing the hierarchy of purposes for the graphic before designing any display.
Graphics can be used for exploration of data, communication of discovery, and
archival of collected data for future use. Each of these uses requires different deci-
sions in the design process. The focus in the present chapter will be on the use of
graphic display for communication of information to those in, or who wish to
join, the community of second language (L2) researchers.
Graphics are frequently used in L2 research reporting. During the year prior
to the writing of this paper, 136 data-based research articles in five major jour-
nals in the field presented 514 tables and 207 figures.1 It is noteworthy that there are
over twice as many tables as figures. Although many of the tables provide textual
information such as research study procedures, variable definitions, and scoring
rubrics, a large number contain descriptive and inferential statistics results, cor-
relation matrices and results that could easily be more informatively presented in
a graphical chart. A breakdown of the types of graphical charts among the figures
in the journals is presented in Table 5.1 (see also Larson-Hall, in preparation).
TABLE 5.1 Types of graphical charts and frequency of use found in the last four regular
issues of five L2 journals
Line Graph 14 14 6 8 5 47
Grouped Bar Chart 10 11 14 4 1 40
Diagram (Text) 11 5 5 8 2 31
SEM or Path Diagram 5 7 3 4 3 22
Scatter Plot 9 5 0 1 3 18
Pictures 9 1 1 1 1 13
Bar Chart 2 4 2 1 2 11
Spectrogram/Acoustic Display 4 0 2 0 0 6
IRT Rasch map 0 0 0 5 0 5
Box and Whisker 0 0 0 1 2 3
Dot Chart (CIs) 0 3 0 0 0 3
Stacked Bar Chart 0 0 0 0 3 3
Pie Chart 0 0 0 0 2 2
Stacked Bar Chart 3-D 0 0 0 0 2 2
Forest Plot 0 1 0 0 0 1
Total 64 51 33 33 26 207
From Table 5.1, it can be seen that there is a very strong reliance on line graphs
and bar charts (either regular or grouped). This practice is unfortunate in that
these types of visual displays fail to provide the rich information available in other
chart types. It should also be noted that the Diagram (Text) graphics, the third
most common type of graph, are predominantly either visual presentations of
theoretical models or descriptions of study procedures, and are thus not providing
quantitative information.
FIGURE 5.1 Cleveland’s 1993 graphic display of barley harvest data from Immer,
Hayes, & Powers (1934)
2009, 2012; Klass, 2008; Kosslyn, 2006; Nicol & Pexman, 2010; Larson-Hall &
Plonsky, 2015; Robbins, 2013; Tufte, 1983, 1997, 2006). Although these authors
are not unanimous in their recommendations, overlap in their recommendations
indicates that quantitative graphical displays should reflect these general rules for
displaying quantitative information:
Tables
Tables allow complex units of data to be presented in an organized way. They are
effective when the reader needs to examine exact values of different variables.
They allow the consumer to make additional calculations and make exact com-
parisons. Additionally, they are useful when there is a need to provide information
together that is in different units of measurement (e.g., proficiency level, years
studying an L2, mean length of utterance). Graphic display generally begins with
a data table of some form. Tables systematically display numbers and effectively
structure and present concentrated data and small data sets.
Tabular data should be presented unambiguously. This requires that descrip-
tive text be presented in the table titles, headings, footnotes, labels, and source
information. Clark (1987) points out that numbers are only one element of the
overall data. She notes that a data set includes the words that link numbers to the
phenomena that are under consideration and ties the content elements as needed
to make clear the who, what, how, where, and when associated with the numerical
information. Titles should be clear, sample sizes should be apparent, quantities
should be specified, and time frames should be obvious.
Wainer (1997) provides four rules for table construction: (1) round heavily,
(2) order rows and columns in a sensible way, (3) include summary rows and
columns when important, and (4) add spacing to aid perception. He argues that
people cannot process more than two digits easily, that we cannot generally justify
more than two digits statistically because of standard errors, and that most people
almost never care about more than two digits. However, the two-digit recom-
mendation is simply a standard against which data reporting can be evaluated. For
example, when data have historically been reported on College Entrance Examination
Board (CEEB) scales (used for the SAT, GRE, and the TOEFL paper-and-pencil
test), three digits may be a convention that is most conveniently followed. The
information in Table 5.2 is an initial, though slightly edited, presentation of infor-
mation on the 2009 National Assessment of Educational Progress Reading scale
results. The results are for grade 12 public schools for the 11 U.S. states reported. It
was produced by the NAEP State Comparisons Tool from the National Center for
Education Statistics (http://nces.ed.gov/nationsreportcard/statecomparisons).
A number of aspects of this table can be addressed from the perspective of the
general guidelines presented in the previous section as well as the specific guide-
lines for tables proposed by Wainer. First, the shading and grid lines do not add
information to the graphic or aid in perception. They simply serve to increase
the amount of ink without increasing the amount of information. Second, the
“Order” variable is an empty category and the repetition of “2009” and “Scale
Score” in the second and third rows are redundant information provided in the
table title. Generally, if the value in any row is the same across columns (or any
column is the same across rows), the row or column should be removed and perhaps
included elsewhere, such as a header or footnote. The label "National public"
TABLE 5.2 2009 average reading scale score sorted by gender, grade 12 public schools
Male-Female
All students Male Female Difference
2009 2009 2009 2009
Order Jurisdiction Scale Score Scale Score Scale Score Scale Score
N/A National public 287.0595571 280.956378 292.9596502 –12.00327228
N/A Arkansas 279.8846598 271.1364272 288.6513578 –17.51493065
N/A Connecticut 292.3508196 284.9950077 299.7995149 –14.80450727
N/A Florida 282.6334833 275.5068654 289.3160879 –13.8092225
N/A Idaho 290.1409912 284.654741 296.0461276 –11.3913866
N/A Illinois 291.5195945 285.5453884 297.310035 –11.76464663
N/A Iowa 290.6223739 283.895896 297.7043157 –13.80841973
N/A Massachusetts 295.4572734 289.9215732 301.1007774 –11.17920417
N/A New Hampshire 292.9695062 283.7600856 302.3824946 –18.62240893
N/A New Jersey 288.0905513 281.6284422 294.3658039 –12.73736175
N/A South Dakota 291.9890962 285.7366043 298.5041494 –12.76754502
N/A West Virginia 279.3981132 270.7917682 287.8034791 –17.01171092
Note: The NAEP Reading scale ranges from 0 to 500.
Source: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2009 Reading Assessment.
is not clear since all schools in the sample are public schools rather than private.
Further, the scale scores are carried out to seven decimal places. As noted at the
bottom of the table, the reading scale ranges from 0 to 500. It is unlikely that
anyone reading the table would be interested in decimal places. Finally, the font is
too small. Thus, we can tidy the display as shown in Table 5.3.
The revised table is less cluttered and the NAEP scale scores are easier to pro-
cess without the excessive digits after the decimal. Since the scores are reported
on a scale from 0 to 500, the decimals are unnecessary. Unnecessary information
below the table has been eliminated.
Now, we need to address two additional issues. First, while the table presents
the scale scores for each of the states with available information, no overall com-
parison statistics are provided. Summary statistics below the table would be infor-
mative for comparative purposes. Second, the ordering of the states is alphabetical,
an arbitrary order that is almost never satisfactory or informative. However, in
the present case, it is a judgment call as to whether the table is presented just for
someone to look up his or her own state’s score, in which case alphabetical order
might make sense, or whether there is the need for comparative information
across states, in which case alphabetical order is not informative. Ordering the
state scores in descending order of overall scale score will facilitate score
comparisons. Likewise, the table would benefit from having the labels more
TABLE 5.3 2009 average NAEP reading scale scores by gender for grade 12 public schools
in 11 states (first revision)
All Male-Female
Jurisdiction Students Male Female Difference
TABLE 5.4 2009 average NAEP reading scale scores by gender for grade 12 public schools
in 11 states sorted on state mean scores (second revision)
Male-Female
Jurisdiction All Students Male Female Difference
centrally aligned with the numbers. In addition, the negative sign indication for
the differences between male and female scores is an artifact of how the gender
categories were ordered in calculations and is unnecessary in displaying actual
magnitude. Table 5.4 reflects these changes.
FIGURE 5.2 Types of graphics used over last four regular issues of five applied linguis-
tics journals
The chart in Figure 5.2 presents the distribution of chart types across the
different journals described in Table 5.1. It looks quite impressive and fetch-
ing in my opinion, and was especially so in the original color version on
screen. However, this chart violates virtually all of the suggestions for good
graphic design. It does not present the data unambiguously, efficiently, clearly,
or meaningfully. First, it does not have a clear purpose. It is unclear why any-
one would need to see graphically the different numbers of every graphic
type by each journal. The graph is not useful for description, exploration, or
tabulation. Any comparative function would be much better displayed in the
table format rather than a three-dimensional bar chart. Further, if one wanted
to compare the different graph types graphically across journals, it would be
more advisable to take only the five or six most common graph types. Also,
the elements of the graph are not labeled clearly. It is not clear that the x-axis
represents the graph type and the y-axis displays the number of tokens of each
graph type for each of the different journals. The rotation of the graph makes
it difficult to make any comparisons of the data. For example, it is not possible
to compare MLJ and LL for graph type 1 because the bars are hidden behind
other bars.
In short, the graph does a poor job of showing the data. The data do not
stand out because the graph is cluttered. The y-axis scale appears to be distorted
in that it is very extended vertically. This accentuates the apparent differences
between the numbers of occurrences. For example, the difference between TQ
and SSLA for graph type 1 is only 2, but the difference looks more striking in
Figure 5.2. Further, the three-dimensional representation makes it difficult to
interpret actual values. How many occurrences correspond to LL graphic type
15? The three-dimensional cylindrical columns serve no function. The gridlines
are deceptive in that each gridline does not correspond to an actual numerical
difference between the y-axis score numbers. The excessive number of gridlines
creates clutter and adds ink without adding information.
We can see that there are a number of ways to go wrong with graphs. We will
now look at several different graph types and examine their uses. We will also
discuss ways to keep these graphs within the guidelines provided earlier. The data
for the chart types that I describe are based on four data sources: a hypothetical
set of data representing language test performance and questionnaire information
for 45 examinees in a language program; the journal use of graphic information
data in Table 5.1; the NAEP data from Table 5.4; and data from an introductory
L2 studies course I taught online.
Bar charts and histograms. These are basic charts that are fairly simple to
produce and read. Bar charts are used for discrete/categorical variables along the
x-axis (e.g., yes/no, country of origin), while histograms are used for continuous
variables, sometimes broken into categories (e.g., 10–19, 20–29, 30–39, etc.).
Examples are shown in Figure 5.3 and Figure 5.4. Note that the bar chart has
space between the bars while the histogram typically does not. The bar chart
describes the subjects' mean scores on a listening test across the three self-rated
confidence rating categories.
FIGURE 5.3 Bar chart showing means of listening scores for each category of self-rated
confidence ratings with 95% CI (N = 45)
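The binning step that separates a histogram from a bar chart can be sketched as follows (an illustration with invented scores, not data from the chapter):

```python
from collections import Counter

def bin_scores(scores, width=10):
    # Histogram preparation: group continuous scores into fixed-width
    # intervals (10-19, 20-29, ...) before counting and plotting.
    counts = Counter((score // width) * width for score in scores)
    return {f"{lo}-{lo + width - 1}": counts[lo] for lo in sorted(counts)}

print(bin_scores([12, 15, 23, 27, 28, 31, 35, 38, 39]))
# -> {'10-19': 2, '20-29': 3, '30-39': 4}
```

A bar chart, by contrast, would count truly discrete categories (yes/no, country of origin) with no binning step at all.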
There are variations on the simple bar graph and histogram. First, there are
grouped bar charts as shown in Figure 5.5. Additionally, there are stacked bar
charts such as that in Figure 5.6. However, the stacked bar charts in Figure 5.6
point out the difficulty of comparing lengths that do not have a common base-
line. For example, it is difficult to accurately compare the gender composition
of proficiency group 2. This is a common problem with using information in
stacked bar charts.
FIGURE 5.4 Histogram of speaking scores
FIGURE 5.5 Grouped bar chart for speaking scores by course level and gender with 95% CI
FIGURE 5.6 Stacked bar charts showing percentage composition of proficiency groups by gender
The discussion of Table 5.1 indicated that bar charts represent a very large pro-
portion of the graph types that appear in the journals surveyed. However, Tufte
(1983, 1997) criticizes bar charts for several reasons. First, they frequently have too
much ink and too little information. On one hand, in Figure 5.5, some form of
texture or shading is provided for both gender categories. Less ink would be used
if one category were simply left without shading. On the other, the Figure 5.3
and Figure 5.5 bar charts provide CIs along with the mean scores. This at least
provides minimal additional information about the precision of measurement.
However, bar charts do not show much information about the distribution of the
scores. They do not provide information about the range of actual scores or about
the standard deviations. For the most part, they only focus on means.
In order to improve on some of the shortcomings of bar charts, some writ-
ers encourage the use of box-and-whisker plots such as those in Figure 5.8
(see Larson-Hall & Plonsky, 2015). The graph displays the median value for all
scores in addition to the median for all scores above the median (the 75th per-
centile) and for all scores below it (the 25th percentile). The area between these
last two medians represents the middle 50% of all scores (i.e., the 25th to 75th
percentile). This area is represented as the box in the plot. The whiskers extend
out to the upper and lower extreme scores. Some authors refer to the top end
of the box as the upper hinge and the lower end of the box as the lower hinge. The
lower hinge is also frequently referred to as the lower quartile (the first through
25th percentiles) and the upper hinge as the top quartile (the 75th to 100th per-
centiles). Figure 5.8 represents two box-and-whisker plots for the speaking test
results by gender.
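The quantities behind a box-and-whisker plot can be computed directly. The sketch below uses invented scores and one common convention for the hinges (statistical packages differ slightly in how they compute quartiles):

```python
import statistics

def five_number_summary(scores):
    # The five values a box plot draws: whisker ends at the extremes,
    # hinges at the medians of the lower and upper halves, and the
    # overall median as the line inside the box.
    s = sorted(scores)
    n = len(s)
    lower_half = s[: n // 2]          # scores below the median
    upper_half = s[(n + 1) // 2 :]    # scores above the median
    return {
        "min": s[0],
        "lower_hinge": statistics.median(lower_half),
        "median": statistics.median(s),
        "upper_hinge": statistics.median(upper_half),
        "max": s[-1],
    }

print(five_number_summary([3, 5, 7, 8, 9, 11, 12, 13, 14, 15, 16]))
# -> {'min': 3, 'lower_hinge': 7, 'median': 11, 'upper_hinge': 14, 'max': 16}
```

The box spans the two hinges (the middle 50% of scores) and the whiskers run out to the minimum and maximum.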
One way to look at a box plot is to see it as a histogram turned on its side.
This comparison, however, highlights one criticism of box plots, namely, that a
single box plot does not provide as much information as a histogram in cases in
which the data do not cluster around the median score (Klass, 2008). For example,
the histogram in Figure 5.4 provides more information than a single box plot.
Yet, box plots are very useful when several box plots are used to compare some
distribution across groups. For example, Figure 5.9 provides a better sense of
how the data are distributed around proficiency levels than the histogram shown
previously in Figure 5.4. The box plots show decreasing internal dispersion
in scores as the proficiency level increases. This information is not available in a
single histogram across speaking scores, nor would it be apparent in a bar graph.
FIGURE 5.8 Box-and-whisker plots for the speaking test scores by gender
Because the marked points in box plots represent percentiles, box plots can also be
criticized in comparison with bar charts that contain 95% CIs because the latter
would enable a visual determination of statistically significant differences between
groups. Given the multiple and severe weaknesses of statistical significance (e.g.,
Plonsky, Chapter 3 in this volume), however, the informational richness provided
by box plots is a worthwhile trade-off.
Line graphs. This type of graph frequently displays data in a time series. Fig-
ure 5.10 shows that across five tests, the student scores increased each administra-
tion except at time 3. The line indicates a continuum along which the students
develop. A consideration for Figure 5.10 is whether the graph should in fact begin
with a zero point along the y-axis. It is generally good practice to include a zero
point in order to provide the reader with an accurate baseline. An argument could
be made in the present case that the zero point is needed to provide perspective,
particularly if the test happened to be a commonly known test and readers would
have reference to what a score of 55 or 70 means. However, it could also be
argued that the real comparison here is over time and that beginning the y-axis
scale with zero is a waste of space.
It is not uncommon to find line graphs with categorical data types. However,
this practice should be employed cautiously because the line between categories
can imply a continuum that is not warranted. The decision is not always clear,
however. Figure 5.11 presents scores for students across three levels on three sub-
tests. The graph has several problems. In some instances, it may be acceptable to
interpret the different levels as acting as proxies for ordered scale intervals. How-
ever, at other times, it may not be warranted. For example, the three different
FIGURE 5.9 Box-and-whisker plots for the five proficiency levels across the speaking
test scores
FIGURE 5.10 Student scores (means and CIs) on five tests administered three weeks
apart over a semester (N = 45)
FIGURE 5.11 Mean scores and 95% CIs on reading, listening, and grammar for three
proficiency levels
FIGURE 5.12 Graphic representation of score data across levels with box chart display
of distributions
FIGURE 5.13 Scatter plot for the relationship between reading scores and grammar
scores (N = 45)
FIGURE 5.14 Mean state scores for NAEP data in Table 5.4
FIGURE 5.15 Mean state scores for NAEP data in Table 5.4 ordered by state score
FIGURE 5.17 Number of weekly online posts with sparklines showing the online post-
ing activity for each student
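A sparkline of the kind shown in Figure 5.17 compresses a series of counts into a word-sized graphic. A minimal sketch, using invented weekly counts and Unicode block characters:

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    # Map each value onto one of eight block heights, scaled between
    # the series minimum and maximum, yielding a one-line graphic.
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a flat series
    return "".join(BARS[round((v - lo) * (len(BARS) - 1) / span)]
                   for v in values)

print(sparkline([1, 3, 2, 5, 8, 4, 0]))
```

One such line per student places an entire semester of posting activity next to each name without a separate chart.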
make. The same information would be easily envisioned from a simple table or
numbers in the text. However, the pie chart on the right is very difficult to inter-
pret. Its elongation makes it difficult to compare the darkest area with the lightest
area. Further, the depth dimension adds no information at all. It leads the eye away
from the data.
FIGURE 5.19 Initial SPSS bar chart for speaking mean scores by level
FIGURE 5.20 Edited SPSS bar chart for speaking mean scores by level
Whether you appreciate my particular graphic decisions or not, it is clear that it is not
necessary to accept the default graphics provided by SPSS and most other programs.
Finally, the R programming language has incredibly powerful graphing capa-
bilities. This is not to downplay the steep learning curve involved with R. How-
ever, two R statistical graphics packages in particular stand out: lattice (Sarkar,
2008) and ggplot2 (Wickham, 2009). An example ggplot2 line graph of a statisti-
cal interaction between gender and level is presented in Figure 5.21.
This line graph shows that there is a change in relative position of scores with
the hi level in contrast to the int and low levels. Additionally, the graph provides
the score points and CIs for each level with gender classification. The ggplot2 R
package allows for extensive customization of graphic components.
Closing Remarks
When presenting quantitative data visually, it is important for authors not to sim-
ply rely on the default graphics that are available through a computer program.
Additionally, just as it is not wise to uncritically adopt a research design from a
FIGURE 5.21 ggplot2 line graph of listening scores by proficiency level and gender with 95% CIs
Background
This study addresses the questions, “What makes a graph better or worse at
communicating relevant quantitative information?” and “How can students
learn to interpret graphs more effectively?” It reviews the cognitive literature
on how viewers comprehend graphs and the factors that influence viewers’
interpretations.
The Study
Shah and Hoeffner (2002) note that analyses of graph comprehension have
looked at three major component processes: Viewers must encode the visual
array and identify important features, they must relate the visual features to
the conceptual relations represented by the features, and they must deter-
mine the referent of the concepts being quantified. In processing graphics,
viewers are more likely to describe x-y trends and retrieve the information
accurately when viewing line graphs than when viewing bar graphs. The
literature tends to indicate that line graphs are good for depicting x-y trends,
bar graphs for discrete comparisons, and pie charts for relative proportions.
Three-dimensional displays proved better than two-dimensional displays
when integration of information across three dimensions was needed. How-
ever, despite the potential benefits of three-dimensional displays, the use of
three-dimensional linear perspective drawings can degrade information. In
addition to global decisions about the general format, a graph can involve
additional visual features: color, size, and aspect ratio.
Knowledge about graphs affects how viewers encode and remember the
graphics. Viewers expect dependent variables to be plotted as a function
of the y-axis and independent variables on the x-axis. Additionally, viewers
rely on prior knowledge in interpreting graphs. Graph viewers are better at
understanding some types of content, such as those representing change, than
others.
Implications
Shah and Hoeffner infer nine principles from the research review. Six of these
are relevant for the current discussion:
1. Choose the format depending upon the communication goal.
2. Use multiple formats to communicate the same data.
3. Use the “best” visual dimensions to convey metric information when
possible.
4. Reduce working memory demands.
5. Choose aspect ratio and data density carefully.
6. Make graphs and text consistent.
• The Work of Edward Tufte and Graphics Press provides excellent resources
about graphic display: http://www.edwardtufte.com/tufte/.
• The Gallery of Data Visualization: The Best and Worst of Statistical Graphics
presents examples with the view that the contrast may be useful, inform cur-
rent practice, and provide some pointers to both historical and current work:
http://www.datavis.ca/gallery/.
• Visual Statistics: Seeing Data with Dynamic Interactive Graphics: http://
www.uv.es/visualstats/Book/.
• Statistical Graphs, Charts and Plots: Statistical Consulting Program: http://
pages.csam.montclair.edu/~mcdougal/SCP/statistical_graphs1.htm.
• Hadley Wickham is interested in gaining a better understanding of statistical
models through data visualization. His website (http://had.co.nz/) is an
excellent resource, particularly for the R programming language.
• The Top Ten Worst Graphs: http://www.biostat.wisc.edu/~kbroman/
topten_worstgraphs/.
Further Readings
Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Oak-
land, CA: Analytics Press.
Few, S. (2012). Show me the numbers. Burlingame, CA: Analytics Press.
Kistler, S. J., Evergreen, S., & Azzam, T. (2013). Toolography. In T. Azzam & S. Evergreen
(Eds.), Data visualization, part 1. New Directions for Evaluation, 139, 73–84.
Tufte, E. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tufte, E. (1997). Visual explanations. Cheshire, CT: Graphics Press.
Wainer, H. (1997). Improving tabular displays, with NAEP tables as examples and inspira-
tions. Journal of Educational and Behavioral Statistics, 22(1), 1–30.
Wainer, H. (2005). Graphic discovery. Princeton, NJ: Princeton University Press.
Discussion Questions
1. Of the suggestions on page 82, which do you identify as most important for
the graphic display of quantitative data? Which are the least important? Why?
2. When should a table be used instead of a graph and vice versa? When might
you want to include both a graph and a table for a given data set?
3. What problems can you identify in the following graph?
[A line graph plotting reading and listening scores (0–100) for each individual examinee, with examinee numbers along the x-axis]
Note
1. The selected journals were Language Learning issues 62(4)–63(3), The Modern Language
Journal issues 96(4)–97(3), TESOL Quarterly issues 46(2)(4), 47(1–2), Studies in Second
Language Acquisition issues (34[3–4]), 35(1)(3), Applied Linguistics issues 34(1–4). Tables
and figures were counted from regular articles, excluding special issues of the journals,
and excluding graphics from article appendices or additional article information pro-
vided on internet sites.
References
Anscombe, F. J. (1981). Computing in statistical science through APL. New York: Springer.
Clark, N. (1987). Tables and graphs as a form of exposition. Scholarly Publishing, 19 (1),
24–42.
Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth Advanced
Books and Software.
Cleveland, W. S. (1993). Visualizing data. Murray Hill, NJ: AT&T Bell Laboratories.
Cleveland, W. S. (1994). The elements of graphing data (revised ed.), Murray Hill, NJ: AT&T
Bell Laboratories.
Daniel, C. (1976). Applications of statistics to industrial experimentation. New York: Wiley.
Few, S. (2004). Show me the numbers: Designing tables and graphs to enlighten. Oakland, CA:
Analytics Press.
Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Oakland,
CA: Analytics Press.
Few, S. (2012). Show me the numbers. Burlingame, CA: Analytics Press.
Fisher, R. A. (1966). The design of experiments (8th ed.). Edinburgh: Oliver and Boyd, Ltd.
Immer, R. F., Hayes, H. K., & Powers, L. (1934). Statistical determination of barley varietal
adaptation. Journal of the American Society of Agronomy, 26, 403–419.
Klass, G. M. (2008). Just plain data analysis: Finding, presenting, and interpreting social science data.
Lanham, MD: Rowman & Littlefield Publishers, Inc.
Kosslyn, S. M. (2006). Graph design for the eye and mind. Oxford: Oxford University Press.
Lane, D. M., & Sandor, A. (2009). Designing better graphs by including distributional
information and integrating words, numbers and images. Psychological Methods, 14,
239–257.
Larson-Hall, J. (in preparation). Graphics and data accountability in L2 acquisition research.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning, 65,
Supp. 1, 125–157.
National Center for Education Statistics. NAEP State Comparisons Tool. Retrieved 26 November 2013, from http://nces.ed.gov/nationsreportcard/statecomparisons/.
Nicol, A.A.M., & Pexman, P. M. (2010). Displaying your findings: A practical guide for creating
figures, posters, and presentations. Washington, DC: American Psychological Association.
Robbins, N. B. (2013). Creating more effective graphs. Wayne, NJ: Chart House.
Sarkar, D. (2008). lattice: Multivariate data visualization with R. New York: Springer.
Tufte, E. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tufte, E. (1997). Visual explanations. Cheshire, CT: Graphics Press.
Tufte, E. (2006). Beautiful evidence. Cheshire, CT: Graphics Press.
Wainer, H. (1997). Improving tabular displays, with NAEP tables as examples and inspirations. Journal of Educational and Behavioral Statistics, 22(1), 1–30.
Wainer, H. (2005). Graphic discovery. Princeton, NJ: Princeton University Press.
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer.
Wilkinson, L. and the Task Force on Statistical Inference. APA Board of Scientific Affairs.
(1999). Statistical methods in psychology journals. American Psychologist, 54(8), 594–604.
6
META-ANALYZING SECOND
LANGUAGE RESEARCH*
Luke Plonsky and Frederick L. Oswald
Before we outline the major steps and key considerations when conducting a
meta-analysis, we will define the term meta-analysis in both a narrow and broad
sense. The narrower definition of meta-analysis refers to a statistical method for
calculating the mean and the variance of a collection of effect sizes across studies,
usually correlations (r) or standardized mean differences (d). The broader defini-
tion of meta-analysis includes not only these narrower statistical computations,
but also the conceptual integration of the literature and the findings that gives the
meta-analysis its substantive meaning. This integration involves the meta-analyst’s
expert understanding, translation, and communication of the research studies and
samples involved, along with the best theory that researchers offer across (and
beyond) the set of studies. The current chapter focuses primarily on the practi-
cal aspects of meta-analysis under this broad definition, where we describe how
meta-analysis addresses (if not solves) three major problems inherent to narrative
or qualitative reviews in second language (L2) research.
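The computational core of this narrow definition is small enough to sketch directly. The Python sketch below computes the two effect size indices just mentioned from the descriptive statistics that primary studies typically report; the group means and standard deviations are invented for illustration, and the d-to-r conversion follows Borenstein, Hedges, Higgins, and Rothstein (2009):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference (d), using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def d_to_r(d, n1, n2):
    """Convert d to a correlation r; a corrects for unequal group sizes."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d**2 + a)

# Hypothetical treatment vs. comparison groups (30 learners each):
d = cohens_d(52.0, 10.0, 30, 47.0, 10.0, 30)
print(round(d, 2))                  # 0.5
print(round(d_to_r(d, 30, 30), 2))  # 0.24
```

Once every study's result is expressed on a common metric such as d or r, the meta-analytic mean and variance described below can be computed across studies.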
The first problem is that narrative reviews are qualitative in nature: They may
survey and describe study effects verbally, but in doing so, they do not consider
how sampling error variance confounds the interpretation of variation in research
findings. Specifically, small samples alone can contribute to variation (statistical
imprecision) in study effects, independent of the particular theories, samples, mea-
sures, or settings that also contribute to variation (substantive variance) in study
effects. Rather than treating effect sizes and the accompanying study narrative in
a qualitative manner, a meta-analysis is a more objective method in which study
effects with larger sample sizes are more statistically precise and therefore con-
tribute more heavily to meta-analytic results. The second problem with narrative
reviews is their general overreliance on the ritual of null hypothesis significance
testing (NHST; see, for example, Plonsky, Chapter 3 in this volume). If a narra-
tive review focuses narrowly on p values from NHST instead of effect sizes, two
dangers are likely to arise: Some statistically significant results will be given too
much attention (i.e., when the actual effect is negligible, but with a small p value
because it is based on a large sample) and nonsignificant results may be ignored,
yet many nonsignificant results across studies may be suggestive of a practically
and statistically significant effect if they are aptly combined in a meta-analysis. The
third problem with narrative reviews is that although experts in L2 research have
a vast storehouse of discipline-specific knowledge, as humans, they are fallible and
subject to the foibles of human memory and emotion, making imperfect or incon-
sistent decisions and interpretations regarding a body of research. To be sure, the
expertise and judgment of L2 researchers remain essential to any literature review
process, yet meta-analysis serves as one critical quantitative tool that supplements
expertise and judgment and is more objective and systematic in nature. Without
such tools, narrative reviews may pay greater attention to those empirical findings
that are accompanied with more compelling verbal rationale or are published in
prestigious journals, even when other empirical findings are equally legitimate.
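The second problem can be made concrete with a small numerical sketch: five identical small studies, each nonsignificant on its own, combine into a precise and statistically significant meta-analytic estimate. The variance formula for d and the inverse-variance weighting follow Borenstein et al. (2009); the study values are hypothetical:

```python
import math

def d_variance(d, n1, n2):
    """Large-sample sampling variance of d (Borenstein et al., 2009)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def fixed_effect_summary(effects):
    """Inverse-variance-weighted mean d, its standard error, and z."""
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in effects]
    mean = sum(w * e[0] for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return mean, se, mean / se

# Five small studies, each d = .40 with 20 learners per group:
studies = [(0.40, 20, 20)] * 5
z_single = 0.40 / math.sqrt(d_variance(0.40, 20, 20))
mean, se, z_combined = fixed_effect_summary(studies)
print(round(z_single, 2))    # 1.25 -> each study nonsignificant on its own
print(round(z_combined, 2))  # 2.8  -> the combined effect is significant
```

A reviewer counting p values would tally five "failures"; the weighted combination shows a stable d of .40 estimated with far greater precision than any single study provides.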
How to Do a Meta-analysis
Meta-analysis has many parallels with the primary studies it attempts to sum-
marize. In both cases, the researcher must define the domain of interest, develop
measures, collect and analyze data, and interpret the theoretical and practical sig-
nificance of those findings.
Although no one coding sheet will work for everyone, certain information
will be common to almost all meta-analyses (see Table 6.1; see Lipsey & Wilson,
2001, for an example that is not domain specific). Other information particular
to the domain being meta-analyzed will also need to be coded. For example, a
meta-analysis of reading comprehension intervention studies might code for vari-
ables such as a study’s text length and genre, learners’ L2 vocabulary knowledge,
and first-language (L1) reading ability. A coding manual that defines each variable
and its associated values is also needed in order to train coders, resolve inter-
coder ambiguities, and generally ensure that the coding stage leads to a reliable
and justifiable data set. (See Wilson, 2009, for a thorough discussion of decision
points and procedures related to developing a valid and reliable coding scheme
for meta-analysis.)
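In practice, a coding sheet is simply a structured data set. The sketch below shows one way to serialize coded study records for later analysis; the field names and study entries are purely illustrative, not a recommended scheme, and a real sheet would mirror the coding manual:

```python
import csv
import io

# Illustrative coding-sheet fields; a real scheme follows the coding manual.
FIELDS = ["study_id", "year", "journal", "n", "design",
          "context", "outcome_measure", "d_value"]

rows = [
    {"study_id": "S01", "year": 2009, "journal": "Language Learning",
     "n": 48, "design": "between-groups", "context": "classroom",
     "outcome_measure": "reading comprehension", "d_value": 0.52},
    {"study_id": "S02", "year": 2012, "journal": "TESOL Quarterly",
     "n": 31, "design": "pretest-posttest", "context": "laboratory",
     "outcome_measure": "vocabulary recall", "d_value": 0.88},
]

# Write the coded records to CSV, the format most meta-analytic software reads.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Storing the codes in a plain, machine-readable format like this also makes it trivial to share the sheet as supplementary material, a practice we advocate below.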
Finally, “the first draft of a coding guide should never be the last” (Cooper,
2010, p. 86). The meta-analyst should be prepared to pilot, revise, and repilot the
coding sheet before and even during the coding process (e.g., Aytug et al., 2012;
Kepes, McDaniel, Brannick, & Banks, 2013).
By keeping a log, the meta-analyst can then report the extent to which data for
certain variables were inferred, imputed, or left out.
At least one additional rater should be trained and then asked to code as many of
the studies being meta-analyzed as possible. Lipsey and Wilson
(2001) recommend double coding of at least 20 but ideally 50 or more studies.
However, with a median sample of only 17 studies in the 91 L2 meta-analyses
reviewed by Plonsky and Oswald (2014), it may often be possible to double
code all of the studies in the meta-analysis (see Lee et al., in press). It is then very
important to report some measure of interrater agreement to determine coding
accuracy (e.g., intraclass correlation, Cohen’s kappa, percent agreement), along
with some description of the number and nature of rating discrepancies and how
their resolution was achieved. Additionally, we urge L2 meta-analysts to make
their coding procedures and all coding sheets directly accessible to their readership
as supplementary material (e.g., Microsoft Excel sheets). These documents
can be made available through journals' or individual researchers' websites
by providing a link in the written report or a footnote similar to that in Plonsky
(2011), which states “In order to facilitate replication and/or re-analysis, the data
set used in this study will be made available upon request.” Template versions of
coding schemes can and should also be made available in the aforementioned
venues and/or through the IRIS database for L2 instruments.
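Percent agreement and Cohen's kappa, two of the interrater indices mentioned above, are straightforward to compute once double coding is complete. In this sketch the two raters' codes for a hypothetical 'context' variable are invented for illustration:

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Proportion of double-coded studies on which the two raters agree."""
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement for two raters on a nominal variable."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Expected agreement if both raters coded at random with their base rates:
    p_expected = sum(counts_a[k] * counts_b[k]
                     for k in set(codes_a) | set(codes_b)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Two raters' codes for ten double-coded studies (hypothetical):
r1 = ["lab", "class", "class", "lab", "class", "lab", "lab", "class", "class", "lab"]
r2 = ["lab", "class", "lab",   "lab", "class", "lab", "lab", "class", "class", "class"]
print(percent_agreement(r1, r2))        # 0.8
print(round(cohens_kappa(r1, r2), 2))   # 0.6
```

The gap between the two indices (.80 vs. .60) illustrates why kappa is preferred when categories are few: some raw agreement is expected by chance alone.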
Analysis
As we stated at the outset of this chapter, meta-analysis essentially involves cal-
culating a mean effect size and its corresponding variance from a particular body
of research. Whereas the literature searching and coding stages help ensure the
body of research and corresponding effect sizes are appropriate, the analysis stage
is where the meta-analyst estimates this overall mean and variance. Despite the
seeming simplicity of calculating a mean and variance, there can be some impor-
tant challenges and decisions to make. A single study, for example, may report
multiple effect sizes on the same relationship, based on multiple settings, mul-
tiple groups, multiple measures, and/or multiple time points. It may be justifiable
merely to average them prior to the meta-analysis. But the multiple effects in stud-
ies like these are often complex, and the underlying heterogeneity is important to
understand. For instance, caution must be exercised when handling a set of studies
where some effects are pretest-posttest designs and others are between-groups
designs. Although most meta-analyses of L2 research have treated effects from
both types of studies as comparable, they should generally be treated sepa-
rately, because pretest–posttest designs tend to produce larger effects (see Mor-
ris, 2008). A related issue is how L2 meta-analyses have mistakenly applied the
between-groups formula for the d value to pretest–posttest designs. This is a mis-
take because in the latter case, calculation of an appropriate d value requires the
correlation between pre- and posttests. This correlation is almost never reported
in primary studies, but without its value (or some reasonable estimate), the effect
size will be biased (Cheung & Chan, 2004; Gleser & Olkin, 2009). In Plonsky
and Oswald’s (2014) synthesis of effects across 91 meta-analyses of L2 research, the
researchers provide empirical evidence for this bias. The median meta-analytic d
values resulting from between-groups (independent samples) and within-groups
(pretest-posttest) contrasts were .62 versus 1.06, respectively.
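The role of the pre-post correlation can be illustrated numerically. One common formulation standardizes the pre-post change by the standard deviation of the change scores, which cannot be computed without r; the means and SDs below are invented, and the comparison with the misapplied between-groups formula shows how the resulting d shifts with r:

```python
import math

def within_group_d(m_pre, m_post, sd_pre, sd_post, r):
    """Pre-post change standardized by the SD of the change scores.
    Requires the pre-post correlation r, rarely reported in primary studies."""
    sd_diff = math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)
    return (m_post - m_pre) / sd_diff

# Between-groups formula misapplied to a pretest-posttest design:
naive = (60 - 55) / 10  # 0.5, regardless of r
for r in (0.3, 0.5, 0.8):
    print(r, round(within_group_d(55, 60, 10, 10, r), 2))
```

With highly correlated pre- and posttests (r = .8 here), the change-score d is substantially larger than the naive value, which is one reason pretest-posttest contrasts tend to yield bigger effects than between-groups contrasts.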
Another common issue in the analysis phase involves dealing with missing
data. Studies often lack information critical to meta-analysis. Sometimes the only
option is to exclude such studies, the choice made by most L2 meta-analyses to
date. However, if the number of available studies for meta-analysis itself is pre-
ciously small, then a second option might be to estimate unreported values (cf.
Higgins, White, & Wood, 2008). A meta-analyst must weigh the benefits of retaining
studies that at least provide partial information (e.g., means) by estimating the
data that they lack (e.g., standard deviations), with the potential drawbacks of esti-
mating or assuming too much out of the missing data. A third option is to request
missing data directly from the study’s researchers. Although this last decision may
be the ideal solution, it may be a challenge to contact researchers successfully and
have them comply with data requests (see Orwin, 1994; McManus et al., 1998).
A small number of L2 meta-analyses have reported using this strategy, leading
generally to a positive response of approximately 30% (e.g., Lee et al., in press;
Plonsky, 2011; but cf. Plonsky, Egbert, & LaFlair, in press).
incorporating other factors such as rated study quality (see Hunter & Schmidt,
2014, and Schmidt, Le, & Oh, 2009, for detailed information on this approach).
This method may be worth pursuing once meta-analysis in L2 research has
matured and studies routinely report information on measurement reliability. In
general, it is the meta-analyst’s responsibility to strike a balance between choosing
a meta-analysis method that is too simple versus one that is too complex in order
to summarize the data in a reliable and maximally informative manner.
Oswald & Johnson, 1998; although see Sutton & Higgins, 2008). It is much better
to take an a priori approach to understanding variance in effect sizes by dividing
study effects into a priori subgroups determined by theory and/or coded vari-
ables, meta-analyzing the subgroups, and then comparing the meta-analytically
weighted average effects. This approach is far superior to the post hoc approaches
of estimating effect sizes in the RE model or testing for effect size heterogeneity
with the Q test.
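The a priori subgroup approach amounts to running a separate weighted meta-analysis within each coded subgroup and then comparing the resulting means. A minimal sketch, with hypothetical effects coded in advance by research setting:

```python
import math

def d_variance(d, n1, n2):
    """Large-sample sampling variance of d (Borenstein et al., 2009)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def weighted_mean_d(effects):
    """Inverse-variance-weighted mean d for one subgroup."""
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in effects]
    return sum(w * e[0] for w, e in zip(weights, effects)) / sum(weights)

# (setting, d, n1, n2) -- hypothetical effects coded a priori by setting:
coded = [("lab", 0.9, 15, 15), ("lab", 1.1, 20, 20),
         ("classroom", 0.4, 30, 30), ("classroom", 0.6, 25, 25)]

for setting in ("lab", "classroom"):
    subgroup = [(d, n1, n2) for s, d, n1, n2 in coded if s == setting]
    print(setting, round(weighted_mean_d(subgroup), 2))
```

Because the subgroups are defined before looking at the effect sizes, the lab-classroom contrast here is a planned comparison rather than a post hoc fishing expedition.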
Finally, in line with the saying that “a picture is worth a thousand words,”
graphs and plots of data serve as critical tools in any data analysis (Wilkinson &
Task Force on Statistical Inference, 1999), and the forest plot and funnel plot are
the primary visualization tools used in meta-analysis (Borenstein, Hedges, Hig-
gins, & Rothstein, 2009). A forest plot presents the size of the effect on the x-axis
with the names of the studies being ordered (alphabetically or by the magnitude
of the effect) on the y-axis (see Figure 6.1). The plotted points usually bisect a
symmetric horizontal bar that shows the 95% CI, and in the bottom row is the
meta-analytic mean and its 95% CI. A funnel plot provides similar information to
a forest plot: It is a scatter plot of the effect size on the x-axis, with some func-
tion of measurement precision associated with the effect on the y-axis (e.g., the
sample size, the inverse of the sampling error variance). If the level of imprecision
in some studies is much larger than the variance in the effects of the underlying
study populations (as is usually the case), then this plot will tend to show a funnel
shape, hence the name (see Figure 6.2). Asymmetries in the funnel plot can serve
as an indicator of publication bias, such as when authors, editors, and reviewers
suppress small or statistically nonsignificant effects (see Figure 6.3). Asymmetries
can also indicate the need to examine moderator effects (subgroup analyses) or
other anomalies, such as the question of whether effect sizes from one research
team tend to be much larger than the rest. In short, the forest plot and funnel plot
for publication bias are indispensable visualization tools that can indicate mean-
ingful patterns in the meta-analytic database.
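The logic of a forest plot can be sketched even without graphics software: each study occupies a row, a marker sits at its effect size, and a bar spans its 95% CI. In the toy rendering below, the d values mirror three of the studies in Figure 6.1, but the standard errors are invented for illustration:

```python
def forest_plot(studies, lo=-2.0, hi=3.0, width=50):
    """Text-mode forest plot: '|' marks zero, '-' spans the 95% CI, '*' marks d."""
    def col(x):
        # Map an effect size onto a character column, clamped to the axis.
        return min(width - 1, max(0, round((x - lo) / (hi - lo) * (width - 1))))
    lines = []
    for name, d, se in studies:
        row = [" "] * width
        for c in range(col(d - 1.96 * se), col(d + 1.96 * se) + 1):
            row[c] = "-"
        row[col(0.0)] = "|"   # the zero (no-effect) line
        row[col(d)] = "*"     # the study's point estimate
        lines.append(f"{name:<10}{''.join(row)}")
    return "\n".join(lines)

# d values taken from Figure 6.1; the standard errors are hypothetical.
studies = [("Study 1", 0.2, 0.30), ("Study 2", 0.4, 0.25),
           ("Study 5", -1.3, 0.40)]
print(forest_plot(studies))
```

Even this crude rendering makes the key visual judgments possible at a glance: which CIs cross zero, and which studies (like Study 5 here) look like outliers.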
In closing this section on the analysis stage in meta-analysis, we want to point
out our bottom-line intent. Our goal is for L2 researchers to understand how
studies are weighted in meta-analysis and how FE and RE meta-analytic models
assume or estimate the variance across study effect sizes. However, we ultimately
recommend that meta-analytic estimates be considered in combination with
graphs of the effects (forest or funnel plots), and of course, a solid knowledge of
the research associated with the effects under study. Only then can meta-analysts
attempt to give partial insight into three fundamental questions: (a) Are all stud-
ies similar enough to be considered replicates of one another? (b) Do subgroups
of effect sizes differ in meaningful ways (e.g., lab vs. classroom studies)? (c) Are
there effect sizes that are outliers or that otherwise show unique characteristics
(e.g., a single large-sample military language-learning study within a sample of
college-classroom studies)?
FIGURE 6.1 Example of a forest plot: Study 1 (d = .2), Study 2 (d = .4), Study 3 (d = .2), Study 4 (d = .8), Study 5 (d = −1.3), Study 6 (d = .06), Study 7 (d = −.37), Study 8 (d = −.2), Study 9 (d = −1.5), and Study 10 (d = .25), each plotted with its 95% CI on an x-axis running from −2 to 3
FIGURE 6.2 Example of a funnel plot without the presence of publication bias (effect size d on the x-axis, from −2 to 3; sample size on the y-axis, from 0 to 160)
FIGURE 6.3 Example of a funnel plot with the presence of publication bias (effect size d on the x-axis, from −2 to 3; sample size on the y-axis, from 0 to 160)
correspond roughly to the 25th, 50th (median), and 75th percentiles of effects.
(For within-groups contrasts, we suggest the same three general descriptors for
d = .60, 1.00, and 1.40, respectively.) Observed correlation coefficients (r) were
.25 (25th percentile), .38 (50th), and .60 (75th). We are not suggesting that these
values should be applied universally to the breadth of L2 research but rather as a
single step away from Cohen and toward more field-specific interpretation of the
practical significance of effect sizes from L2 meta-analyses (and primary studies).
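Translating such field-specific benchmarks into practice is a simple thresholding exercise. The sketch below uses the within-groups d values suggested above (.60, 1.00, 1.40) and the observed r percentiles (.25, .38, .60); the descriptor labels are our shorthand, not a fixed convention:

```python
def interpret_within_d(d, thresholds=(0.60, 1.00, 1.40)):
    """Map a within-groups d onto a rough descriptor via the benchmarks above."""
    labels = ("below small", "small", "medium", "large")
    return labels[sum(abs(d) >= t for t in thresholds)]

def interpret_r(r, thresholds=(0.25, 0.38, 0.60)):
    """Same idea for correlation coefficients."""
    labels = ("below small", "small", "medium", "large")
    return labels[sum(abs(r) >= t for t in thresholds)]

for d in (0.3, 0.8, 1.2, 1.6):
    print(d, interpret_within_d(d))
# 0.3 below small / 0.8 small / 1.2 medium / 1.6 large
```

As the surrounding discussion emphasizes, such labels are a starting point for interpretation, not a substitute for domain knowledge about the effects being compared.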
Plonsky and Oswald (2014) also discuss in depth additional factors worthy of
consideration when interpreting effect sizes. The findings of previous syntheses,
for example, can help researchers gauge the relative magnitude of observed effect
sizes. Plonsky (2011), for example, discussed the findings of his meta-analysis of
L2 strategy instruction in relation to meta-analyses of strategy instruction in L1
educational contexts (e.g., Hattie, Biggs, & Purdie, 1996). Another way to refine
these benchmarks is to examine the historical trajectory of effects in the area
being investigated. Effect sizes that decrease over time may indicate that the vari-
ables of interest are being examined at an increasingly more nuanced level (Kline,
2013), in light of theoretical refinements and as research shifts from the laboratory
setting (where independent variables are strongly manipulated) to the classroom
setting (where independent variables are more naturalistic). Plonsky and Gass
(2011), for example, reviewed 174 studies of L2 interaction, finding that average
d values tended to decrease steadily over time (1980–1989, d = 1.62; 1990–1999,
d = .82; 2000–2009, d = .52). This finding was attributed in part to increasingly
subtle models of interaction that have been introduced, developed, and tested over
the last 30 years. In a similar vein, Mackey and Goo (2007) and Plonsky (2011)
calculated effects for subgroups based on research context, and substantially larger
d values were found in both meta-analyses for lab over classroom studies (.96 vs.
.57 and .79 vs. .43, respectively; see Sample Study 2).
In an alternative scenario of how effect sizes may change over time, improve-
ments to design and measurement in a particular research area might overcome
the flaws of past research and lead to larger effect sizes (Fern & Monroe, 1996).
Meta-analyses by Spada and Tomita (2010) and Mackey and Goo (2007) found
that more recent studies have used open-ended test formats more often, which
were found to produce larger effect sizes than more constrained test formats in
Li (2010; see Sample Study 2) and Lyster and Saito (2010) (but larger effects
were not found for open-ended formats in Mackey & Goo, 2007, or Norris &
Ortega, 2000). It should be noted that the two trends described here may occur
simultaneously and cancel each other out or lead to increased variation in effects
over time. To be sure, locating and explaining patterns related to the maturity of a
domain is complex, and the data will not speak for themselves, necessitating once
again the substantive knowledge and perspective of the expert reviewer.
One final consideration with respect to interpreting meta-analytic effect sizes
is the degree to which independent variables in primary research are manipulated.
From a practical standpoint, a particular intervention may not be feasible (despite
producing a large effect) if it is excessively involved, financially prohibitive, or
SAMPLE STUDY 1
Plonsky, L. (2011). The effectiveness of second language strategy instruction:
A meta-analysis. Language Learning, 61, 993–1038.
Background
Research on L2 strategy instruction has been extensive, but methods and
results in this area have been inconsistent. The goals of this study were to
summarize current findings and examine theoretical moderators of the
effects of strategy instruction.
Research questions
• How effective is L2 strategy instruction?
• How is strategy instruction affected by different learning contexts,
treatments, outcome variables, and research methods?
Method
Conventional database searches, Web of Science, and Google Scholar were
used to locate a total of 95 unique samples from 61 studies (N = 6,791) that
met all the inclusion criteria. Each study was then coded on 37 variables. Five
of 15 authors who were contacted provided missing data for studies report-
ing insufficient information to calculate an effect size.
Statistical tools
Effect sizes (Cohen’s d) were weighted by sample size and combined to cal-
culate the meta-analytic average, standard error, and CIs. Publication bias
was examined using a funnel plot. Summary effects were also calculated for
subgroups based on study characteristics (i.e., moderators).
Results
The (weighted) meta-analytic d value for the effects of L2 strategy instruc-
tion was .49, smaller than most effects in the L2 domain but comparable to
SAMPLE STUDY 2
Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis.
Language Learning, 60, 309–365.
Background
The theoretical and practical centrality of corrective feedback has led to
extensive research testing its effects, yet disagreement remains over how
empirical findings can inform L2 theory and practice. It is also unclear how
different types of feedback, learning contexts, and targeted L2 features
might relate to its effectiveness.
Research questions
• What is the overall effect of corrective feedback on L2 learning?
• Do different feedback types impact L2 learning differently?
• Does the effectiveness of corrective feedback persist over time?
• What are the moderator variables for the effectiveness of corrective
feedback?
Method
Li searched two academic databases, manually searched the archives of over
a dozen journals of L2 research, and scanned the references of review arti-
cles. This study also included 11 dissertations for a total of 33 unique study
reports.
Statistical tools
The Comprehensive Meta-Analysis software program enabled a relatively
sophisticated meta-analysis, statistically speaking. All results were calcu-
lated and presented using both RE and FE models, and availability and
publication bias were addressed using a funnel plot and a trim-and-fill anal-
ysis. (Trim-and-fill is a nonparametric statistical technique that adjusts the
meta-analytic mean. It does so by estimating effects that appear to be miss-
ing if a FE model and no systematic bias are assumed.) Additionally, Li tested
for several subgroup differences between studies.
Results
The overall d value for CF according to the FE model was .61 (RE = .64).
Moderator effects were also found for feedback types, delayed effects, and different
contexts (e.g., classroom vs. lab). There was some evidence of publication
bias, yet the effect sizes from the 11 nonpublished dissertations in this
study were larger on average than in published studies.
Conclusion
Meta-analysis has immense potential to summarize L2 research in a systematic
manner, adding clarity to the current status of theoretical claims while provid-
ing critical insights and directions for future research. Along with the benefits,
however, taking on a meta-analytic approach introduces a set of challenges that
include both those inherent to the method as well as particular to the field. In
light of these challenges, we close with suggestions that summarize the approach
and perspective that we have presented throughout this chapter.
First, despite the inherently greater objectivity embodied in the meta-analytic
approach, there is no single or best way to do a meta-analysis. Each step involves
multiple decisions that must be made in accordance with the researcher’s goals, the
substantive domain being synthesized, and the practical constraints of the available
data. As a principle, we believe that better decisions are usually the simpler ones,
such as analyses that are clear and understandable as opposed to more sophis-
ticated analyses that are technically correct but confusing and without practical
benefit. Second, as each of these important decisions is made, it is essential that the
meta-analyst maintain precise records so the results are understood appropriately in
the context of the entire process that led to them. Third and last, we have attempted
to identify and translate some of the general insights that other disciplines have
gained through decades of experience with meta-analysis, and we hope that other
L2 researchers will do the same in these critical formative years for meta-analysis in
the field. With some confidence, we can predict for L2 research what has happened
in all other major disciplines that have been exposed to meta-analysis: The coming
years will continue to show an exponential gain in the publication of meta-analytic
results. Meta-analysis will begin to be the microscope through which past L2
research is interpreted as well as the telescope through which theoretical develop-
ments and future L2 research efforts will be directed. Exciting times lie ahead as
meta-analysis becomes an essential tool in the L2 researcher’s toolbox.
Further Reading
History
Current Methods
• Database searches for meta-analysis: In'nami and Koizumi (2010), Plonsky and Brown (2015).
• Introduction to research synthesis: Ortega (in press).
• Timeline of research synthesis and meta-analysis: Norris and Ortega (2010).
• Review of meta-analysis in L2 research: Oswald and Plonsky (2010).
• Meta-analysis and replication: Plonsky (2012).
• Guide to interpreting effect sizes in meta-analysis: Plonsky and Oswald (2014).
Discussion Questions
Specific
on the x-axis and the d values on the y-axis. In SPSS, select Graphs >
Legacy Dialogs > Scatter/Dot > Simple Scatter, then move the vari-
ables into their respective boxes. How would you describe the pattern of
change, if any, in relation to the scenarios described in the earlier section on
Interpreting the Results?
7. Analyses-3: Examine and compare the funnel plots in Norris and Ortega
(2000, p. 452), Li (2010, p. 331), and Plonsky (2011, p. 1007). Do you see any
evidence for publication bias in those plots? If so, which one(s)? Do you see
any other irregularities? How might they be explained?
8. Interpreting the results: The overall findings in Plonsky's (2011) meta-analysis
of strategy instruction are interpreted in a variety of ways (e.g., compared to
a meta-analysis of L1 strategy instruction, Cohen's benchmarks, the benchmarks
described in Oswald & Plonsky, 2010, standard deviation units). Which
one(s) do you find most informative or relevant to the discussion? Why?
General
9. Which steps in carrying out a meta-analysis are the most/least objective and
subjective? How might each step of a meta-analysis affect the results that are
obtained?
10. Which areas of L2 research do you think might be good candidates currently
for meta-analysis? Why?
11. Describe the most important similarities between primary research and
meta-analysis.
12. Meta-analyses depend entirely on past research, but they can also be used to
direct future research. Select an L2 meta-analysis and consider its implica-
tions for future empirical efforts.
13. Imagine that you were carrying out a meta-analysis in a particular area of L2
research and wanted to investigate the quality of studies in your sample. How
would you operationalize and measure study quality?
14. What are some of the benefits and drawbacks of using benchmarks such as
Plonsky and Oswald’s (2014) to explain the magnitude of effects found in a
meta-analysis?
Note
∗ This chapter is an updated and adapted version of Plonsky, L., & Oswald, F. L. (2012).
How to do a meta-analysis. In A. Mackey & S. M. Gass (Eds.), Research methods in second
language acquisition: A practical guide (pp. 275–295). London: Basil Blackwell.
References
APA Publications and Communications Board Working Group on Journal Article Reporting Standards. (2008). Reporting standards for research in psychology: Why do we need them? What might they be? American Psychologist, 63, 839–851.
Aytug, Z. G., Rothstein, H. R., Zhou, W., & Kern, M. C. (2012). Revealed or concealed?
Transparency of procedures, decisions, and judgment calls in meta-analyses. Organiza-
tional Research Methods, 15, 103–133.
Borenstein, M., Hedges, L. V., Higgins, J.P.T., & Rothstein, H. R. (2009). Introduction to
meta-analysis. Chichester, UK: Wiley.
Cheung, S. F., & Chan, D. K-S. (2004). Dependent effect sizes in meta-analysis: Incorporat-
ing the degree of interdependence. Journal of Applied Psychology, 89(5), 780–791.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum.
Cohn, L. D., & Becker, B. J. (2003). How meta-analysis increases statistical power. Psychologi-
cal Methods, 8(3), 243–253.
Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach (4th ed).
Thousand Oaks, CA: Sage.
Cooper, H., Hedges, L. V., & Valentine, J. C. (Eds.). (2009). The handbook of research synthesis
and meta-analysis (2nd ed.). New York: Russell Sage Foundation.
Cooper, H. M., & Rosenthal, R. (1980). Statistical versus traditional procedures for sum-
marizing research findings. Psychological Bulletin, 87(3), 442–449.
Dalton, D. R., & Dalton, C. M. (2008). Meta-analyses: Some very good steps toward a bit
longer journey. Organizational Research Methods, 11(1), 127–147.
Fern, E. F., & Monroe, K. B. (1996). Effect-size estimates: Issues and problems in interpreta-
tion. Journal of Consumer Research, 23(2), 89–105.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher,
5, 3–8.
Gleser, L. J., & Olkin, I. (2009). Stochastically dependent effect sizes. In H. Cooper,
L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis
(2nd ed., pp. 357–376). New York: Russell Sage Foundation.
Harzing, A. W. (2007). Publish or Perish. Available from http://www.harzing.com/pop.htm.
Hattie, J. A., Biggs, J., & Purdie, N. (1996). Effects of learning skills interventions on student
learning: A meta-analysis. Review of Educational Research, 66(2), 99–136.
Hedges, L. V. (2008). What are effect sizes and why do we need them? Child Development
Perspectives, 2(3), 167–171.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic
Press.
Hedges, L. V., & Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological Methods, 6(3), 203–217.
Higgins, J.P.T., White, I. R., & Wood, A. M. (2008). Imputation methods for missing out-
come data in meta-analysis of clinical trials. Clinical Trials, 5(3), 225–239.
Hunter, J. E., & Schmidt, F. L. (2014). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Thousand Oaks, CA: Sage.
In’nami, Y., & Koizumi, R. (2009). A meta-analysis of test format effects on reading and listening test performance: Focus on multiple-choice and open-ended formats. Language Testing, 26(2), 219–244.
In’nami, Y., & Koizumi, R. (2010). Database selection guidelines for meta-analysis in
applied linguistics. TESOL Quarterly, 44(1), 169–184.
In’nami, Y., & Koizumi, R. (Eds.) (2014). Research synthesis and meta-analysis in second
language learning and testing. Special issue of English Teaching and Learning.
Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its correlates:
A meta-analysis. Language Learning, 64, 160–212.
126 Luke Plonsky and Frederick L. Oswald
Kepes, S., McDaniel, M. A., Brannick, M. T., & Banks, G. C. (2013). Meta-analytic reviews in the organizational sciences: Two meta-analytic schools on the way to MARS (the meta-analytic reporting standards). Journal of Business and Psychology, 28, 123–143.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and
Psychological Measurement, 56(5), 746–759.
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd
ed.). Washington, DC: American Psychological Association.
Lee, S-K., & Huang, H-T. (2008). Visual input enhancement and grammar learning:
A meta-analytic review. Studies in Second Language Acquisition, 30(3), 307–331.
Lee, J., Jang, J., & Plonsky, L. (in press). The effectiveness of second language pronunciation
instruction: A meta-analysis. Applied Linguistics.
Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis. Language
Learning, 60(2), 309–365.
Li, S., Shintani, N., & Ellis, R. (Eds.) (forthcoming). The complementary contribution of
meta-analysis and narrative review in second language acquisition research. Applied
Linguistics, special issue.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Lyster, R., & Saito, K. (2010). Oral feedback in classroom SLA: A meta-analysis. Studies in
Second Language Acquisition, 32(2), 265–302.
Mackey, A., & Goo, J. (2007). Interaction research in SLA: A meta-analysis and research
synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A col-
lection of empirical studies (pp. 407–451). New York: Oxford University Press.
McManus, R. J., Wilson, S., Delaney, B. C., Fitzmaurice, D. A., Hyde, C. J., Tobias, R. S.,
Jowett, S., & Hobbs, F.D.R. (1998). Review of the usefulness of contacting other experts
when conducting a literature search for systematic reviews. British Medical Journal, 317,
1562–1563.
Morris, S. B. (2008). Estimating effect sizes from pretest–posttest-control group designs.
Organizational Research Methods, 11(2), 364–386.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50(3), 417–528.
Norris, J. M., & Ortega, L. (2006a). Synthesizing research on language learning and teaching.
Philadelphia, PA: John Benjamins.
Norris, J. M., & Ortega, L. (2006b). The value and practice of research synthesis for lan-
guage learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on
language learning and teaching (pp. 3–50). Philadelphia, PA: John Benjamins.
Norris, J. M., & Ortega, L. (2007). The future of research synthesis in applied linguistics:
Beyond art or science. TESOL Quarterly, 41(4), 805–815.
Norris, J. M., & Ortega, L. (2010). Research Timeline: Research synthesis. Language Teach-
ing, 43, 61–79.
Ortega, L. (in press). Research synthesis. In B. Paltridge & A. Phakiti (Eds.), Companion to
research methods in applied linguistics. London: Continuum.
Orwin, R. G. (1994). Evaluating coding decisions. In H. Cooper & L. V. Hedges (Eds.),
Handbook of research synthesis (pp. 139–162). New York: Russell Sage Foundation.
Orwin, R. G., & Cordray, D. S. (1985). Effects of deficient reporting on meta-analysis:
A conceptual framework and reanalysis. Psychological Bulletin, 97(1), 134–147.
Oswald, F. L., & Johnson, J. W. (1998). On the robustness, bias, and stability of statistics from
meta-analysis of correlation coefficients: Some initial Monte Carlo findings. Journal of
Applied Psychology, 83(2), 164–178.
Oswald, F. L., & McCloy, R. A. (2003). Meta-analysis and the art of the average. In
K. R. Murphy (Ed.), Validity generalization: A critical review (pp. 311–338). Mahwah, NJ:
Lawrence Erlbaum.
Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and
challenges. Annual Review of Applied Linguistics, 30, 85–110.
Pearson, K. (1904). Report on certain enteric fever inoculation statistics. British Medical
Journal, 3, 1243–1246.
Plonsky, L. (2011). The effectiveness of second language strategy instruction: A meta-
analysis. Language Learning, 61, 993–1038.
Plonsky, L. (2012). Replication, meta-analysis, and generalizability. In G. Porte (Ed.), Repli-
cation research in applied linguistics (pp. 116–132). New York: Cambridge University Press.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., & Brown, D. (2015). Domain definition and search techniques in meta-analyses
of L2 research (or why 18 meta-analyses of feedback have different results). Second Lan-
guage Research, 31, 267–276.
Plonsky, L., Egbert, J., & LaFlair, G. T. (in press). Bootstrapping in applied linguistics:
Assessing its potential using shared data. Applied Linguistics.
Plonsky, L., & Gass, S. M. (2011). Quantitative research methods, study quality, and out-
comes: The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F. L. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Poltavtchenko, E., & Johnson, M. D. (2009, March). Feedback and second language writ-
ing: A meta-analysis. Poster session presented at the annual meeting of TESOL,
Denver, CO.
Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin,
85(1), 185–193.
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of
experiential factors. Language Testing, 15(1), 1–20.
Rothstein, H. R., Sutton, A. J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis: Prevention, assessment and adjustments. Chichester, England: Wiley.
Russell, J., & Spada, N. (2006). The effectiveness of corrective feedback for the acquisition
of L2 grammar: A meta-analysis of the research. In J. M. Norris & L. Ortega (Eds.),
Synthesizing research on language learning and teaching (pp. 133–164). Philadelphia: John
Benjamins.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of
validity generalization. Journal of Applied Psychology, 62(5), 529–540.
Schmidt, F. L., Le, H., & Oh, I-S. (2009). Correcting for the distorting effects of study artifacts in meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 317–333). New York: Russell Sage Foundation.
Schmidt, F. L., Oh, I-S., & Hayes, T. (2009). Fixed versus random effects models in
meta-analysis: Model properties and an empirical comparison of differences in results.
British Journal of Mathematical and Statistical Psychology, 62(1), 97–128.
Spada, N., & Tomita, Y. (2010). Interactions between type of instruction and type of lan-
guage feature: A meta-analysis. Language Learning, 60(2), 263–308.
Stukas, A. A., & Cumming, G. (in press). Interpreting effect sizes: Towards a quantitative
cumulative social psychology. European Journal of Social Psychology.
Sutton, A. J., & Higgins, J. P. T. (2008). Recent development in meta-analysis. Statistics in
Medicine, 27(5), 625–650.
Taylor, A., Stevens, J. R., & Asher, J. W. (2006). The effects of explicit reading strategy train-
ing on L2 reading comprehension: A meta-analysis. In J. M. Norris & L. Ortega (Eds.),
Synthesizing research on language learning and teaching (pp. 213–244). Philadelphia, PA:
John Benjamins.
Truscott, J. (2007). The effect of error correction on learners’ ability to write accurately.
Journal of Second Language Writing, 16(4), 255–272.
Valentine, J. C., Pigott, T. D., & Rothstein, H. R. (2010). How many studies do you need?:
A primer on statistical power for meta-analysis. Journal of Educational and Behavioral
Statistics, 35(2), 215–247.
White, H. D. (2009). Scientific communication and literature retrieval. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 51–71). New York: Russell Sage Foundation.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychol-
ogy journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.
Wilson, D. B. (2009). Systematic coding. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 159–176). New York: Russell Sage Foundation.
PART III
Advanced and
Multivariate Methods
7
MULTIPLE REGRESSION
Eun Hee Jeon
when Y′ is the predicted value of the CV, A is the intercept, Bs indicate the
parameters (regression coefficients) being estimated, Xs indicate the PVs, and k
represents the number of the PVs. This equation is also thought of as a “predic-
tion equation” (Tabachnick & Fidell, 2012, p. 123) as it yields the predicted (not
observed) value, Y′. Predicted Y′s are then correlated with observed values, Ys, to obtain the multiple correlation, R, the multivariate equivalent of the bivariate correlation, r. The squared value of the multiple correlation R, namely R2, denotes the amount of variance in the CV accounted for by the set of PVs in the equation. Less technically speaking, MRA
is a means to explain variance in the CV as a function of one or more PVs. Once
a well-fitting regression model is generated, it enables the researcher to closely
predict the value of the CV from the values of the PVs.
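To make the prediction equation and R2 concrete, here is a minimal sketch in Python (the dataset and variable values are invented for illustration): it solves the normal equations XᵀXb = Xᵀy for the intercept and regression coefficients, computes predicted Y′ values, and squares the correlation between Y′ and Y.

```python
from math import sqrt

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def regress(X, y):
    """Least-squares fit: returns [intercept, B1, ..., Bk]."""
    rows = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def r_squared(y, y_pred):
    """Square of the correlation between observed Y and predicted Y'."""
    my, mp = sum(y) / len(y), sum(y_pred) / len(y_pred)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, y_pred))
    sy = sqrt(sum((a - my) ** 2 for a in y))
    sp = sqrt(sum((b - mp) ** 2 for b in y_pred))
    return (cov / (sy * sp)) ** 2

# Hypothetical scores: two PVs (say, vocabulary and grammar) predicting a CV.
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in X]  # CV built as an exact linear function

coefs = regress(X, y)
y_pred = [coefs[0] + coefs[1] * x1 + coefs[2] * x2 for x1, x2 in X]
print([round(c, 3) for c in coefs])        # [2.0, 3.0, 0.5]
print(round(r_squared(y, y_pred), 3))      # 1.0
```

Because the CV here is constructed as an exact linear function of the two PVs, the fitted coefficients recover the generating values and R2 is 1; with real (noisy) data, R2 would fall below 1.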
To use an example from second language (L2) research, MRA can be used
to examine how different components of reading comprehension such as L2
decoding, vocabulary knowledge, and grammar knowledge individually and col-
lectively predict the reading comprehension of an L2 reader. MRA can also be
compared to analysis of covariance (ANCOVA) in that it can be used to exam-
ine the predictive power of an individual PV after the variance in the CV due
to a certain PV or PVs has been partialled out. Going back to the example of
L2 reading research, if the researcher is interested in finding out whether L2
vocabulary knowledge still stands as an important predictor of L2 reading ability
after the variance due to L2 grammar knowledge is partialled out, he or she can
simply enter L2 grammar knowledge first and L2 vocabulary knowledge second
into the equation, then check whether L2 vocabulary still manages to explain a
statistically significant amount of reading variance, which is indicated by the R2
change between the first model (with grammar only) and the second model (with
grammar and vocabulary). As can be seen in the example of L2 reading research,
this feature of a (hierarchical) regression analysis makes it possible to compare CV
variances accounted for by different models comprised of different sets of PVs,
thereby determining the best-fitting model.
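The R2 change test described above can be computed directly. The values below (R2 of .30 for grammar alone, .42 after adding vocabulary, and n = 120) are hypothetical and serve only to illustrate the F test for an increment in R2:

```python
def f_change(r2_reduced, r2_full, n, k_full, m):
    """F test for the increment in R2 when m PVs are added to a model
    that then contains k_full PVs in total (n = sample size)."""
    numerator = (r2_full - r2_reduced) / m
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator

# Hypothetical: grammar alone explains 30% of reading variance; adding
# vocabulary (m = 1 new PV) raises R2 to .42 with n = 120 learners.
F = f_change(r2_reduced=0.30, r2_full=0.42, n=120, k_full=2, m=1)
print(round(F, 2))  # 24.21
```

The resulting F is evaluated against the F distribution with (m, n − k_full − 1) degrees of freedom, here (1, 117), to decide whether the added PV explains a statistically significant increment of CV variance.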
MRA is not one analysis, but a family of analyses. This means that depending
on the purpose of the study and the type of research questions posed, there are
different analyses the researcher can choose. One of the primary factors determin-
ing the type of MRA to be used is the nature of the variables under investigation:
Is your CV categorical (e.g., admitted to a degree program vs. not admitted to a
degree program) or continuous (e.g., TOEFL score)? If the former, the appropri-
ate analysis would be logistic regression. In a similar vein, if your PVs include cat-
egorical variables with more than two levels, you would still use MRA. However,
because MRA can only handle categorical variables that are dichotomous, an
additional intermediate step, namely dummy variable coding, would be needed.
Let’s say, for example, that you hypothesize that first language (L1) background
(e.g., Spanish vs. Chinese vs. Russian) will affect reading comprehension of L2
English. In this case, rather than including a single three-level PV, you would cre-
ate two new dichotomous variables such as L1_Spanish and L1_Chinese, each
with possible values of 0 (not L1 Spanish/Chinese) or 1 (Spanish/Chinese).There
is no need to create a third variable for Russian because those participants would
be represented in the model by 0 in both of the other two newly created variables.
(In other words, when dummy coding, the number of new variables that need
to be created is equal to one fewer than the number of levels of the categorical
variable.)
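The dummy-coding rule (one fewer new variable than the number of levels of the categorical PV) can be sketched as follows; the L1 values and the choice of Russian as the reference level mirror the example above:

```python
def dummy_code(values, reference):
    """Create one dichotomous 0/1 variable per level, excluding the reference level."""
    levels = [lv for lv in dict.fromkeys(values) if lv != reference]
    return {f"L1_{lv}": [1 if v == lv else 0 for v in values] for lv in levels}

# Hypothetical participants' L1 backgrounds (three levels -> two dummy variables).
l1 = ["Spanish", "Chinese", "Russian", "Spanish", "Russian"]
dummies = dummy_code(l1, reference="Russian")
print(dummies)
# {'L1_Spanish': [1, 0, 0, 1, 0], 'L1_Chinese': [0, 1, 0, 0, 0]}
```

Russian speakers are identified by a 0 on both new variables, so no third variable is needed.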
Although few L2 researchers may be aware of this, if all the PVs are categorical, the mathematical equation yielded by MRA equals ANOVA or ANCOVA (Tabachnick & Fidell, 2012), and their effect sizes, despite the difference in labels (R2 in MRA and eta-squared, or η2, in ANOVA or ANCOVA), represent the same concept (see Cohen, 1968, and Chapter 11 in Howell, 2012, for a more detailed explanation of the link between R2 and η2 as parts of the general linear model, or GLM). Given the availability of
different types of analyses developed to address variables of a different nature, it is
crucial that the researcher not compromise the true nature of each variable and
select the most suitable analysis for the variables under investigation; for example,
converting a continuous variable (e.g., L2 proficiency level) to a categorical vari-
able (e.g., low, intermediate, high) and adopting an ANOVA instead of adopting
an MRA should be avoided whenever possible to preserve variance in continuous
PVs (for a relevant discussion, see Plonsky, 2013).
Because of the range of analyses that fall under the umbrella of MRA, it is not
uncommon for some graduate programs in quantitatively oriented disciplines (e.g.,
educational psychology, sociology) to offer semester-long seminars on this topic.
Therefore, before I progress further, I would like to note that this chapter should
be considered only as a guide to the most frequently used types of MRA in L2
research. Specifically, I will focus on MRAs that involve continuous PVs and a con-
tinuous CV. The additional steps necessitated by different members of the MRA
family will also be integrated into the discussion when appropriate. Last, I would
like to note that many of my explanations in this chapter are based on Cohen,
Cohen, West, and Aiken (2003) which, while being arguably the most definitive
volume on MRA in the market, may come across as too technical for novice to
intermediate users of MRA. In addition, Cohen et al. (2003) does not include a
section on how to use statistical packages to run relevant analyses.This chapter aims
to render the information in Cohen et al. (2003) more accessible to novice to inter-
mediate users of MRA and provide directions on how to run MRA using SPSS.
PVs have been removed) and the confidence interval (CI) set by the researcher
(e.g., 80%, 95%, 99%). Let’s imagine a situation where the researcher is inter-
ested in the amount of reading variance accounted for by the reader’s L1 lit-
eracy and L2 language knowledge (e.g., vocabulary, grammar). Based on previous
research or theory (e.g., Bernhardt & Kamil, 1995), the researcher knows that
three variables—namely, L1 literacy, L2 vocabulary knowledge, and L2 grammar
knowledge—explain about 50% of individual variance (R2) in L2 reading com-
prehension. Let’s suppose that the researcher currently has access to at least 120
participants available for data collection but wonders if this is a big enough sample. Based on these two conditions, the researcher can compute the squared standard error of R2 (i.e., SE²(R²)). The formula for SE²(R²) is as follows, where n and k respectively denote the number of currently available study participants (i.e., sample size) and the number of PVs (Cohen et al., 2003, p. 88):

SE²(R²) = 4R²(1 − R²)²(n − k − 1)² / [(n² − 1)(n + 3)]
Now, substituting R² with .50, n with 120, and k with 3 (L1 literacy, L2 vocabulary, L2 grammar), we get the following:

SE²(R²) = 4(.50)(1 − .50)²(120 − 3 − 1)² / [(120² − 1)(120 + 3)] = .0038
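The worked example can be reproduced in a few lines; the formula below is the one given above (Cohen et al., 2003, p. 88):

```python
def se2_r2(r2, n, k):
    """Squared standard error (sampling variance) of R2,
    per Cohen, Cohen, West, & Aiken (2003, p. 88)."""
    return (4 * r2 * (1 - r2) ** 2 * (n - k - 1) ** 2) / ((n ** 2 - 1) * (n + 3))

se2 = se2_r2(r2=0.50, n=120, k=3)
print(round(se2, 4))         # 0.0038
print(round(se2 ** 0.5, 3))  # 0.062 -- the standard error is the square root
```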
The standard error of R2 is the square root of this variance: SE(R²) = √.0038 ≈ .062. Given that the R2 is .50, if we choose the conventional 95% probability level, the CI is .50 ± 1.96 × .062, or approximately .38–.62. The interpretation is that, if samples like ours were drawn repeatedly from the population, intervals constructed this way would contain the population R2 95% of the time. If we chose the less conventional but more stringent level of 99% probability, the CI would be .50 ± 2.58 × .062, or approximately .34–.66; the greater confidence comes at the price of a wider interval. Since neither the 95% nor the 99% CI includes 0, which would discount the reliability of the observed value of R2, we can conclude that the current sample size and the number of PVs are appropriate to yield a reliable value.
Another point to consider when examining the CI is its range: a CI that is too wide fails to provide a precise estimate. To illustrate,
let’s suppose a situation where the 95% CI of R2 was .10–.90. Such a large CI
fails to offer useful information. In such a case, the researcher can adjust the CI
by increasing the sample size or by decreasing the number of predictor variables.
Once the data are collected and entered into a statistical software package such as
SPSS, the researcher can easily compute CIs of various probability levels. Step-by-
step instructions for computing a CI of the researcher’s choice are provided later
in this chapter.
Power. The technical definition of power is the probability of correctly rejecting a false null hypothesis or, more simply, the probability of detecting a statistically significant effect when the effect truly exists. To use an example from MRA, power is the probability of finding the R2 to be significantly different from 0 when it is in fact different from 0. The problem with low a priori power (i.e., an estimated power that is lower than .80 with the prospective sample
size prior to actual data collection) is evident. Even if the researcher somehow
managed to find a (seemingly) statistically significant finding, if the power was
very low to start with, the researcher risks claiming a statistical relationship where
it may not exist. Much like the procedures involved in the examination of preci-
sion discussed earlier, the computation of a priori power in the case of an MRA
also begins with locating the expected value of R2, based on previous research
and/or theory. For the sake of simplicity, let’s continue with the same example
we used earlier, namely, an R2 of .50. The researcher then selects a suitable probability level, or, to follow Cohen (1992), a significance criterion (e.g., α = .01 or α = .05). For now, let's go with the more conventional value of .05. The minimum sample
size (N = 120) and the number of predictors (k = 3) are the remaining determin-
ers we need to compute power.
With these determiners in hand, we first compute the population effect size (f²) using the following formula (Cohen et al., 2003, p. 92):

f² = R² / (1 − R²)

Replacing R² with .50, we get the following:

f² = .50 / (1 − .50) = 1
Now that we have the f² value, we use it to determine L, the value we need to identify power in the L table for the selected probability level (or significance criterion, i.e., α = .01 or α = .05) (Cohen et al., 2003). L is determined using the following formula (Cohen et al., 2003, p. 92):

L = f²(N − k − 1)

Continuing with our previous example of a reading MRA study, let's now replace f² with 1, N with 120, and k with 3. We then get:

L = 1(120 − 3 − 1) = 116
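A short sketch of the two formulas above:

```python
def cohen_f2(r2):
    """Cohen's effect size f2 for multiple regression."""
    return r2 / (1 - r2)

def l_value(f2, n, k):
    """Noncentrality parameter L used to look up power in Cohen's tables."""
    return f2 * (n - k - 1)

f2 = cohen_f2(0.50)
L = l_value(f2, n=120, k=3)
print(f2, L)  # 1.0 116.0
```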
With this L value and df (which is equal to k, and therefore equal to 3, in our case),
we now identify predicted power in the L table. For convenience in Figure 7.1
I provide the part of the L table relevant to this example. The values in the top
row are power and the values in the first column are df or k.
Since our df (or k) value is 3 and our L value is 116, we locate on the df = 3 row the entry closest to 116. The largest tabled entry in that row, 23.52, corresponds to a predicted power of .99; because our L of 116 exceeds even this value, our a priori power is at least .99, well above the .80 standard, and there is no need to increase the sample size.
Once the a priori checks are done and the data have been collected, we can
now submit the data to statistical analyses. However, prior to main analyses, the
researcher must first make sure that the data meet the assumptions of multivari-
ate analyses such as MRA (i.e., data screening), transform the data if they do not
meet the assumptions, and finally submit the data to the analysis proper, namely,
            Power
df (k)   0.1    0.3    0.5    0.6    0.7    0.75   0.8    0.85   0.9    0.95   0.99
1        0.43   2.06   3.84   4.90   6.17   6.94   7.85   8.98   10.51  13.00  18.37
2        0.62   2.78   4.96   6.21   7.70   8.59   9.64   10.92  12.65  15.44  21.40
3        0.78   3.30   5.76   7.15   8.79   9.77   10.90  12.30  14.17  17.17  23.52
4        0.91   3.74   6.42   7.92   9.68   10.72  11.94  13.42  15.41  18.57  25.24
5        1.03   4.12   6.99   8.59   10.45  11.55  12.83  14.39  16.47  19.78  26.73
FIGURE 7.1 Partial L value table (shortened from Cohen et al., 2003, p. 651)
an appropriate type of MRA. The same preparatory steps should be followed for
all types of MRA.
and all PVs). For example, if two PVs are highly correlated (r equal to or higher
than .90 or –.90) (Allison, 1999; Tabachnick & Fidell, 2012), you have a multi-
collinearity problem. In such a case, consider either collapsing the highly cor-
relating PVs into one variable or eliminating one of them from the analysis. This
old-fashioned approach to checking multicollinearity, however, is not a foolproof
solution because it is possible for all bivariate correlations to be in an acceptable
range even when multicollinearity is present (Allison, 1999). In order to avoid
such an oversight, Allison (1999) recommends that researchers refer to the Toler-
ance statistic or variance inflation factor (VIF), which is the multiplicative inverse
of Tolerance (VIF = 1/tolerance).
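As a sketch of the Tolerance/VIF logic (without SPSS): with three PVs, the auxiliary R2 needed for each tolerance can be obtained from the bivariate correlations via the standard two-predictor multiple-correlation formula. The correlations below are invented for illustration:

```python
def aux_r2(r_ja, r_jb, r_ab):
    """Squared multiple correlation of PV j regressed on two other PVs, a and b."""
    return (r_ja ** 2 + r_jb ** 2 - 2 * r_ja * r_jb * r_ab) / (1 - r_ab ** 2)

# Hypothetical bivariate correlations among three PVs (Voc, Grm, Metacog).
r_vg, r_vm, r_gm = 0.6, 0.5, 0.4

results = {}
for name, r2 in [("Voc", aux_r2(r_vg, r_vm, r_gm)),
                 ("Grm", aux_r2(r_vg, r_gm, r_vm)),
                 ("Metacog", aux_r2(r_vm, r_gm, r_vg))]:
    tolerance = 1 - r2          # proportion of the PV's variance NOT shared with the others
    vif = 1 / tolerance
    results[name] = (tolerance, vif)
    flag = "possible multicollinearity" if tolerance < 0.40 else "ok"  # Allison's (1999) rule of thumb
    print(f"{name}: tolerance = {tolerance:.3f}, VIF = {vif:.2f} ({flag})")
```

With these correlations, all three tolerances exceed .40 (all VIFs are below 2.50), so no multicollinearity would be flagged.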
Use these SPSS commands to obtain the Tolerance and VIF (see Figure 7.4):
Analyze > Regression > Linear. For Dependent, select one of the independent
variables (IVs). For Independent(s), select all other IVs under investigation. Click
Statistics in the Linear Regression dialogue box. Remove the check mark from
all items except Collinearity Diagnostics, then click Continue. Click the Plots tab on
the Linear Regression dialogue box, and make sure that nothing is selected, then
click Continue.
Now, in the Output view, you will get the following table.
Allison (1999) suggests that, as a rule of thumb, a tolerance value lower than .40 (a VIF higher than 2.50) indicates multicollinearity. As shown in Table 7.1, neither
the Tolerance values nor the VIFs are out of the acceptable range and there-
fore do not indicate a concern. Please note that this is only the first step of the
Tolerance/VIF statistic check. Now, we need to reiterate this process alternat-
ing the Dependent variable: We used “Voc” as the Dependent variable in the
first analysis, so this time we enter “Grm” as the Dependent variable and “Voc”
and “Metacog” as Independent variables. In the final step, “Metacog” will be
the Dependent variable and “Voc” and “Grm” will be Independent variables. If
TABLE 7.1 Coefficientsa [Tolerance and VIF values not reproduced]
multicollinearity is detected, you will have to decide how to handle this problem;
the simpler solutions include removing the most intercorrelated variable(s) from
the analysis or combining the two variables and using them as one variable. One
must take care, however, to avoid compromising the theoretical motivation of the
research by eliminating or combining variables.
Step 5, ensure a linear relationship. Check to see if the CV and PVs have
a linear relationship when observed pairwise and collectively. Linearity is one of
the assumptions of multivariate normality as Pearson’s r only captures linear rela-
tionships (Tabachnick & Fidell, 2012). You can check linearity by checking the
bivariate scatter plots of variables and residual plots. If some relationships are not linear despite the removal of univariate and multivariate outliers and transformation of problem variables (both of which have been completed in previous steps),
you might consider transforming the problem variable further to ensure linearity.
[FIGURE 7.5 Decision tree (partially recovered): Is the CV categorical? Yes → Logistic Regression; No → Multiple Regression Analysis. Do you want to determine the order of PV entry? Are you interested in the unique contribution of each and every PV? Yes → Standard Multiple Regression]
L2 vocabulary first, L2 grammar second, and metacognition scores last into the
equation, then examine the amount and statistical significance of incremental
reading variance at each step (i.e., change in total R2). This procedure is akin to
ANCOVA, where the effects of one independent variable (the covariate) are
removed or partialled out in order to isolate the effects of another.
MRA Type 3: stepwise regression analysis. Of the three types of MRA
introduced here, the most caution is advised when using stepwise regression
analysis. This is because unlike the first two types of MRA, the model specifica-
tion in stepwise regression analysis relies strictly on statistical criteria, namely,
the size of the correlation between the CV and PVs. To illustrate this point, let’s
take an example from forward selection (one of the three methods of stepwise
multiple regression, which include forward selection, backward deletion, and
stepwise regression). Let’s say that the PV with the highest correlation with the
CV, L2 reading comprehension, was L2 vocabulary knowledge. In the forward
selection method, the first PV to enter the equation is thus determined to be
L2 vocabulary. The contribution of L2 vocabulary includes both the unique
contribution made by L2 vocabulary and the potentially overlapping area with
another PV to be selected shortly. Next, in order to select the second PV, mod-
els including all possible pairs of PVs with L2 vocabulary as the default PV of
the two PVs (e.g., L2 vocabulary and L2 grammar, L2 vocabulary and meta-
cognition) are compared for their predictability, and the higher contributing
PV is selected as the second PV of the equation. Only the unique contribution
of the second PV is considered. As this example illustrates, model specification in stepwise regression is strictly statistical in nature (which is why stepwise regression analysis is also called statistical regression analysis). Should a researcher choose stepwise regression analysis over other types of MRA, the observed relative importance of a PV should therefore be considered with caution and in the context of previous research findings, theory, and sample size (see also Tabachnick & Fidell's, 2012, advice on this matter).
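The first step of forward selection described above can be sketched as follows; the scores are invented, and only the initial choice (the PV most strongly correlated with the CV) is shown:

```python
from math import sqrt

def pearson_r(x, y):
    """Bivariate (Pearson) correlation between two score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: reading comprehension (CV) and three candidate PVs.
reading = [55, 60, 68, 74, 81]
pvs = {
    "vocabulary":    [40, 48, 55, 63, 70],
    "grammar":       [52, 47, 60, 58, 66],
    "metacognition": [30, 38, 33, 45, 41],
}

# Forward selection, step 1: enter the PV most strongly correlated with the CV.
first_pv = max(pvs, key=lambda name: abs(pearson_r(pvs[name], reading)))
print(first_pv)  # vocabulary
```

Subsequent steps would then compare models pairing the entered PV with each remaining candidate, crediting later entrants only with their unique contributions, which is exactly why the resulting ordering is statistical rather than theoretical.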
To help you choose the appropriate type of MRA, in Figure 7.5 I present a
decision tree designed for this purpose. As depicted in the diamond in the upper
left corner of the diagram, your first decision hinges on whether the CV is cat-
egorical or continuous. If the CV is categorical, the appropriate analysis is logistic
regression. If the CV is continuous, however, the researcher should determine the
type of MRA by navigating further along the tree. The two types of MRA that
will be further discussed in this chapter are marked with ovals.
enter the CV of your choice. For Independent(s), simultaneously select all the
PVs of your choice. Click the Statistics tab to make selections for statistics of
interest. Here I selected model fit (probably the most important information),
CIs (notice you can adjust the probability level of CIs), Durbin-Watson (to check
for the independence of observation/independence of residuals). Click Continue.
In the Linear Regression dialogue box, click the Plots tab and select *ZRESID
(short for z residual) for the y-axis and *ZPRED (short for z predictor) for the
x-axis as illustrated in Figure 7.7.
By making these selections, you can create a residual scatter plot using stan-
dardized scores (thus the labels “z residual” and “z predictor”) and can check the
normality of residual distribution; if you have normality, the residual scatter plot
should reveal a pile of residuals in the center of the plot, which should resemble
a rectangular shape with residuals trailing off symmetrically in all four directions
from the center of the rectangle. In the next two figures I present two plots,
one of which shows normality (Figure 7.8) and the other a lack of normality (Figure 7.9). If a lack of normality is detected, it is recommended that the
researcher transform the data appropriately to achieve normality.
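What SPSS labels *ZRESID can be approximated by z-scoring the raw residuals (SPSS standardizes using the model's standard error of estimate, but the idea is the same); the residuals below are hypothetical:

```python
import statistics

# Hypothetical raw residuals from a fitted regression model.
residuals = [1.2, -0.8, 0.3, -1.5, 0.9, -0.1]

mean = statistics.mean(residuals)
sd = statistics.pstdev(residuals)
zresid = [(r - mean) / sd for r in residuals]  # roughly what SPSS labels *ZRESID

# Standardized residuals have a mean of ~0 and a standard deviation of ~1,
# which is what makes the rectangular, centered scatter pattern interpretable.
print(statistics.mean(zresid), statistics.pstdev(zresid))
```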
FIGURE 7.6 SPSS standard multiple regression dialogue boxes: the first dialogue box
and selections in the Statistics tab
FIGURE 7.7 SPSS standard multiple regression dialogue boxes: selections in the Linear
Regression Plots dialogue box
[FIGURE 7.8 Residual scatterplot (dependent variable: gtelprc): regression standardized residuals plotted against regression standardized predicted values]
[FIGURE 7.9 Residual scatterplot (dependent variable: psedcomp): regression standardized residuals plotted against regression standardized predicted values]
1. Model: As noted earlier, for standard multiple regression, the number should
be 1, indicating one model was generated.
2. R: This is what we call the multiple correlation coefficient, a multivariate equivalent of r (the correlation coefficient between two variables). Unlike r, however, R ranges from 0 to 1; it is an index of how well the CV is predicted by the set of PVs.
3. R Square (R2): As the name indicates, this is computed by multiplying R by
itself (.691 × .691), and is the proportion of variance in the CV accounted
for by the PVs. In other words, an R2 of .478 indicates that 47.8% of the vari-
ance in the CV is accounted for by the PVs.
4. Adjusted R Square: R2 is based on the study sample, not on the population from which the sample was drawn. For this reason, the R2 value tends to be inflated (or positively biased). Adjusted R2 takes this bias into account (thus the term "adjusted") and provides a more conservative value.
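Adjusted R2 can be computed by hand from R2, n, and k. The R2 of .478 matches the output discussed above; the sample size of n = 85 is a hypothetical value used only for illustration:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2: shrinks R2 to correct its positive (sample-based) bias."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R2 = .478 as in the output above; n = 85 is hypothetical.
print(round(adjusted_r2(0.478, n=85, k=3), 3))  # 0.459
```

Note that the shrinkage grows as k increases relative to n, which is why adjusted R2 penalizes models that chase variance with many PVs in small samples.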
The third table you should pay attention to is the ANOVA table. You might
wonder why there is an ANOVA table in the MRA output. The reason for this is
that an R2 value cannot be tested for statistical significance as it simply indicates
the proportion of the variance in the CV accounted for by the PVs. How do we
test, then, the statistical significance of the regression model that we have just generated? In other words, how can we determine that knowing the value of a certain PV allows us to predict the value of the CV statistically significantly better than when we don't know the value of the PV (i.e., when the regression coefficient of this PV is 0 and creates a flat line with no slope, which is essentially the null hypothesis of the MRA)? In the case of group comparison (i.e., a categorical PV), we test
whether or not participants’ group membership (treatment group 1 vs. treatment
group 2 vs. treatment group 3) provides extra information about the mean (i.e.,
the null hypothesis of ANOVA). Do you now see that although we use MRA
and ANOVA to investigate different types of research questions, they both rely on
similar principles? In fact, we can think of ANOVA as a type of MRA in which
the PV(s) are all categorical. This is why we use the F-ratio to examine the statistical
significance of MRA as well (see Table 7.4).
[Table 7.4: ANOVA output]
1. Take a look at the “Mean Square” column. This is where the mean sum of
squares of the regression model and that of the residual are reported. The
former divided by the latter (381.573/15.265) is expressed as the F-ratio
(24.996) in the next column.
2. Check the “Sig.” column for the associated significance level of this F-ratio.
It is .000, which is smaller than the conventional .05 probability level, indicating
that the departure of the regression line from a flat line is beyond what would be
expected by chance. Since the model is statistically significant, we can now
continue to report other details of the model.
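The arithmetic behind this table can be reproduced from the reported values. In this R sketch, the two mean squares are taken from the output above; the degrees of freedom (3 for the regression, 82 for the residual) are assumptions for illustration, since they depend on the number of PVs and participants.

```r
# F-ratio = mean square of the regression model / mean square of the residual
ms_regression <- 381.573
ms_residual   <- 15.265
F_ratio <- ms_regression / ms_residual
round(F_ratio, 3)   # approximately the 24.996 reported by SPSS

# The "Sig." value is the upper-tail probability of this F-ratio
p <- pf(F_ratio, df1 = 3, df2 = 82, lower.tail = FALSE)
p < .05             # TRUE: the model is statistically significant
```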
The next table of interest (Table 7.5), “Coefficients,” reports the regression
coefficients (B) and their 95% CIs.
How to read this table:
[Table 7.5: Coefficients output]
expected to increase by an average of .171 units. Also note that although this
was not the case with the current example, it is possible to have a negative
coefficient (e.g., –.171). In such a case, the interpretation would be in the
reverse direction: e.g., for every additional unit in testing anxiety, reading com-
prehension test performance is expected to decrease by an average of .171 unit.
2. The “Sig.” column shows the significance level of each regression coefficient.
In our case, only the variable “Grm” (Grammar test) has a statistically signifi-
cant coefficient.
3. The “95.0% CI for B” columns show the 95% CI associated with each
regression coefficient. You can see that the CIs of the two nonsignificant
regression coefficients (“Vocabulary” and “Metacognition”) both include 0,
indicating lack of reliability associated with their coefficients.
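The correspondence between a coefficient’s “Sig.” value and whether its 95% CI includes 0 can be illustrated with a small R simulation (the variables x1 and x2 are hypothetical; only x1 truly predicts y):

```r
set.seed(2)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.8 * x1 + rnorm(n)   # x2 contributes nothing

fit <- lm(y ~ x1 + x2)
ci  <- confint(fit, level = 0.95)   # 95% CI for each coefficient
p   <- summary(fit)$coefficients[, "Pr(>|t|)"]

# A coefficient is significant at .05 exactly when its 95% CI excludes 0
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
cbind(ci, p = round(p, 3), excludes_zero)
```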
FIGURE 7.10 SPSS hierarchical regression analysis dialogue boxes: selections of PVs
for the first model
1. You will notice in the “Model” column that two models are presented. This
is because this hierarchical regression analysis examines whether Model 2,
which includes grammar, vocabulary, and metacognition, offers a signifi-
cantly better fit than Model 1, which only includes grammar and vocabulary.
Check the variable names and their corresponding models to make sure that
you entered the PV (or a set of PVs in case you entered multiple PVs) at the
correct step.
FIGURE 7.11 SPSS hierarchical regression analysis dialogue boxes: selections of PV for
the second and final model and selection of statistics
Now, let us review the next table, Model Summary (Table 7.7).
How to read this table:
1. Model column: Both models here are standard regression models with the
same CV but with different sets of PVs; Model 1 has two PVs (grammar and
vocabulary) while Model 2 has three (grammar, vocabulary, and metacogni-
tion). Interpretation of R, R2, Adjusted R2 for each model is, therefore, the
same as that of standard multiple regression (see above).
2. Change Statistics: This is what distinguishes hierarchical regression from standard
multiple regression. The R2 change of Model 2 indicates the increase in the pro-
portion of the variance in the CV explained when the full model (i.e., Model 2)
includes metacognition as the third PV. The statistical significance of the difference
between Model 1 and Model 2 can also be tested using the F-test, and the result
is reported in the “Sig. F Change” column. In our case, the addition of the third
PV did not result in a statistically significant change. Therefore, including
metacognition as the third PV, although it would be helpful in explaining a small
additional amount of variance in the CV, would not be helpful in pursuing a
parsimonious model. Further evidence of the lack of variance accounted for by
metacognition can also be observed in the nearly identical R2 values for the two models.
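The same sequential logic can be sketched in R with two nested ordinary regression models; anova() on the pair reproduces the R2-change F-test (“Sig. F Change”). The data are simulated, with metacognition deliberately contributing no real variance:

```r
set.seed(3)
n    <- 80
grm  <- rnorm(n)
voc  <- rnorm(n)
meta <- rnorm(n)
reading <- 0.7 * grm + 0.3 * voc + rnorm(n)

m1 <- lm(reading ~ grm + voc)          # Model 1: covariates only
m2 <- lm(reading ~ grm + voc + meta)   # Model 2: full model

r2_change <- summary(m2)$r.squared - summary(m1)$r.squared
r2_change          # increase in variance explained by adding meta
anova(m1, m2)      # F-test of whether that increase is significant
```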
TABLE 7.6 SPSS output for variables entered/removed in hierarchical regression model
[Table 7.7: Model Summary output]
The following ANOVA table (Table 7.8) reports on the statistical significance
of the two models generated in this analysis.
How to read this table: Since both models are essentially standard multiple
regression models, you can read this table just as you would read the ANOVA
table of standard regression analysis output described earlier. Here we can see
from the last column, “Sig.”, that both Model 1 and Model 2 are statistically sig-
nificant, although as I discussed previously, the latter model lacks parsimony and
therefore is not recommended.
The last table to note is the Coefficients table (Table 7.9). Again, the interpre-
tation of coefficients for each model is the same as that of the previously reviewed
standard multiple regression model. Since the focus of a study that employs a
hierarchical regression analysis is often on the full model with all PVs included, the
reporting of the coefficients of Model 2 is likely to be your primary task.
How to read this table:
1. The regression weights for each PV in Model 1 and Model 2 are reported
in the “B” column under “Unstandardized Coefficients.” The statistical
significance of each regression weight is reported in the “Sig.” column, indi-
cating that in both models, only grammar had a statistically significant regres-
sion weight.
TABLE 7.8 SPSS output for ANOVA resulting from hierarchical regression
[Table 7.9: Coefficients output]
2. The 95% CIs associated with each regression coefficient are reported in the
rightmost column. Here we can see that in Model 1, the 95% CI for vocabu-
lary included 0, and that in Model 2, the 95% CIs for both vocabulary and
metacognition included 0, indicating a lack of reliability associated with their
regression weights.
STUDY BOX 1
Jeon, E. H. (2012). Oral reading fluency in second language reading. Reading in a
Foreign Language, 24(2), 186–208.
Background
Despite increasing interest in fluency and its role in L2 reading, investigation
of fluency in the context of other key reading components is scarce. This
study aimed to (a) expand the current understanding of L2 oral reading flu-
ency by identifying its relationship with other key reading predictors (e.g.,
decoding, vocabulary knowledge, grammar knowledge, and metacognition),
and (b) examine the predictive power of oral reading fluency on L2
reading comprehension, thereby examining the potential of reading fluency
as a proxy for L2 reading comprehension.
Research Questions
1. How does oral reading fluency relate to other components of L2
reading?
2. Are word-level reading fluency and passage reading fluency substan-
tially different from each other? If so, why?
3. Can oral passage reading fluency be considered a proxy for L2 reading
comprehension among the present study participants?
Method
A total of 255 10th graders in South Korea who had been studying English for 7.5 years
were assessed on nine variables (three fluency variables, five other key read-
ing components, and reading comprehension): pseudoword reading, word
reading, passage reading, morphological awareness, word knowledge,
grammar knowledge, listening comprehension, metacognitive awareness,
reading comprehension.
Statistical Tools
Pseudoword reading, word reading, and passage reading scores were used
as predictor variables and reading comprehension was used as the criterion
variable in an MRA. Four hierarchical regression analyses were carried out,
alternating the entry order each time.
Results
The regression analysis results showed that the three reading fluency vari-
ables collectively explained a statistically significant 21.2% (p < .001) of vari-
ance in silent reading comprehension and that passage reading fluency was
a more potent explanatory variable than word-level fluency variables. As the
first variable to enter the regression, oral passage reading fluency explained
a significant 20.9% (p < .001) of reading variance. When entered follow-
ing the Pseudoword Reading Test and the Word Reading Test, the Passage
STUDY BOX 2
Jin, T., & Mak, B. (2013). Distinguishing features in scoring L2 Chinese speaking
performance: How do they work? Language Testing, 30 (1), 23–47.
Background
Research on the link between distinguishing features (fluency, vocabulary)
and overall oral proficiency is well-established in L2 English but not in L2
Chinese. This study aims to investigate the predictive power of seven dis-
tinguishing features representing four constructs (pronunciation, fluency,
vocabulary, grammar) on holistically graded speaking performance.
Research Questions
1. What is the relationship between each individual distinguishing feature
and the speaking test scores?
2. What is the contribution of distinguishing features to speaking test scores?
Method
A total of 66 advanced L2 Chinese learners and two raters participated in the study.
Pronunciation (number of target-like syllables per 10 syllables), fluency
(speech rate and pause time), vocabulary (word tokens and word types),
and grammar (grammatical accuracy and grammatical complexity) were
assessed. Speaking ability was measured through three test tasks, each of
which included integrated and independent tasks.
Statistical Tools
A bivariate correlation matrix showed that six of the seven distinguishing fea-
tures were significantly correlated with speaking test scores. As a result, two
standard multiple regressions were carried out with those six distinguishing
features as predictor variables and speaking test scores as the criterion vari-
able. In the first regression, one of the two vocabulary measures (i.e., word
tokens) was used and in the second regression, the other vocabulary mea-
sure (i.e., word types) was used.
Results
Total R2s yielded by the first and second regression models were very high at
.79 and .77, respectively. In both regressions, target-like syllables, grammatical
accuracy, and word tokens and types were found to be significant predictor vari-
ables. These results provided empirical support that distinguishing features and
holistic speaking test scores are linked among advanced L2 Chinese learners.
Further Reading
General Textbooks
1. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/cor-
relation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum
Associates.
This is probably the most detailed volume on MRA for psychologists and social
scientists. The volume is full of in-depth discussion on theoretical and mathematical
aspects of MRA, but has little connection with statistical packages.
2. Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Pine Forge Press.
This volume is entirely devoted to MRA. Compared to Cohen et al. (2003), it is
slightly less technical and geared toward novice users of MRA.
3a. Field, A. (2013). Discovering statistics using IBM SPSS statistics. Los Angeles: Sage.
3b. Howell, D.C. (2012). Statistical methods for psychology (8th ed.). Pacific Grove, CA:
Wadsworth Publishing.
3c. Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS.
New York: Routledge.
3d. Stevens, J. P. (2002). Applied multivariate statistics for the social sciences (4th ed.). Mahwah,
NJ: Lawrence Erlbaum Associates.
3e. Tabachnick, B. G., & Fidell, L. S. (2012). Using multivariate statistics (6th ed.). Boston,
MA: Pearson.
All five volumes listed are comprehensive statistics textbooks and include a chapter
on MRA. They include conceptual and mathematical explanations as well as
commands for statistical packages (e.g., SPSS, SAS). These textbooks are used widely for
graduate level statistical courses for students in psychology, social sciences, and in the
case of Larson-Hall (2010), for applied linguistics.
Discussion Questions
1. Review the past five years’ issues of one or more L2 journals to locate studies
that used MRA. For the studies that used MRA, tally the types of MRA based
on their frequency of use. Is there a particularly frequently used type of MRA
within a certain subdiscipline of applied linguistics (e.g., language testing, soci-
olinguistics, language proficiency research)? If so, why do you think this is?
2. When using an MRA (and, indeed, any type of modeling analysis),
researchers care about identifying a model that fits the data well but
that is, at the same time, parsimonious. Why is model parsimony important?
Review the MRA studies collected for Discussion Question 1. Do you think
all of them struck a happy medium between model fit and parsimony? Did
any of the studies sacrifice one for the other?
3. Jeon (2012) and Jiang, Sawaki, and Sabatini (2012) both used hierarchical
regression analysis to investigate a similar issue. Read both articles as a set and
see how the two articles converse with each other both theoretically and
methodologically. To what extent are their respective uses of MRA informed by and
justified according to the predictions of theory and of previous research?
4. Jin and Mak (2013) showcases the use of standard multiple regression in a
testing setting. What were the study’s PVs and CV, and why was standard
multiple regression chosen for this study? Can you imagine other instances when
MRA might be appropriate in the context of research in L2 assessment?
Note
1. For example, I know from previous research (Jeon & Yamashita, 2014) that L2 grammar
and L2 vocabulary are the two strongest correlates of L2 reading comprehension.
However, other researchers have noted that metacognition is also an important reading
predictor. For this reason, I am entering vocabulary and grammar as the two covariates
in the first block of this analysis.
References
Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Pine Forge Press.
Bernhardt, E. B., & Kamil, M. L. (1995). Interpreting relationships between first language
and second language reading: Consolidating the Linguistic Threshold and the Linguis-
tic Interdependence Hypotheses. Applied Linguistics, 16, 15–34.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70, 426–443.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Howell, D.C. (2012). Statistical methods for psychology (8th ed.). Pacific Grove, CA: Wads-
worth Publishing.
Jeon, E. (2012). Oral reading fluency in second language reading. Reading in a Foreign Lan-
guage, 24(2), 186–208.
Jeon, E., & Yamashita, J. (2014). L2 reading comprehension and its correlates: A meta-analysis.
Language Learning, 64, 160–212.
Jiang, X., Sawaki, Y., & Sabatini, J. (2012). Word reading efficiency and oral reading fluency
in ESL reading comprehension. Reading Psychology, 33, 323–349.
Jin, T., & Mak, B. (2013). Distinguishing features in scoring L2 Chinese speaking perfor-
mance: How do they work? Language Testing, 30(1), 23–47.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Stevens, J. (1996). Applied multivariate statistics for the social sciences. Mahwah, NJ: Lawrence
Erlbaum Associates.
Tabachnick, B. G., & Fidell, L. S. (2012). Using multivariate statistics (6th ed.). Boston,
MA: Pearson.
8
MIXED EFFECTS MODELING
AND LONGITUDINAL DATA
ANALYSIS
Ian Cunnings and Ian Finlayson
Introduction
Consider a study that investigates two ESL teaching strategies. A researcher might
recruit participants from two schools and administer a course based on one teaching
strategy at each school. Participants’ proficiency would be tested at the
start and end of the course, and potentially a number of times in between, and the
relative increase in proficiency over time would be used as an indicator of which
strategy (if any) leads to a greater increase in proficiency. In a statistical analysis of
this type of study, the researcher will of course want to assess whether the influ-
ence of the independent variable, “teaching strategy,” on the English proficiency
of the participants sampled is likely to generalize to the wider population of
English language learners. The influence of “teaching strategy” on the dependent
variable, “English proficiency,” is modeled statistically as a fixed effect. A random
effect parameter in a statistical analysis models the random variance across the
participants tested. The researcher will want to assess whether the influence of the
fixed effect generalizes beyond the participants sampled to the wider population,
while taking into account any random variation observed. Simply put, a mixed
effects model is a statistical model that contains both fixed and random effects.
This hypothetical study is an example of a longitudinal design, as participants
are tested at multiple points in time. Longitudinal studies provide an important
tool to the second language (L2) researcher, as they provide the opportunity to
investigate how any number of factors may affect L2 acquisition over time. In
this chapter, we provide an overview of how longitudinal data can be analyzed
using mixed effects models. Mixed effects models have a number of properties
that mark them as particularly useful for L2 researchers interested in longitudinal
analysis or other research designs. Mixed effects models can be used to analyze a
variety of types of data and offer an alternative to the near ubiquitous use of t-tests
and ANOVA in the field (Lazaraton, 2000; Norris & Ortega, 2000; Plonsky, 2013;
Plonsky & Gass, 2011). We first discuss how mixed effects models might benefit
L2 researchers, before providing a practical example of how longitudinal mixed
effects data analysis can be conducted.
participants, but also potential random variation arising from the way students are
clustered into classes.
Observations can also cluster in a nonnested fashion. For example, students
within the same ESL class might come from different first language (L1) back-
grounds, and students with the same L1s might be spread across different classes in
a school. In this case, although students are hierarchically nested into both classes
and L1s, classes and L1s are not nested. Rather, classes and L1s are crossed at the
same level of sampling. In addition to nested random effects, mixed effects models
can also include crossed random effects to model factors that are crossed, as in
classes and L1s in this example, at the same level of sampling (Raudenbush, 1993).
The ability to model nested and crossed random effects provides a new solu-
tion to an old problem in language research, namely Clark’s (1973) “language-as-
fixed-effect fallacy.” Clark argued that in language research, just as participants are
sampled from a wider population, so too the linguistic materials or target features
tested are also sampled from a wider population of materials or features that share
the same properties. As language researchers usually want to test if results general-
ize both to the wider population of people and the wider population of linguistic
materials, Clark argued both sources of random variance need to be taken into
account. A long-standing solution to this issue has been to conduct two separate
analyses of a given data set, one in which the data are averaged over the sampled
subjects (the F1 analysis) and a second in which they are averaged over the sampled
linguistic items (the F2 analysis).
A result is considered significant if it is reliable by both subjects and items.
However, conducting separate subjects and items analyses is not a full solution
to Clark’s problem: Although the subjects analysis takes into account random sub-
ject variance and the items analysis random item variance, neither analysis takes
both sources of random variance into account at the same time. On a practical
level, it is also difficult to interpret a result that is reliable in one analysis but not
the other. Mixed effects models offer an alternative solution. In language research,
the subjects sampled are tested on a series of linguistic items, and the same lin-
guistic items are tested across subjects. In this way, subjects and items are crossed
at the same level of sampling. Mixed effects models with crossed random effects
for subjects and items allow both subject and item variance to be accounted
for in a single analysis and thus provide a better solution to Clark’s language-as-
fixed-effect fallacy than separate F1 and F2 analyses (Baayen, Davidson & Bates,
2008; Locker, Hoffman, & Bovaird, 2007).
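In the lme4 formula notation introduced later in this chapter, crossed random effects are simply two separate random-intercept terms. The sketch below only constructs the formula; rt, condition, subject, and item are hypothetical variable names:

```r
# Crossed random intercepts for subjects and items in one model
f <- rt ~ condition + (1 | subject) + (1 | item)
class(f)      # a formula object, ready to pass to lme4::lmer()
all.vars(f)   # the variables a data frame would need to supply
```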
Another advantage of mixed effects models over ANOVA is their flexibility in
the types of independent variables that may be considered. Like other types of
regression analyses, mixed effects models allow us to model variance due to
continuous as well as categorical predictors. In a longitudinal study, where changes
over time are of particular interest, this makes mixed effects models an attractive
option for data analysis. Returning to our earlier example, we may wish to take
multiple measurements during the length of our study to explore how proficiency
improves over time. With ANOVA, this could be analyzed as differences between
& Nilsson (2013) and Meunier & Littre (2013; see Sample Study 1) have used
mixed effects analysis of longitudinal L2 data. In the next section, we discuss how
such analysis can be carried out. While our analysis involves a fictional longitu-
dinal study taking place over a matter of months, different types of longitudinal
effects can be analyzed with mixed effects models. This can include, for exam-
ple, effects relating to how participants perform over the course of an individual
experiment, as well as investigations of change over longer periods of time.
SAMPLE STUDY 1
Meunier & Littre (2013). Tracking learners’ progress: Adopting a dual ‘corpus cum
experimental data’ approach. The Modern Language Journal, 97, 61–76.
Background
The acquisition of tense and aspect marking in L2 English has been
well-researched in second-language acquisition (SLA). Meunier and Littre
conducted a longitudinal corpus-based analysis to investigate which prop-
erties of tense and aspect marking remain difficult to master even after a
number of years of exposure to English.
Results
The results showed that tense and aspect error rates reduced over time.
Certain properties of the English progressive, however, continued to present
considerable difficulties. Meunier and Littre used the results of the mixed
effects corpus analysis to inform construction of an experimental gram-
maticality judgment task testing the acquisition of specifically those struc-
tures that were found particularly difficult to acquire. This type of combined
approach to the study of L2 acquisition, facilitated by mixed effects analy-
sis of longitudinal data, thus provides an opportunity to gain an in-depth
understanding of developmental patterns in L2 acquisition that is not pos-
sible with traditional analyses that solely rely on cross-sectional designs.
Practical Example
The example data set we discuss in this section is longitudinal, although the issues
raised are of general relevance to mixed effects analysis. The example uses the R
software package (R Development Core Team, 2014). Mixed effects analyses can also be
conducted in SPSS, SAS and STATA. R is a command-line driven application that
readers used to the menu system of SPSS might initially find taxing. It is beyond the
scope of this chapter to provide a comprehensive introduction to R syntax, but the
reader is directed to the Further Reading section for some recommended reading.
In addition to the functionality of the basic installation of R, additional
packages can be downloaded to perform specific analyses. This chapter will mainly
employ the lme4 package (Bates, 2005), which provides an up-to-date
implementation of linear mixed effects models. Our analysis uses lme4 version
1.1–7. Different versions may display slightly different results. We also note useful
functions in the psych (Revelle, 2014) and car (Fox & Weisberg, 2011) packages.
Consider again our fictional study that tests two English language teaching
strategies (Strategy A and Strategy B). To test the strategies, one group of L2
English learners is taught using Strategy A and a second using Strategy B.
The two groups’ English proficiency is assessed at the start of the course and at
four additional points over the course of instruction. English proficiency is used as
the dependent variable to assess the relative effectiveness of each teaching strategy.
A simulated data set for this study can be found in the Longitudinal.RData
supplementary file, available on the companion website for this book (http://oak.
ucc.nau.edu/ldp3/AQMSLR.html). Longitudinal.RData contains a data frame
called scores which contains the longitudinal data. A data frame is an R object
that contains a table of rows, each containing an individual observation, and col-
umns, which each contain a different variable. To display the first six rows we can
use the function head().
> head(scores)
student class time course gender L1 age exp prof
1 1 1 0 A M J 27 3 12
2 1 1 6 A M J 27 3 22
3 1 1 12 A M J 27 3 27
4 1 1 18 A M J 27 3 36
5 1 1 24 A M J 27 3 36
6 2 1 0 A F J 31 4 15
The first column, Student, identifies the study’s 156 participants. The Class
column groups these students into six classes. Proficiency is graded at five points
in time from the start of the course onwards in the Time column (0 months, 6
months, 12 months, 18 months, and 24 months), which is why the data for Student
1, for example, occupy five rows. The data include cells missing at random to
simulate students missing particular tests (e.g., Student 13 was tested only four times).
The Course column identifies the main independent variable, “teaching strat-
egy” (A or B). In this between-groups design, Classes 1–3, comprising Students
1–74, were tested on Strategy A and Classes 4–6, comprising Students 75–156,
were tested on Strategy B. The next three columns contain additional informa-
tion about the participants, including their gender, L1, and age. The Exp column
provides a measure of previous exposure to English, in terms of the number of
months that each participant has spent in an English-speaking country. Finally,
“prof,” the dependent variable, provides the proficiency score for each student at
each of the five test points.
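The long format described above, one row per student per test point, can be sketched with a tiny hypothetical data frame (values invented for illustration; the real scores data frame is on the companion website):

```r
toy <- data.frame(
  student = c(1, 1, 2, 2),
  class   = c(1, 1, 1, 1),
  time    = c(0, 6, 0, 6),
  course  = c("A", "A", "A", "A"),
  gender  = c("M", "M", "F", "F"),
  L1      = c("J", "J", "J", "J"),
  age     = c(27, 27, 31, 31),
  exp     = c(3, 3, 4, 4),
  prof    = c(12, 22, 15, 24)
)
head(toy)   # same column layout as head(scores) above
```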
Before analyzing the data we use the describeBy() function in the psych
package to calculate descriptive statistics. This function provides descriptive
statistics for the Prof column of the scores data frame as grouped by the Time and
Course columns. Note that describeBy(), which is similar to the Explore function
in SPSS, computes additional statistics, but the output shown next has been
edited to save space.
> describeBy(scores$prof, group = list(scores$time, scores$course))
Time  Strategy   n   Mean     SD  Median
   0         A  71  18.10   7.59      18
   0         B  78  20.17   8.68      20
   6         A  72  25.49   9.62      25
   6         B  79  37.95  14.79      38
  12         A  72  37.68  13.94      38
  12         B  81  57.40  15.34      56
  18         A  71  48.86  16.65      50
  18         B  77  73.91  13.88      76
  24         A  70  58.81  17.01      58
  24         B  78  83.56  11.54      86
The data show the average proficiency scores for Strategy A and B at five test
points. While proficiency is similar at month 0 (18 and 20 for Strategy A and B
respectively), by month 24 the average proficiency for Strategy B (84) is higher
than Strategy A (59), suggesting Strategy B is more effective. Due to limitations of
space, we do not discuss how these data could be visualized in detail. The sources
mentioned in the Further Reading section provide detail on how data can be
visualized in R (see also Hudson, Chapter 5 in this volume, for general discussion
of data visualization).
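If the psych package is unavailable, comparable group means can be computed with base R's aggregate(). This sketch uses a small simulated stand-in for the scores data frame:

```r
set.seed(4)
sim <- data.frame(
  course = rep(c("A", "B"), each = 10),
  time   = rep(c(0, 24), times = 10),
  prof   = round(runif(20, min = 10, max = 90))
)
# Mean proficiency for each time-by-strategy cell
aggregate(prof ~ time + course, data = sim, FUN = mean)
```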
To test for differences between teaching strategies we use the lmer() function
in the lme4 package to fit a mixed effects model to the data. Before fitting the
statistical model, first consider the steps required in this analysis. The first step is
to consider the distribution of the dependent variable and decide which type of
model to fit. In this study, assume that the researcher had access to the students’
proficiency scores as graded by the class teacher, and as such we use a linear mixed
effects model. We first check whether the dependent variable follows a normal
distribution. We visually check the distribution using the qqnorm() function.
> qqnorm(scores$prof)
This function plots the proficiency scores as in Figure 8.1 (left panel), which,
if the scores were normally distributed, would form a straight line. We can see that
this is not the case. We thus transform the variable to more closely resemble a
normal distribution. As in standard analyses, there are different ways to transform
variables. As the grades were
out of 100, we perform the logit transformation. We transform the variable in the
Prof column using the logit() function from the car package and create a new
column called “l_prof.”

> scores$l_prof <- logit(scores$prof, percents = TRUE)
FIGURE 8.1 Q-Q plots for untransformed (left) and transformed (right) proficiency
scores
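The transformation itself is simple log-odds arithmetic. The base-R sketch below assumes scores strictly between 0 and 100; car's logit() additionally adjusts values at the boundaries (0 or 100), which this simplified version does not handle:

```r
to_logit <- function(pct) {
  p <- pct / 100      # percentage to proportion
  log(p / (1 - p))    # log odds; equivalent to qlogis(p)
}
to_logit(c(20, 50, 80))   # symmetric around 0
```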
two levels and you want to compare each level to a baseline condition. Treatment
coding is, however, different from the coding scheme of standard ANOVA, and does
not produce ANOVA-style main effects. To obtain main effects, sum coding is
used, which requires the two levels of our fixed effect “course” (A and B) to be
recoded as –0.5 and 0.5. We recode “course” into the sum coded column called
“s_course” as below.

> scores$s_course <- ifelse(scores$course == "A", -0.5, 0.5)

For further information on how different coding schemes alter the interpretation
of results in mixed effects models and regression analysis in general, see
Gillespie (2010) and Chen, Ender, Mitchell, and Wells (2003, Chapter 5).
The “time” variable enters the fixed effects as a continuous predictor rather than
a categorical factor. In this analysis, we assume the effect of “time” on l_prof is
linear. Mixed effects models can, however, also model the effect of time in a
nonlinear fashion. For further discussion of different ways to model time in longitudinal
analysis, see Mirman, Dixon, and Magnuson (2008), Mirman (2014) and Singer
and Willet (2003, Chapter 6).
When including a continuous predictor, it is useful to center each value around
the mean, as this helps reduce collinearity in the model (see Jaeger, 2010). Cen-
tering involves subtracting the mean value of the predictor from each individual
value. Below, we add a column called “c_time” that centers the values from the
Time column.

> scores$c_time <- scores$time - mean(scores$time, na.rm = TRUE)
The next step is to consider what random effects to include. We will need
random effects parameters to model all known sources of random variance amongst
the different participants in our study. As six different classes of students were
tested, we also need random effects parameters to model the variance across
classes. Finally, as students are hierarchically clustered within classes, the model
will need to include a nested random effects structure that specifies that students
are nested under classes. The syntax that follows will fit a mixed effects model to
our data taking these considerations into account.

> model.1 <- lmer(l_prof ~ s_course*c_time + (1|student) + (1|class), data = scores)
This fits a mixed effects model called “model.1” (note this name is arbitrary)
in which the dependent variable l_prof is analyzed in terms of the fixed effects
parameters s_course*c_time. This notation is a shorthand that specifies both
main effects and all possible interactions, but the notation s_course+c_time+s_
course:c_time could instead have been used, which explicitly specifies main
Mixed Effects Modeling and Data Analysis 169
effects for s_course and c_time and the s_course:c_time interaction. In a more
complex design with several higher order interactions, this flexibility in R syn-
tax allows the researcher to specify which interactions to include based on the
hypotheses being tested.
The next part of the syntax specifies the random effects. These are specified
with parentheses () to distinguish them from the fixed effects. The syntax (1|student) specifies a random intercept for students and (1|class) a random intercept for
classes (in R, 1 is used here to signify the presence of an intercept, while 0 could
be used to signify its absence). These random intercepts model how the overall
average proficiency scores for each student and each class vary randomly. The final
part of the syntax, data = scores, specifies which data frame is analyzed. Note that
we have not explicitly specified that the random effects are nested. As we coded
each student and each class with unique identifiers, the model is able to “work
out” the nested structure automatically. This would not be the case if the variables
were coded differently. If the three classes taught with Strategy A were coded as
1–3 and also the three classes with Strategy B as 1–3 (rather than 1–6), the nested
structure would need to be explicitly stated. We suggest adopting a similar coding
scheme to that used here so as to avoid this issue. A summary of the model (i.e.,
output) is obtained as follows.
> summary(model.1)
Scaled residuals:
Min 1Q Median 3Q Max
-3.2484 -0.5629 -0.0208 0.6198 4.8181
Random effects:
Groups Name Variance Std.Dev.
student (Intercept) 0.2334 0.4831
class (Intercept) 0.1090 0.3301
Residual 0.2203 0.4694
Number of obs: 749, groups: student, 156; class, 6
Fixed effects:
Estimate Std. Error t value
(Intercept) -0.146509 0.141280 -1.04
170 Ian Cunnings and Ian Finlayson
This tells us that the mixed effects model is fit using a restricted maximum
likelihood technique (REML). The model formula (syntax) is then given, followed by the REML criterion. This is a measure of how much the model deviates
from a “saturated” model (a model with a parameter for each observation). This
is known as deviance, and gives an indication of how well the model fits the data.
A number closer to 0 indicates a well-fitted model. Note that the absolute value
here is difficult to interpret, but the difference in values between two different
models fit to the same data can be used to assess which model provides a better fit
(see page 172). The scaled residuals provide a summary of the distribution of the
per-observation error (i.e., how the observed data differ from the values predicted
by the model). These values should be approximately symmetrical if the assump-
tion of normality has been met.
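One informal way to check this, sketched here under the assumption that a fitted model object named model.1 exists, is to inspect the residuals graphically in base R:

```r
# Extract per-observation residuals from the fitted model and plot them
r = residuals(model.1)
hist(r)       # should look roughly symmetric around zero if normality holds
qqnorm(r)     # points close to the reference line suggest normality
qqline(r)
```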
The summary then provides information about the random effects. The sum-
mary shows that we have included random intercepts for “student” and “class”
and provides information about the variance associated with each. The summary
then shows the residual variance, which is the amount of variance that is not
explained by the model. This is followed by information about the number of
observations and how they are grouped. Finally, we get information about the
fixed effects, including model estimates, standard errors, and t values. Note that p
values are not shown. We discuss this in more detail next.
The model estimates also provide an estimate of the size of the effects. That
the estimate of the main effect of “c_time” is positive indicates that proficiency
increased over time. The estimate for “c_time” indicates that for every one unit
increment in “c_time,” the (logit transformed) proficiency scores increased by
0.116 units. Note that these absolute values are perhaps difficult to interpret in
this instance as the dependent variable was transformed. In other analyses, the
estimates may be more easily interpretable. For example, if the dependent mea-
sure was a reaction time in milliseconds, the estimates would indicate differences
between conditions in milliseconds. The estimates could then be used to gauge
the magnitude of an effect in order to understand the extent to which the two
conditions differed.
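As an aside on the transformation itself: a logit transformation maps a proportion p (strictly between 0 and 1) onto an unbounded scale via log(p / (1 − p)). A base R sketch with invented values:

```r
# Logit transform proportions: log(p / (1 - p)), equivalently qlogis(p)
p = c(0.2, 0.5, 0.8)
l = log(p / (1 - p))
```

A proportion of 0.5 maps to 0 on the logit scale, and proportions equidistant from 0.5 map to values symmetric around 0.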
It is important to emphasize again that the random effects as specified in this
model are random intercepts. This allows the average proficiency score of each
“student” and “class” to vary and will model, for example, that some students
might on average score less than others, while some classes might on average score
higher than others. In a between-groups design, the random variance across con-
ditions can be modeled with random intercepts. In a repeated measures design,
however, it is important to consider not only random intercepts but also random
slopes.
In this example study, “course” varies between students and classes. That is,
each student and class is tested on either Strategy A or Strategy B but not both.
In other words, a student or class is only tested on one level of the independent
variable “teaching strategy” (A or B). However, whereas “course” varies between
students and classes, “time” varies within them, as each student and each class
was tested at multiple points in time. As such, students and classes may not only
differ in overall average proficiency, but also in their sensitivity to the change
in proficiency over time. Some students (and classes) may greatly increase over
time, while others may only increase slightly. Currently, this type of variance is
not modeled in model.1. The random intercepts that this model includes only
model variance in average scores across students and classes, not variance in the rate
of change over time. Random slopes are required to model this type of variance.
Random slopes can be included for any repeated measures variable. It is impera-
tive that random slopes are included when required, as not including a random
slope for a repeated measures variable when there is considerable random slope
variance can lead to overconfident estimates of fixed effects and spurious results
(Barr, Levy, Scheepers, & Tily, 2013; Schielzeth & Forstmeier, 2009).
We add random slopes as follows. We first create a second model with a
random slope of “c_time” varying by student and then a third that addition-
ally includes the random slope of “c_time” varying by class. As “s_course” is not
repeated within participants and classes (no participant or class was tested on both
strategies A and B), it is not necessary to include the main effect of “s_course” or
the s_course:c_time interaction in the random slope terms. Random slope inter-
actions would be needed for any interaction that involves only repeated measures
variables (Barr, 2013).
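Sketched in the same style as the model.5 call shown later in the chapter, the two random slope models and their comparison might be specified as follows (assuming model.1 and the scores data frame from earlier):

```r
# model.2: random slope of c_time varying by student
model.2 = lmer(l_prof ~ s_course*c_time + (1+c_time|student) + (1|class),
               data = scores)

# model.3: additionally, random slope of c_time varying by class
model.3 = lmer(l_prof ~ s_course*c_time + (1+c_time|student) +
                 (1+c_time|class), data = scores)

# Compare the REML fits; refit = FALSE prevents refitting with ML
anova(model.1, model.2, model.3, refit = FALSE)
```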
Whether these random slopes improve the fit can be assessed by comparing the models with the anova() function. Note that we have used the anova() function here specifying refit = FALSE. The reason for this will be discussed in more detail next (the output here has been edited for space).
The results in the final column show that model.2 provides a significantly
improved fit over model.1, and model.3 provides a significantly improved fit
over model.2, indicating that the random slopes are accounting for a significant
amount of the random variance. Indeed, the summary for model.3 in the next
code sample (edited to save space) shows that the REML criterion value for
model.3 (1164) is lower than model.1 (1307), indicating that model.3 provides a
better fit. The addition of the random slopes for “c_time” in model.3 has also led
to an increase in the standard errors for the fixed effects in model.3 compared
to model.1, indicating that the random intercept only model was providing an
overconfident estimate of these parameters.
> summary(model.3)
Random effects:
Groups Name Variance Std.Dev. Corr
student (Intercept) 0.1947303 0.44128
time 0.0006193 0.02489 -0.13
class (Intercept) 0.0083835 0.09156
time 0.0004230 0.02057 1.00
Residual 0.1412285 0.37580
Fixed effects:
Estimate Std. Error t value
(Intercept) -0.142194 0.144181 -0.986
s_course 0.873302 0.288362 3.028
c_time 0.116553 0.008789 13.261
s_course:c_time 0.058102 0.017578 3.305
There has been some debate in the literature regarding how one should decide
on whether or not a random slope should be included in the analysis. Some
researchers have adopted a data-driven approach (e.g., Baayen et al., 2008) in
which random slopes are included only if they significantly improve model fit
(as shown earlier). Data-driven approaches are ideal for exploratory research. For
example, large corpora may have many independent variables. In such cases, it
may be unrealistic to include all fixed and random effects at once, and as such it
may make sense to adopt a data-driven approach.
However, in confirmatory research, the researcher designs a study to test a specific set of hypotheses. Barr et al. (2013) argued that in this case, the random
effects should reflect the design of the study and the hypotheses being tested
such that random intercepts and slopes should be included for all theoretically
relevant fixed effects. They dubbed this the maximal model. Our example study
here was devised to examine how two teaching strategies influence proficiency
over time. In the design used, the random effects structure in model.3 contains
all the theoretically relevant random intercepts and slopes for the included fixed
effects to test these aims, and thus would be the maximal model. We suggest
researchers follow Barr et al. in use of maximal models in confirmatory research.
Note that the term confirmatory here is not intended to mean that random slopes
are needed only in replication research (i.e., research that attempts to confirm
existing results), rather it relates to research that tries to confirm (or reject) spe-
cific hypotheses.
Another issue that arises even in confirmatory research is whether to include
random slopes for control predictors (i.e., a predictor that is not of prime theo-
retical interest but which may affect the results; see Barr et al., 2013). There is
little consensus in this case. Given that a model may become overly complex if all
possible fixed and random effects for control predictors are included by default, a
data-driven approach might be appropriate in such cases to decide whether such
parameters should be included.
Recall that t values are reported in the model summaries, but not p values. The
calculation of exact p values for mixed effects models is not straightforward as it is
not obvious how the degrees of freedom should be counted (Baayen et al., 2008;
Bates, 2006). There are different ways to estimate p values and determine statistical
significance, although here too there is no current consensus on which method to
use. One way is to estimate p values from the t distribution as shown next (from
Baayen, 2008, p. 248):
2 * (1 - pt(abs(3.028), 749 - 4))
[1] 0.002546729
Note that this p value can be overly liberal for small data sets (Baayen, 2008;
Baayen et al., 2008). The degrees of freedom are estimated by subtracting the
number of fixed effects from the number of observations. Consequently, when
a data set is small, subtracting the number of fixed effects from the number of
observations can have a large impact on the p value. However, in the case of the
current example study, the difference between 749 and 749 – 4 is largely incon-
sequential. For further discussion of ways to assess statistical significance in mixed
models, see Baayen (2008, pp. 247–248), Baayen et al. (2008, pp. 396–399) and
Barr et al. (2013, pp. 276–277).
Although this hypothetical study was designed to examine the effects of two
teaching strategies, the researcher may want to consider if potentially confound-
ing variables are influencing the data. As mentioned earlier, one may not want
to include such control predictors in the analysis by default, as including too
many variables can lead to a model that is overly complex and difficult to inter-
pret. As we compared different random effects structures, we can also use model
comparisons to test whether the inclusion of fixed effects for control variables
significantly improves model fit. The researcher can then include or remove a
control predictor based on whether or not it provides a better fit to the data. As
an example, we create a model with a fixed main effect of gender to test for any
differences between male and female participants (note gender is first sum coded
into s_gender as above with course). We then compare model.4 to model.3 using
the anova() function (note refit = FALSE is not specified this time) to see if the
inclusion of s_gender improves model fit.
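A sketch of this comparison, assuming a sum coded s_gender column has been added to the scores data frame:

```r
# model.4: model.3 plus a fixed main effect of (sum coded) gender
model.4 = lmer(l_prof ~ s_course*c_time + s_gender + (1+c_time|student) +
                 (1+c_time|class), data = scores)

# anova() will refit both models with ML before comparing the fixed effects
anova(model.3, model.4)
```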
Here, the results of the model comparison in the last column suggest that
model.4 does not provide a significantly improved fit to the data (p = .229) com-
pared to model.3, and as such we do not need to include the fixed effect for
gender.
Note that when a model is fit using REML, as here, model comparisons are
appropriate only when comparing models with different random effects (Pin-
heiro & Bates, 2000). To compare two models that differ in fixed effects, mod-
els should be fit using maximum likelihood. The anova() function refits the
model with maximum likelihood to allow comparison of models differing in
fixed effects. The output shown earlier illustrates this by stating refitting model(s)
with ML (instead of REML). When we compared different random effects for
model.1, model.2 and model.3 earlier we explicitly specified refit = FALSE to
ensure that the anova() function did not refit these models using maximum
likelihood (ML), as comparing models with different random effects can be con-
ducted on models fit using REML. To compare models with different fixed
effects, however, ML should be used. Although the anova() function can automatically do this, it is also possible to compare the same models fit with ML.
Next we suppress the default option of lmer() to fit models by REML with the
code REML=F.
> model.5 = lmer(l_prof ~ s_course*c_time + (1+c_time|student) + (1+c_time|class), data = scores, REML=F)
Note that this time the anova() function does not give the refitting model(s)
with ML warning. The comparison here is still nonsignificant and has the same
chi-square and p values as before. For conciseness, we suggest researchers fit mod-
els using ML (REML=F) when comparing models with different fixed effects,
rather than relying on refitting using the anova() function.
At this point, we are ready to report our results. The results could be reported
as follows:
The data were analyzed using a linear mixed effects model with fixed effects for course, time, and the course by time interaction. The fixed effect factor “course” was sum coded while the
continuous fixed predictor “time” was centered. The dependent variable
(“proficiency”) was transformed using a logit transformation. Students and
classes were treated as random effects with students nested under classes.
Random intercepts for subjects and classes were included, as were random
slopes for time varying by both students and classes, using a maximal ran-
dom effects structure. Statistical significance was assessed by calculating p
values from the t distribution.
This model revealed a significant main effect of course (estimate = 0.87,
SE = 0.29, t = 3.03, p = .003), with those taught with Strategy B dem-
onstrating a higher average proficiency than those taught with Strategy A.
There was also a significant main effect of time (estimate = 0.12, SE = 0.01,
t = 13.26, p < .001), with the positive estimate indicating that the average
proficiency across both groups increased at a rate of 0.12 points on the logit
scale for every one unit increment in time (i.e., every month). Importantly,
these main effects were qualified by a significant course by time interaction
(estimate = 0.06, SE = 0.02, t = 3.31, p < .001), indicating that the increase
in proficiency over time was reliably larger for teaching Strategy B than
Strategy A, suggesting Strategy B is more effective. Indeed, although both
groups started with similar proficiency scores, after 24 months of teaching,
students taught with Strategy B had a proficiency score 25 points higher
(1.46 on the logit scale) than those taught with Strategy A. The addition of
a fixed main effect for gender did not lead to an improvement in model fit
compared to the model without (χ2 [1] = 1.45, p = .229), suggesting gender
did not affect proficiency in this study.
Models with random slopes also estimate a correlation between the random intercepts and random slopes (e.g., a student with a higher than average proficiency may learn faster than average over
time). The summary of model.3 indicates the correlation between the random
intercept and random slope for class is very high. High correlations can often
occur with models that fail to converge. If this were the case, the model could
be simplified by removing the correlation parameter with the syntax (1|class)
+ (0+c_time|class). Unfortunately, there is little consensus in best practice
when dealing with convergence errors (see Barr et al., 2013, pp. 275–276 for
discussion).
• Whenever possible, consider making raw data sets, and the R scripts
used to prepare and analyze them, available for reanalysis by other
researchers.
Further Reading
We are aware of four introductions to R with an emphasis on language data, all of which provide a strong foundation in both linguistic/quantitative analysis and the use of the R statistical package. Field (2013, Chapter 25) provides a practical
introduction to using mixed effects models in SPSS.
Further discussion of mixed effects models can be found in the 2008 special
edition of the Journal of Memory and Language on emerging data analyses (Baayen et al., 2008; Barr, 2008; Dixon, 2008; Jaeger, 2008; Mirman et al., 2008; Quené & van den Bergh, 2008). Cunnings (2012) and Linck and Cunnings (in press) provide additional introductions aimed at L2 researchers. Existing L2 studies using
mixed effects models for longitudinal analysis include Ljungberg et al. (2013) and
Meunier and Littre (2013; see Sample Study 1).
Discussion Questions
1. Think about the variables of a study you have read about or that you are
conducting. Would a mixed effects model be appropriate? If not, why not? If
appropriate, which factors would you consider to be fixed vs. random? Why?
2. In the analysis in this chapter, the main effect of gender did not improve
model fit. Other potentially confounding variables in the study are L1, age,
and length of exposure. Consider whether these variables should be included
while bearing the following questions in mind.
a) How should these variables be coded?
b) Do any of these variables lead to a significant improvement in model fit?
c) Should you only include main effects of each of these variables, or could
they potentially interact with other independent variables?
d) If any of these variables do provide a significantly improved model fit,
should you also consider including random slopes? If so, what random
slopes should be included?
3. The Categorical data frame in the supplementary data file (Longitudinal.
RData), available on this book’s companion website (http://oak.ucc.nau.
edu/ldp3/AQMSLR.html), contains a similar set of data with a different
dependent variable. Imagine participants took part in a formal test at each
point in time. With 50 questions per test, this equates to students answer-
ing 250 questions in total over the course of teaching. Responses to each
References
Baayen, H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Baayen, H., Davidson, D., & Bates, D. (2008). Mixed-effects modeling with crossed random
effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Barr, D. (2008). Analysing “visual world” eyetracking data using multilevel logistic regres-
sion. Journal of Memory and Language, 59, 457–474.
Barr, D. (2013). Random effects structure for testing interactions in linear mixed-effects
models. Frontiers in Psychology, 4, 328. doi: 10.3389/fpsyg.2013.00328
Barr, D., Levy, R., Scheepers, C., & Tily, H. (2013). Random-effects structure for con-
firmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68,
255–278.
Bates, D. (2005). Fitting linear models in R: Using the lme4 package. R News, 5, 27–30.
Bates, D. (2006). Post to the R-help mailing list, 19 May 2006. https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html
Boyle, M., & Willms, J. (2001). Multilevel modelling of hierarchical data in developmental
studies. Journal of Child Psychology and Psychiatry and Applied Disciplines, 42, 141–162.
Chen, X., Ender, P., Mitchell, M., & Wells, C. (2003). Regression with SAS. http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm
Ortega, L., & Byrnes, H. (2008). The longitudinal study of advanced L2 capacities. New York:
Routledge.
Ortega, L., & Iberri-Shea, G. (2005). Longitudinal research in second language acquisition:
Recent trends and future directions. Annual Review of Applied Linguistics, 25, 26–45.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York:
Springer.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., Egbert, J., & LaFlair, G. T. (in press). Bootstrapping in applied linguistics: Assess-
ing its potential using shared data. Applied Linguistics.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
the case of interaction research. Language Learning, 61, 325–366.
Quené, H., & van den Bergh, H. (2008). Examples of mixed-effects modelling with crossed
random effects and with binomial data. Journal of Memory and Language, 59, 413–425.
R Development Core Team. (2014). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Raudenbush, S. (1993). A crossed random effects model for unbalanced data with appli-
cations in cross-sectional and longitudinal research. Journal of Educational Statistics, 18,
321–349.
Raudenbush, S. (2001). Comparing personal trajectories and drawing causal inferences
from longitudinal data. Annual Review of Psychology, 52, 501–525.
Raudenbush, S., & Bryk, A. (2002). Hierarchical linear models: Applications and data analysis
methods (2nd ed.). Thousand Oaks, CA: Sage.
Revelle, W. (2014). psych: Procedures for personality and psychological research (Version 1.4.8). Evanston, IL: Northwestern University. http://CRAN.R-project.org/package=psych
Schielzeth, H., & Forstmeier, W. (2009). Conclusions beyond support: Overconfident esti-
mates in mixed models. Behavioral Ecology, 20, 416–420.
Singer, J. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models,
and residual growth curve models. Journal of Educational and Behavioral Statistics, 23,
323–355.
Singer, J., & Willett, J. (2003). Applied longitudinal data analysis: Modeling change and event
occurrence. New York: Oxford University Press.
Snijders, T., & Bosker, R. (1999). Multilevel analysis. London: Sage.
Turner, J. L. (2014). Using statistics in small-scale language education research. New York:
Routledge.
9
EXPLORATORY FACTOR
ANALYSIS AND PRINCIPAL
COMPONENTS ANALYSIS
Shawn Loewen and Talip Gonulal
Proponents feel that factor analysis is the greatest invention since the double
bed whereas its detractors feel it is a useless procedure that can be used to
support nearly any desired interpretation of the data. The truth, as is usually
the case, lies somewhere in between. Used properly, factor analysis can yield
much useful information; when applied blindly and without regard for its
limitations, it is about as useful and informative as tarot cards. (p. 144)
Conceptual Motivation
As in many other disciplines, L2 researchers often explore large data sets. For
instance, researchers interested in teachers’ and students’ beliefs about grammar
instruction may collect data from numerous participants using a survey with many
individual questions. Alternatively, researchers might investigate the occurrence of
various linguistic structures in different discourse types in L1 and/or L2 corpora. In
such research studies, a frequent objective is to reduce the initial data set by identi-
fying variables, such as the survey questions or linguistic structures mentioned ear-
lier, that behave similarly. Factor analysis can be used to investigate the correlations
present in the data and to consolidate variables in a principled manner.
Factor analysis is not a single statistical method but a series of complex structure-analyzing procedures, like structural equation modeling (see Schoonen, Chapter 10 in this volume), that investigate the potentially unobserved relationships amongst variables in a data set; as such, factor analysis can be used for a variety of purposes.
One common use is to explore the underlying relationships in a set of variables
by deriving a more parsimonious number of related variables, referred to as factors
or components (Gorsuch, 1983; Kline, 2002; Tabachnick & Fidell, 2013; Thompson,
2004). These factors are argued to represent underlying constructs (also known as
latent variables) in the data. For example, Loewen et al. (2009) used factor analysis
to group 37 questionnaire items into six conceptually related factors. Additionally,
factor analysis can be applied to data sets with large numbers of items or variables
in order to reduce the data to a more manageable size (Field, 2009; Gorsuch, 1983,
1990). For instance, Asención-Delaney and Collentine (2011) used factor analysis
to investigate how 78 different linguistic structures in a written L2 Spanish corpus
grouped into different discourse types. Moreover, factor analysis can be used for
conducting item analysis to strengthen tests or questionnaires by identifying items
that are relatively unrelated to the overall test (see Gorsuch, 1983; Kline, 2002).
Finally, as explained later, the factors generated from a factor analysis can be used
in subsequent analyses such as ANOVA and regression.
(a) reading and writing tests, (b) effective administration, (c) impacts on cur-
riculum and learning, (d) speaking test, and (e) listening test. These factors would
have been difficult to identify simply looking at the 40 items in the questionnaire.
CFA, however, is used when researchers have specific expectations regarding
the underlying structure of the data. For example, Mizumoto and Takeuchi (2012) used CFA in their adaptation and validation study of Tseng, Dörnyei, and Schmitt’s (2006) self-report questionnaire investigating the self-regulating capacity in vocabulary learning in a Japanese English as a foreign language setting. Because the researchers were basing their analysis on Tseng et al.’s (2006) previously conducted analysis, Mizumoto and Takeuchi had clear expectations regarding which and how many factors would underlie the questionnaire. Consequently, it was appropriate for them to conduct a CFA.
As seen from the previous examples, the selection between EFA and CFA
depends primarily on whether researchers have specific theoretical expectations
regarding the number and nature of factors present in the data. (See Thompson,
2004 for more detail on the differences between EFA and CFA.)
One practical difference between EFA and CFA lies in the software programs
used for statistical analyses. When conducting an EFA, more common statistical
computer software packages (e.g., SPSS, R, and SAS) are used, whereas in CFA,
more recent (and less common) statistical packages (e.g., AMOS, LISREL, and
EQS) are used. Considering the different assumptions and purposes of CFA, and
due to limited space, this chapter will focus exclusively on EFA. See Harrington
(2009) for more details on conducting CFA. Nevertheless, conceptual knowledge
of EFA is helpful in understanding CFA.
In addition to the differences between CFA and EFA, there is some ambiguity
in the terminology used within EFA itself because it is often used as an umbrella
term covering both principal components analysis (PCA) and EFA. However,
there are two schools of thought on the differences between EFA and PCA (Hen-
son & Roberts, 2006). Some statisticians view EFA and PCA as completely dif-
ferent types of analyses, whereas other statisticians treat PCA as a type of EFA that
differs only in its method of factor extraction.
In conceptual terms, the difference between PCA and EFA lies in how they
treat the variance that is present in the data; PCA analyzes variance whereas EFA
analyzes covariance (Tabachnick & Fidell, 2013). That is to say, PCA includes all
variance (i.e., the variability or spread within a data set) including (a) variance
unique to each variable, (b) variance common among variables, and (c) error
variance (Gorsuch, 1983; Kline, 2002; Tabachnick & Fidell, 2013). In contrast,
EFA includes only the variance in the correlation coefficients (i.e., the variance
common among variables), whereas the error variance and the variance unique to
each variable are excluded from the analysis. In sum, PCA does not differentiate
between common and unique variance, but EFA does.
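In R, for example, this distinction corresponds to two separate functions in the psych package; the sketch below assumes psych is installed and that dat is a hypothetical data frame of item responses:

```r
library(psych)

# PCA: all variance (unique, common, and error) enters the analysis
pca_fit = principal(dat, nfactors = 2, rotate = "none")

# EFA with principal axis factoring: only common variance is modeled
efa_fit = fa(dat, nfactors = 2, fm = "pa", rotate = "none")
```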
The importance of the distinction between EFA and PCA is controversial
(Field, 2009). Often PCA results may be very similar to EFA results; however, in
some instances, there may be meaningful and substantial differences between the
two (Conway & Huffcutt, 2003). For instance, in PCA the weight with which
variables load on to factors may be too high, whereas EFA loadings are more
accurate when the data meet the assumptions of EFA (Widaman, 1993). Fabri-
gar, Wegener, MacCallum, and Strahan (1999) investigated several data sets and
showed that there were a number of cases in which EFA and PCA solutions were
different. Gorsuch (1990) argued that it is better to use EFA because it produces
better solutions some of the time and similar results the rest of the time. Conway
and Huffcutt (2003) note that:
FIGURE 9.2 Overview of the steps in a factor analysis (adapted from Rietveld & Van Hout, 1993, p. 291)
Factor Analysis 187
In many cases, the software programs used for conducting EFA contain default
settings; however, overdependence on such settings, as is sometimes seen in fac-
tor analytic L2 research (Plonsky & Gonulal, 2015), may not provide the most
accurate analyses. Therefore, it is crucial for researchers to be informed about the
various options in conducting an EFA and to follow a decision pathway to obtain
the best results. The flow diagram (Figure 9.2) adapted from Rietveld and Van
Hout (1993, p. 291) illustrates the necessary steps to conduct an EFA. The next
sections will discuss these steps in order, and important decision points will be
explained. Throughout the steps, examples from the LearnerBeliefsData.sav file
(available on the companion website, http://oak.ucc.nau.edu/ldp3/AQMSLR.
html), which was subjected to a FA with principal components analysis extraction
in SPSS (version 21), will be provided. Note that different versions of SPSS may
differ somewhat in their format and output.
1. Factorability of Data
The first step in conducting an EFA is to consider if the data are appropriate
for factor analysis. As in other statistical methods, researchers should check the
assumptions of EFA. Specifically, EFA can be used for interval data, including Lik-
ert scale items. Further, the variables used in EFA should be linearly related and
moderately correlated. In addition, sample size should be taken into consideration
because correlations are highly sensitive to N. There are several rules of thumb
regarding the appropriate sample size for factor analysis. In some cases, research-
ers propose minimum sample sizes such as 100 (Hair, Anderson, Tatham, & Black,
1995), 300 (Tabachnick & Fidell, 2013), or 500 (Comrey & Lee, 1992). Alterna-
tively, recommendations regarding sample size relate to the specific number of
subjects or items per variable. The exact number required is disputed, with esti-
mates ranging from 3 to 20 subjects or items per variable (Gorsuch, 1983, 1990,
2003; Pett, Lackey, & Sullivan, 2003; Tabachnick & Fidell, 2013; Thompson, 2004).
That being said, 10 to 15 is the most common suggestion (Field, 2009). However,
following a rule of thumb can sometimes be misleading because a large sample
size is not always necessary for accurate factor solutions or correlations. According
to MacCallum, Widaman, Zhang, and Hong (1999), “when communalities are
high (greater than .60) and each factor is defined by several items, sample sizes can
actually be relatively small” (p. 402).
Because the suggested sample size for factor analysis varies considerably (for
further detail on sample size in factor analysis see Gorsuch, 1983, 1990, 2003;
MacCallum et al., 1999), one additional approach is to conduct a post hoc analysis
to investigate the appropriateness of a given sample for a specific analysis. One
such method is the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy.
KMO values range from 0 to 1, with higher values representing better sampling
adequacy (see Figure 9.3). More specifically, “values between 0.5 and 0.7 are
mediocre, values between 0.7 and 0.8 are good, values between 0.8 and 0.9 are
great and values above 0.9 are superb” (Field, 2009, p. 679). Thus, the KMO
value of 0.897 in Figure 9.3 represents a very good sample size for the specific
study (which had 754 participants and 24 variables, or roughly 30 participants per
variable).
188 Shawn Loewen and Talip Gonulal
FIGURE 9.3 Example of KMO measure of sampling adequacy and Bartlett’s Test of
Sphericity (SPSS output)
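The arithmetic behind the KMO statistic can be sketched directly: it compares the squared observed correlations to the squared partial correlations obtained from the inverse of the correlation matrix. The following pure-Python sketch uses an invented three-item correlation matrix, not the chapter's data:

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        # Pivot on the largest absolute value for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]

def kmo(R):
    """Kaiser-Meyer-Olkin measure: sum of squared correlations over
    sum of squared correlations plus squared partial correlations."""
    inv = invert(R)
    n = len(R)
    r2 = q2 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                r2 += R[i][j] ** 2
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                q2 += partial ** 2
    return r2 / (r2 + q2)

# Three moderately correlated items (r = .50 throughout)
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
kmo_value = kmo(R)  # ~0.69, "mediocre" on Field's (2009) scale
```

For real data sets, dedicated software (SPSS here) reports the same quantity in the KMO and Bartlett's Test output table.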
Although there is no suggested sample size for L2 research, Plonsky and
Gonulal (2015) reported that in L2 factor analytic research EFA was used for a
median of 24 variables, with a median of 252 participants. The median participant-to-variable ratio was 12.
In addition to determining the appropriate sample size, researchers need to
examine the correlations and communalities among the variables entered into
the EFA. There might be two possible problems here: (a) correlations can be
quite low (or even nonexistent), or (b) correlations can be quite high. Neither
situation is desirable because both indicate a lack of variation in the data. To
test for undesirably low correlations, researchers can employ Bartlett’s Test of
Sphericity, which tests the hypothesis that the correlation matrix is an identity
matrix, meaning that all correlation coefficients are close to 0 (Field, 2009).
Such a scenario is undesirable because if no variables are correlated, then it
is not possible to find clusters of related variables. Therefore, Bartlett’s Test
indicates whether the correlations between variables are significantly different
from 0 (Field, 2009), and a significant result with p < .05 indicates that the
variables are correlated and thus suitable for EFA, as is seen in the Sig. value of
.000 in Figure 9.3.
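Bartlett's statistic itself is easy to reproduce: it scales the log determinant of the correlation matrix as chi2 = -(n - 1 - (2p + 5)/6) * ln|R|, with p(p - 1)/2 degrees of freedom. A sketch with an invented three-item matrix (the chi-square p-value lookup is left to a table or a statistics package):

```python
import math

def det(m):
    """Determinant by cofactor expansion (fine for small matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += ((-1) ** j) * m[0][j] * det(minor)
    return total

def bartlett(R, n_subjects):
    """Bartlett's test of sphericity:
    chi2 = -(n - 1 - (2p + 5)/6) * ln|R|, df = p(p - 1)/2.
    Compare chi2 against the chi-square distribution with df degrees
    of freedom to obtain the significance level."""
    p = len(R)
    chi2 = -(n_subjects - 1 - (2 * p + 5) / 6) * math.log(det(R))
    df = p * (p - 1) // 2
    return chi2, df

R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
chi2, df = bartlett(R, n_subjects=100)  # chi2 well beyond the df = 3 critical value
```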
In addition to low correlations, another potential problem is multicollinearity,
which is the presence of variables that are too highly correlated, with a correla-
tion coefficient around ±.90. A simple solution to check for multicollinearity
is to inspect the correlation matrix (or R-matrix) and the determinant of the
R-matrix for highly correlated variables. Correlation coefficients beyond ±.90
indicate that the two variables are essentially identical and measure the same
thing, thereby adversely affecting the computation of the EFA. The determinant
of the R-matrix should be greater than 0.0001 (Field, 2009); thus, the determi-
nant of .001 in Figure 9.4 indicates that multicollinearity is not a problem for
this data set. If, however, multicollinearity is a problem, it is advisable to remove
one of the highly correlated variables from the analysis. Experimenting with the
removal of different variables will help determine which variable is having the
largest negative impact (Field, 2009).
FIGURE 9.4 Correlation matrix (R-matrix) for the 24 questionnaire items, with Determinant = .001 (SPSS output)
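This trial-and-error check can be automated: scan the R-matrix for correlations beyond ±.90, and recompute the determinant after dropping a candidate variable. A pure-Python sketch with an invented three-variable matrix:

```python
def det(m):
    """Determinant by cofactor expansion (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum(((-1) ** j) * m[0][j] *
               det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def drop(R, k):
    """Remove variable k from a correlation matrix."""
    return [[v for j, v in enumerate(row) if j != k]
            for i, row in enumerate(R) if i != k]

# Variables 0 and 1 are nearly identical (r = .99999)
R = [[1.0, 0.99999, 0.3],
     [0.99999, 1.0, 0.3],
     [0.3, 0.3, 1.0]]

# Flag pairs whose correlation exceeds the +-.90 threshold
high_pairs = [(i, j) for i in range(len(R)) for j in range(i + 1, len(R))
              if abs(R[i][j]) >= 0.9]

full_det = det(R)             # falls below Field's 0.0001 floor
reduced_det = det(drop(R, 0)) # dropping one of the pair restores it
```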
Finally, examining the communalities (h2) can provide an indication of the
relationship of each variable to the entire data set. Communalities represent
the amount of common variance in a variable that is accounted for by all of
the extracted factors. For example, in Figure 9.5 the communality for Q1
(h2 = .482) indicates that the six extracted factors in Loewen et al.’s (2009) study
explain 48.2% of the variance in the variable. High communalities are desired
because they indicate that the EFA results perform well in accounting for vari-
ance within the variables. Researchers may wish to exclude variables with low
communalities since one purpose of factor analysis is to investigate the common
underlying relationships in a data set.
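Arithmetically, a communality is simply the sum of an item's squared loadings across the extracted factors. A sketch with invented loadings on six factors:

```python
# Rotated loadings for one item on six extracted factors (illustrative values)
loadings = [0.62, 0.10, 0.05, 0.28, 0.12, 0.08]

# Communality: the share of the item's variance the factors account for
h2 = sum(l ** 2 for l in loadings)
percent_explained = 100 * h2  # reportable as "the factors explain ...% of the variance"
```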
Figures 9.6, 9.7, and 9.8 illustrate the initial steps for conducting an EFA
in SPSS.
Item Initial Extraction
Q1 1.00 .482
Q2 1.00 .549
Q3 1.00 .670
Q4 1.00 .472
Q6 1.00 .520
Q7 1.00 .568
Q11 1.00 .573
Q12 1.00 .542
Q13 1.00 .448
Q16 1.00 .527
Q17 1.00 .665
Q18 1.00 .551
Q21 1.00 .597
Q22 1.00 .720
Q23 1.00 .519
Q26 1.00 .558
Q27 1.00 .494
Q31 1.00 .466
Q32 1.00 .578
Q33 1.00 .704
Q36 1.00 .510
Q37 1.00 .464
RVQ8 1.00 .609
RVQ28 1.00 .682
FIGURE 9.5 Communalities (SPSS output)
Start by selecting Analyze > Dimension Reduction > Factor, which will
bring up the main dialogue box for factor analysis.
Select the variables of interest from the main dialogue box and move them
into the Variables dialogue box, then click the Descriptives button.
selecting Fixed number of factors and then entering the desired number of factors.
In most cases, the default value of 25 for Maximum Iterations for Convergence is ade-
quate, although a larger value might be needed for larger data sets (Field, 2009).
[Scree plot: eigenvalues plotted against component number (1–24)]
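The eigenvalues behind a scree plot also feed the Kaiser criterion (retain components with eigenvalues greater than 1) and the total variance explained. A sketch with invented eigenvalues for a 24-variable analysis:

```python
# Eigenvalues from a hypothetical 24-variable analysis (illustrative values;
# the remaining eigenvalues, all below 1, are omitted)
eigenvalues = [6.1, 2.3, 1.8, 1.4, 1.2, 1.05, 0.9, 0.8]

# Kaiser criterion: retain components with eigenvalues greater than 1
retained = [e for e in eigenvalues if e > 1]
n_factors = len(retained)

# Proportion of total variance the retained components explain
# (each standardized variable contributes 1 unit of variance, 24 in total)
variance_explained = sum(retained) / 24
```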
In the main Factor Analysis dialogue box, click on Rotation and select Direct
Oblimin for an oblique rotation. The default Delta value of 0 is recommended
(Field, 2009). In the Display section, select the Rotated solution in order to produce
the rotated factor-loading matrix (Figure 9.12). The Maximum Iterations for Convergence option specifies how many times SPSS will attempt to find a solution for the data set. The default value of 25 is usually adequate; however, for large data sets it is possible to increase the number of iterations, as done here for the N of 750.
In addition to the steps mentioned earlier, there are several additional options
when conducting an EFA (Figure 9.13). The first addresses missing data,
and it allows researchers to Exclude cases listwise, which means that any case with
missing data for any variable is excluded from the entire analysis. Alternatively,
Exclude cases pairwise includes all cases, even if they have missing scores on one
or two variables. The missing scores for each case are simply eliminated from the
analysis, while the remaining scores are included in the analysis. Because factor
analysis is based on correlations across the data set, it is recommended to eliminate
listwise rather than pairwise; however, listwise elimination may result in substan-
tial data loss if numerous cases have missing scores.
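The difference between the two options can be illustrated with a small invented response matrix, where None marks a missing answer:

```python
# Questionnaire responses; None marks a missing answer
cases = [
    [4, 5, 3, 4],
    [5, None, 4, 4],     # missing Q2
    [3, 4, 4, 5],
    [None, 2, 3, None],  # missing Q1 and Q4
]

# Listwise deletion: drop any case with at least one missing value
listwise = [row for row in cases if None not in row]

# Pairwise deletion: each correlation uses whichever cases answered
# both items, so different correlations rest on different subsamples
def complete_pairs(data, i, j):
    return [(row[i], row[j]) for row in data
            if row[i] is not None and row[j] is not None]

q1_q3 = complete_pairs(cases, 0, 2)  # retains three cases, not two
```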
Other options in the dialogue box include sorting variables according to the
size of their loadings on each factor, with the highest absolute scores placed first
on the list. Finally, SPSS also allows the suppression of absolute values less than a
specified value, for example .30. This option aids in factor interpretation because
it identifies only the variables that contribute substantially to the factor.
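The suppression option amounts to blanking out small loadings; a sketch with invented loadings and the conventional .30 cutoff:

```python
def suppress(loadings, cutoff=0.30):
    """Blank out loadings whose absolute value falls below the cutoff,
    mirroring SPSS's 'Suppress absolute values less than' option."""
    return [[l if abs(l) >= cutoff else None for l in row]
            for row in loadings]

# Loadings for three items on two factors (illustrative values)
matrix = [[0.62, 0.15],
          [0.28, 0.55],
          [-0.41, 0.08]]
cleaned = suppress(matrix)
# -> [[0.62, None], [None, 0.55], [-0.41, None]]
```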
Once all the desired options have been chosen, click OK in the main Factor
Analysis dialogue box.
5. Results
5.1 Factor Loadings
The next step after conducting the factor rotation and producing the rotated
component matrix (i.e., after producing the SPSS output for your factor analysis)
is to examine the factor loadings, which indicate the strength of the association
between each variable and each factor. Ideally, each variable should have a high
loading on only one factor, with small loadings on the remaining factors. Of
course, the interpretation of what constitutes a high loading is subjective, and
not surprisingly, there are different opinions about the optimal factor loading
score. One suggestion is to consider all loadings greater than .30 as important
(Comrey & Lee, 1992; Field, 2009); however, a cutoff score of .40 has also been
proposed (Pett et al., 2003). Finally, Stevens (2009) offers different guidelines for
evaluating factor loadings depending on the sample size. For instance, for a sample
size of 300, loadings should be larger than .298 whereas for a sample size of 600,
a loading of .21 is considered important (see Stevens, 2009, for further detail).
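Stevens's sample-size-dependent cutoffs follow from doubling the .01-level two-tailed critical value of a correlation coefficient; assuming that formulation (5.152 / sqrt(N - 2), which reproduces the .298 and .21 figures above), the rule can be sketched as:

```python
import math

def stevens_cutoff(n):
    """Loading cutoff as twice the .01-level two-tailed critical value
    of a correlation: 2 * 2.576 / sqrt(n - 2). This formulation is an
    assumption here; consult Stevens (2009) for the full guidelines."""
    return 5.152 / math.sqrt(n - 2)

cutoff_300 = stevens_cutoff(300)  # ~0.298
cutoff_600 = stevens_cutoff(600)  # ~0.21
```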
Once the factor loading cutoff level has been determined, the variables with
high loadings can be inspected. One useful aid in this process is the option in
SPSS that suppresses factor loadings lower than a specified cutoff point. As seen
in Figure 9.14, Loewen et al. (2009) suppressed factor loadings from –.29 to .29,
meaning that loadings beyond .30, such as Item 1 on Factor 1, are visible, while
loadings less than .30, such as Item 1 on Factor 2, are hidden. It is possible for
a variable to have low loadings on all factors, indicating that the variable is not
strongly associated with the other variables. In such cases, it is often desirable to
exclude the variable and rerun the analysis, keeping in mind that when an item
is excluded from a subsequent analysis, the factor loadings of the remaining items
will change. It is therefore important to exclude one item at a time and check the
new factor loadings accordingly.
In addition to variables that do not have high loadings on any factors, it is also
possible to have complex variables that have high loadings on more than one
factor, making interpretation difficult. For example, in Figure 9.15, Item 33 has
a loading of –.566 on Factor 4 and .661 on Factor 5. There are several suggested
solutions to this problem (Field, 2009; Henson & Roberts, 2006). One suggestion
is to simply assign the item to the factor that it loads most highly on. Another
option is to try different extraction and rotation methods to see if a stronger
differentiation of loadings across factors can be obtained.
[Component matrix: loadings of the 24 items on six extracted components, with values between –.29 and .29 suppressed (SPSS output)]
FIGURE 9.15 Rotated factor loadings (pattern matrix) (adapted from Loewen et al., 2009)
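Screening a loading matrix for weak and complex variables can be mechanized; the sketch below uses invented loadings (the item labels are illustrative) and the .30 cutoff:

```python
def classify_items(loadings, cutoff=0.30):
    """Sort items into 'complex' (high loadings on two or more factors),
    'clean' (exactly one high loading), and 'weak' (no high loading)."""
    out = {"complex": [], "clean": [], "weak": []}
    for item, row in loadings.items():
        high = sum(1 for l in row if abs(l) >= cutoff)
        key = "complex" if high > 1 else "clean" if high == 1 else "weak"
        out[key].append(item)
    return out

# Illustrative rotated loadings on six factors
loadings = {
    "Q33": [0.05, 0.10, 0.02, -0.57, 0.66, 0.08],  # loads on two factors
    "Q18": [0.12, 0.08, 0.05, 0.10, 0.04, 0.32],   # loads on one factor
    "Q22": [0.11, 0.02, 0.21, 0.05, 0.18, 0.09],   # candidate for removal
}
groups = classify_items(loadings)
```

Weak items flagged this way would then be removed one at a time, rerunning the analysis after each removal, as described above.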
each of the nine items on Factor 1; however, it is possible to combine the nine
scores into one factor score, which then provides a single numeric value for the
individual’s position on the factor. Thus, participants with higher variable scores
will have higher factor scores, while individuals with lower variable scores will
have lower factor scores (Rietveld & Van Hout, 1993).
There are several different statistical methods for computing factor scores. The
simplest is to sum or average each individual’s score on the variables that comprise
the factor, but such a method does not take into account the fact that variables
load on multiple factors. For example, Item 27 in Figure 9.15 has loadings above
.30 on factors 1 and 4. Rather than counting the item twice, or omitting it from
one factor, it is possible to calculate factor scores that reflect the weight of loadings
across the factors. In SPSS there are three primary methods of calculating factor
scores: the Regression method, the Bartlett method, and the Anderson-Rubin
method (see Figure 9.16). These three methods generally produce similar factor
scores; however, they differ slightly in their mathematical calculations. (See Field,
2009 and Thompson, 2004 for further details.) Click Scores from the main Fac-
tor Analysis dialogue box (Figure 9.16) and select Save as variables. Select which
method to use to calculate the factor scores, which will appear as variables in the
data view section of SPSS.
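The simplest of these approaches, averaging each participant's responses to the items on a factor, can be sketched with invented data (the regression, Bartlett, and Anderson-Rubin weightings are left to SPSS):

```python
# Item-to-factor assignment from a completed EFA (illustrative)
factor_items = {1: ["Q1", "Q2"], 2: ["Q3"]}

# Two participants' item responses (illustrative)
participants = [
    {"Q1": 4, "Q2": 5, "Q3": 2},
    {"Q1": 2, "Q2": 3, "Q3": 5},
]

# Mean-based factor score: average the responses to each factor's items
scores = [
    {f: sum(p[i] for i in items) / len(items)
     for f, items in factor_items.items()}
    for p in participants
]
# -> [{1: 4.5, 2: 2.0}, {1: 2.5, 2: 5.0}]
```

Note that this simple method ignores cross-loadings, which is exactly the limitation the weighted methods address.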
Rietveld and Van Hout (1993) list several situations in which factor scores can
be very useful, most notably their use in subsequent analyses (cf. Figure 9.2).
An example of factor score use comes from Loewen et al. (2009), who fol-
lowed their EFA with a discriminant function analysis. Loewen et al. (2009) used
factor scores to examine differences in L2 learners’ beliefs about grammar instruc-
tion and error correction according to the target languages that they were studying. Thus, rather than relying on 37 item responses for each individual, the analysis
incorporated only the factor scores for the six factors produced by the EFA.
[Figure 9.15, continued: rotated factor loadings and communalities (h2) for selected items]
33. I feel cheated if a teacher does not correct the written work I hand in. (–.57 on Factor 4, .66 on Factor 5; h2 = .70)
36. Second language writing is not good if it has a lot of grammar mistakes. (.71; h2 = .51)
VI. Negative Attitudes to Grammar Instruction
18. I like it when my teacher explains grammar rules. (.32; h2 = .55)
23. When I have a problem during conversation activities, it helps me to have my teacher explain grammar rules. (–.52; h2 = .52)
26. There should be more formal study of grammar in my second language class. (–.75; h2 = .56)
7. What to Report?
Given the number of options and subjective decisions involved in each step of
an EFA, readers must be able to assess researchers’ processes and results (Comrey & Lee, 1992; Conway & Huffcutt, 2003; Field, 2009; Ford, MacCallum, &
Tait, 1986; Pett et al., 2003). However, a great majority of L2 factor analytic
studies fail to provide sufficient information regarding their factor analytic pro-
cedures and results (Plonsky & Gonulal, 2015). In addition, some researchers are
advised by journal reviewers and editors not to provide too much statistical detail
(e.g., Loewen et al., 2014). This issue is symptomatic of more general problems
related to reporting practices and transparency in L2 research (e.g., Plonsky, 2013;
Larson-Hall & Plonsky, 2015).
Fortunately, there are guidelines regarding what to report for a factor analysis.
Pett et al. (2003), for example, offer a comprehensive set of guidelines that can be
used by researchers, reviewers, and editors who wish to evaluate the quality of a
published factor analysis study. Many of their recommended reporting items have
been exemplified throughout this chapter.
Conclusion
EFA has several important uses and has the potential to greatly inform L2 theory
and practice. Conducting an EFA, however, poses various challenges due in part
to (a) its complex nature, (b) researchers’ limited experiences with EFA, and
(c) the realities of conducting L2 research. Throughout this chapter we have
attempted to provide some useful insights and have presented a step-by-step
treatment of EFA.
We end our discussion now with three principles that we hope will guide
researchers employing this technique: First, each data set should be treated sepa-
rately, with researchers evaluating which EFA options are most appropriate for
the data in question. Second, it is always useful to try out different factor extrac-
tion, retention, and rotation methods to see which ones account for the largest
percentage of variance and provide the most interpretable solutions. Researchers
could begin with the default SPSS settings, and then alter procedures according
to the guidelines discussed throughout this chapter. Conducting multiple analyses
will not only strengthen the results, it will also help provide researchers with a
better understanding of the implications of selecting various EFA options. Third,
it is essential that factor analysts report sufficient information to allow for replica-
tion, evaluation, and accumulation of knowledge. Following these guidelines will
help researchers use EFA to its full potential in investigating various aspects of L2
learning and teaching.
SAMPLE STUDY 1¹
Loewen, S., Li, S., Fei, F., Thompson, A., Nakatsukasa, K., Ahn, S., & Chen, X.
(2009). Second language learners’ beliefs about grammar instruction and error
correction. The Modern Language Journal, 93(1), 91–104.
Background
The role of grammar instruction and error correction in the L2 classroom
has been a topic of considerable debate, centering in large part around the
feasibility and efficacy of meaning-focused instruction versus form-focused
instruction. Although previous studies have taken into consideration both
teachers’ and students’ beliefs on this issue, learner beliefs have received less
attention than teacher beliefs, even though such beliefs may influence the
effectiveness of classroom instruction. It is therefore important to investi-
gate, in detail, L2 learners’ perspectives on this issue.
Research Questions
• What underlying constructs are present in L2 learners’ responses to a
questionnaire regarding their beliefs about grammar instruction and er-
ror correction?
• To what extent can the underlying constructs of learners’ beliefs distin-
guish L2 learners studying different target languages?
Method
A questionnaire consisting of 37 Likert-scale questions regarding beliefs
about L2 grammar instruction and error correction was used.
Statistical Tools
An EFA was chosen because the researchers had no a priori expectations
regarding the number and nature of underlying factors. PCA was selected
for factor extraction and direct oblimin was used for factor rotation. The fac-
tor scores calculated from the EFA were used in the subsequent discriminant
function analysis to determine if students studying different L2s varied in
their responses to the factors.
Results
The EFA produced six factors with eigenvalues greater than 1. These factors
accounted for 55% of the total variance. After examining the content of the
items loading above .30 on each factor, Factor 1 was labeled “Efficacy of
Grammar” and included items such as “Knowing a lot about grammar helps
my reading” and “I usually keep grammar rules in mind when I write in a
second language.” The remaining five factors were labeled (2) “Negative
Attitudes to Error Correction,” (3) “Priority of Communication,” (4) “Impor-
tance of Grammar,” (5) “Importance of Grammatical Accuracy,” and (6)
“Negative Attitudes to Grammar Instruction.”
SAMPLE STUDY 2
Vandergrift, L., Goh, C.C.M., Mareschal, C. J., & Tafaghodtari, M. H. (2006). The
metacognitive awareness listening questionnaire: Development and validation.
Language Learning, 56(3), 431–462.
Background
The metacognitive awareness listening questionnaire (MALQ) is used to
examine the extent to which language learners are conscious of and can
adjust the L2 listening comprehension process. However, developing a valid
instrument that can address language learners’ awareness of the L2 listening
process is not easy and has potential shortcomings, such as being too long
or not comprehensive enough. This study examines the development and
validation of a listening questionnaire aiming to assess L2 listeners’ metacog-
nitive awareness and perceived use of strategies while listening to oral texts.
Method
Vandergrift et al. (2006) examined the relevant and recent literature on meta-
cognition, listening comprehension, and self-regulation. Based on previous
instruments, a comprehensive list of questionnaire items was formed and
then subjected to expert judgment for redundancy, content validity, clarity,
and readability. After this initial fine-tuning, the instrument was piloted with
a few students and revised again for clarity of the items. Finally, a question-
naire of 51 items was adopted.
Statistical Tools
Vandergrift et al. (2006) employed an EFA to determine the emerging fac-
tors, followed by a confirmatory factor analysis to validate the items retained.
Principal axis factoring was selected for the factor extraction method with
promax rotation with Kaiser Normalization. Maximum likelihood was
employed for confirmatory factor analysis. Finally, the reliability of each fac-
tor was calculated using Cronbach’s alpha.
Results
The EFA produced a 13-factor solution with eigenvalues larger than 1. How-
ever, after examining the scree plot, five factors were retained, thus increas-
ing the interpretability of the results. These five factors explained 44.5% of
the total variance. The items loading on each factor were carefully exam-
ined, and the factors were labeled as (1) “Person Knowledge,” (2) “Mental
Translation,” (3) “Directed Attention/Concentration,” (4) “Planning,” and
(5) “Problem-Solving.” Based on the results of EFA, a subsequent CFA was
conducted with separate data collected from a different sample. The three
models (i.e., four-factor solution, five-factor solution, and six-factor solution)
were tested using maximum likelihood estimation. The CFA results showed
that the five-factor model was a better overall fit. Based on these analyses,
the MALQ was considered to have robust psychometric properties as a mea-
sure of listening awareness.
Further Reading
• Discovering statistics using SPSS (Field, 2009)
• Exploratory factor analysis (Fabrigar & Wegener, 2012)
• Making sense of factor analysis: The use of factor analysis for instrument development
in health care research (Pett, Lackey, & Sullivan, 2003)
• Statistical techniques for the study of language and language behavior (Rietveld &
Van Hout, 1993)
• Exploratory factor analysis: A five-step guide for novices (Williams, Onsman, &
Brown, 2010)
Discussion Questions
1. In which kinds of L2 research do you think exploratory factor analysis can
be of importance?
2. Describe the differences between EFA and PCA.
3. What kinds of criteria can be used to ensure that the appropriate numbers of
factors are extracted? Why is it preferable to employ multiple factor retention
criteria?
4. What are some of the advantages and disadvantages of using rules of thumb
to check the factorability of the data?
5. Imagine that you carried out an EFA. Due to the page limitations of your
target journal, however, you are not able to justify your decisions or to report
all the results. Which results would you report?
6. Factor analysis is often contrasted with cluster analysis (see Staples & Biber,
Chapter 11 in this volume). In what ways are these two procedures similar?
In what ways are they different?
7. What are the advantages of performing an EFA rather than conducting mul-
tiple correlations?
8. Using the data provided on this book’s companion website (http://oak.ucc.
nau.edu/ldp3/AQMSLR.html), attempt to replicate the results from Loewen
et al. (2009). How do the results change if you alter some of the EFA options?
Note
1. The SPSS outputs of this study were used throughout this chapter.
References
Asención-Delaney, Y., & Collentine, J. (2011). A multidimensional analysis of a written L2
Spanish corpus. Applied Linguistics, 32, 299–322.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245–276.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum.
Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor
analysis practices in organizational research. Organizational Research Methods, 6(2), 147–168.
Costello, A., & Osborne, J. (2005). Best practices in exploratory factor analysis: Four rec-
ommendations for getting the most from your analysis. Practical Assessment, Research &
Evaluation, 10(7), 1–9.
Fabrigar, L. R., & Wegener, D. T. (2012). Exploratory factor analysis. New York: Oxford Uni-
versity Press.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the
use of exploratory factor analysis in psychological research. Psychological Methods, 4,
272–299.
Field, A. (2009). Discovering statistics using SPSS. London: Sage.
Ford, J. K., MacCallum, R. C., & Tait, M. (1986). The application of exploratory factor
analysis in applied psychology: A critical review and analysis. Personnel Psychology, 39,
291–314.
Glorfeld, L. W. (1995). An improvement on Horn’s parallel analysis methodology for
selecting the correct number of factors to retain. Educational and Psychological Measurement,
55, 377–393.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gorsuch, R. L. (1990). Common factor-analysis versus component analysis: Some well and
little known facts. Multivariate Behavioral Research, 25(1), 33–39.
Gorsuch, R. L. (2003). Factor analysis. In A. Schinka & W. F. Velicer (Vol. Eds.), Handbook
of psychology:Vol. 2. Research methods in psychology (pp. 143–164). Hoboken, NJ: Wiley.
Hair, J., Anderson, R. E., Tatham, R. L., & Black, W. C. (1995). Multivariate data analysis (4th
ed.). Upper Saddle River, NJ: Prentice Hall.
Harrington, D. (2009). Confirmatory factor analysis. Oxford: Oxford University Press.
Harshman, R. A., & Reddon, J. R. (1983). Determining the number of factors by compar-
ing real with random data: A serious flaw and some possible corrections. Proceedings of
the Classification Society of North America at Philadelphia, 14–15.
Hayton, J. C., Allen, D. G., & Scarpello, V. (2004). Factor retention decisions in exploratory
factor analysis: A tutorial on parallel analysis. Organizational Research Methods, 7(2),
191–205.
Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published
research: Common errors and some comment on improved practice. Educational and
Psychological Measurement, 66(3), 393–416.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational
and Psychological Measurement, 20, 141–151.
Kline, P. (2002). An easy guide to factor analysis. London: Routledge.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning, 65,
Supp. 1, 125–157.
Loewen, S., & Gass, S. (2009). Research timeline: The use of statistics in L2 acquisition
research. Language Teaching, 42(2), 181–196.
Loewen, S., Li, S., Fei, F., Thompson, A., Nakatsukasa, K., Ahn, S., & Chen, X. (2009). Sec-
ond language learners’ beliefs about grammar instruction and error correction. Modern
Language Journal, 93, 91–104.
Loewen, S., Lavolette, B., Spino, L. A., Papi, M., Schmidtke, J., Sterling, S., & Wolff, D.
(2014). Statistical literacy among applied linguists and second language acquisition
researchers. TESOL Quarterly, 48, 360–388.
MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor
analysis. Psychological Methods, 4, 84–99.
Mizumoto, A., & Takeuchi, O. (2012). Adaptation and validation of self-regulating capacity
in vocabulary learning scale. Applied Linguistics, 33(1), 83–91.
Norman, G. R., & Streiner, D. L. (2003). PDQ statistics (3rd ed.). Hamilton: BC Decker.
Pett, M. A., Lackey, N. R., & Sullivan, J. J. (2003). Making sense of factor analysis: The use of
factor analysis for instrument development in health care research. Thousand Oaks, CA: Sage.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and report-
ing practices in quantitative L2 research. Studies in Second Language Acquisition, 35,
655–687.
Plonsky, L., & Gonulal, T. (2015). Methodological synthesis in quantitative L2 research:
A review of reviews and a case study of exploratory factor analysis. Language Learning,
65, Supp. 1, 9–35.
Rietveld, T., & Van Hout, R. (1993). Statistical techniques for the study of language and language
behavior. New York: Mouton de Gruyter.
Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York:
Routledge.
Tabachnick, B., & Fidell, L. (2013). Using multivariate statistics (6th ed.). Boston: Pearson
Education.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and
applications. Washington, DC: American Psychological Association.
Tseng, W. T., Dörnyei, Z., & Schmitt, N. (2006). A new approach to assessing strategic
learning: The case of self-regulation in vocabulary acquisition. Applied Linguistics, 27,
78–102.
Velicer, W. F., Eaton, C. A., & Fava, J. L. (2000). Construct explication through factor or
component analysis: A review and evaluation of alternative procedures for determining the
number of factors or components. In R. D. Goffin & E. Helmes (Eds.), Problems and
solutions in human assessment: Honoring Douglas N. Jackson at seventy (pp. 41–71). Norwell,
MA: Kluwer Academic.
Widaman, K. F. (1993). Common factor analysis versus principal component analysis: Dif-
ferential bias in representing model parameters. Multivariate Behavioral Research, 28(3),
263–311.
Williams, B., Onsman, A., & Brown,T. (2010). Exploratory factor analysis: A five-step guide
for novices. Australasian Journal of Paramedicine, 8(3), n.p.
Winke, P. (2011). Evaluating the validity of a high-stakes ESL test: Why teachers’ percep-
tions matter. TESOL Quarterly, 45(4), 628–660.
Wittenborn, J. R., & Larsen, R. P. (1944). A factorial study of achievement in college Ger-
man. Journal of Educational Psychology, 35(1), 39.
10
STRUCTURAL EQUATION
MODELING IN L2 RESEARCH
Rob Schoonen
If there is one thing that we know in second language (L2) research, it is that there
are many factors involved in L2 learning and use. These factors are found in very
complex relationships, which may even change with increasing language profi-
ciency. These relationships are far more complex than what we can describe with
the computation of a series of simple bivariate correlations. L2 researchers have
to be able to deal with multivariate analyses of data. Structural equation model-
ing provides a framework to investigate these complex multivariate relationships.
Conceptual Motivation
Structural equation modeling (SEM), also known as causal modeling, covari-
ance structure analysis, or LISREL analysis, has as its distinguishing feature that
it requires some sort of modeling. Modeling implies that researchers need to be
explicit about the relationships they envisage between measured variables and
underlying constructs (i.e., latent variables) and between the constructs them-
selves. Therefore, a researcher has to think carefully about the hypothesized rela-
tionships before embarking on a SEM enterprise. SEM provides the researcher
with a toolbox that can uncover complex relationships that go well beyond the
bivariate relations as expressed in a correlation or a simple regression, but also
beyond the multivariate relationships that are usually addressed in a multiple
regression analysis (see Jeon, Chapter 7 in this volume).
SEM can be used at various stages of theory development, ranging from con-
firmatory testing to exploration. More specifically, Jöreskog and Sörbom (1996)
mention three situations for fitting and testing models. First is a strictly confirmatory situation, where there is a single model that is put to the test with empirical
data. The model is either accepted or rejected. Second is testing alternative or
competing models, when a researcher wants to choose between two or three con-
current models on the basis of a single data set. A third use is a model-generating
situation, when a researcher starts off with an initial model and then tries to
improve it on the basis of (mis)fit results (Jöreskog & Sörbom, 1996, p. 115).
The result of a model-generating situation should not be taken as a real statistical
testing of the (final) model, and the process of model improvement should not
only be guided by statistical outcomes but also by substantive theoretical consid-
erations. The resulting model should then be put to the test anew with different
data (creating a new, confirmatory situation).
The possibilities in a SEM analysis seem to be unlimited (see Hancock &
Schoonen, 2015), and the flexibility of the approach to address them makes SEM
a very attractive analytic framework, leading to an increase in recent years in the
use of SEM in L2 research (Plonsky, 2014). However, it is not difficult to imagine
that these options also carry the risk of using the technique uncritically (see the “Pitfalls” section in this chapter). Therefore, it is crucial that the user has theoretical
guidance with respect to the research questions he or she wants to investigate and
the analytic choices that need to be made. Lewin’s well-known quote that there
is “nothing so practical as a good theory” applies here for sure.
SEM is a collection of analyses that can be used to answer many research ques-
tions in L2 research. Prominent is the use of SEM to predict (or “explain”) complex
constructs, such as reading and writing proficiency, or the development of these
complex proficiencies, on the basis of scores on component skills. Other studies
investigate the complex relations between related constructs, such as motivation
and attitude toward foreign languages. At the initial stage of modeling these kinds of relationships, a researcher could start by drawing graphs depicting how constructs influence each other, or how they are related, using unidirectional or bidirectional arrows, respectively, to connect the constructs. To make it more concrete,
the constructs could be connected to measured, observed or manifest variables.
Conventionally, underlying or latent variables are represented as circles or ovals, and observed variables as rectangles (see Figures 10.2 and 10.3). SEM is also highly flexible,
able to deal with multiple dependent variables and multiple independent variables.
These variables can be continuous, ordinal, or discrete, and they can be indicated as
observed variables (i.e., observed scores) or as latent variables (i.e., the underlying
factor of a set of observed variables) (Mueller & Hancock, 2008; Ullman, 2006).
Examples of complex models in L2 studies can be found in, for instance, Gu (2014),
Schoonen, Van Gelderen, Stoel, Hulstijn, and De Glopper (2011) or Tseng and Schmitt (2008). Which measured and latent variables, and which relations, to include in the SEM analysis is up to the researcher. We should keep in mind that statistical
techniques per se cannot make substantive decisions. As is the case with nearly all
analyses described in this volume, SEM requires a number of choices to be made by
the researcher, and these choices must be made on solid theoretical grounds.
In the remainder of this chapter a number of examples will be presented to
illustrate the possibilities of SEM. Furthermore, a more detailed sample analysis will
be provided using two different software packages, LISREL and AMOS. Readers
interested in other packages or more extensive introductions to the available soft-
ware are referred to the corresponding manuals or specialized introductions (Byrne,
1998, 2006, 2010, 2012). Readers who want to learn more about SEM than this
chapter can offer, or who want to know more about the theoretical underpinnings
of SEM, will find suggestions for further reading at the end of this chapter.
[Figures (captions not recovered): path diagrams showing latent Working Memory variables measured by observed indicators LE1–LE3 and LD1–LD3]
When investigating the relationship between linguistic ability and some other construct, a researcher has
to decide whether linguistic ability can be measured by vocabulary, grammatical
knowledge, and pragmatic knowledge together or whether these three domains
should be kept separate and should be measured each on their own. This latter
type of research question is what is often treated as a confirmatory factor analysis
(CFA) problem (see Ockey, 2014). In other words: Do the measures involved
measure a single construct or do they measure multiple constructs?
Underlying Factors
When one wants to investigate the underlying structure of a set of variables, for
example the subtests of a test battery, one can use SEM to actually test hypotheses about the number of underlying factors and about their interrelations. Key is the testing of hypotheses, which implies that one has a priori one
or a few (competing) expectations that can be put to the test. This is differ-
ent from, for example, exploratory factor analysis (EFA) or principal component
analysis (PCA), where in a data-driven way the number of underlying factors
(or components) is determined according to a statistical criterion (Ockey, 2014;
Loewen & Gonulal, Chapter 9 in this volume). Using SEM, one has to model the
relationship between the measured variables and the hypothesized factors (i.e.,
latent variables) and subsequently test the fit of the model to the empirical data.
This makes it a CFA. An advantage of the SEM framework is that the relations
between selected factors can be modeled in the structural part of the model.
Imagine, for example, a second-language ability test battery that consists
of nine tests: Grammaticality Judgments (V1), Resolution of Anaphors (V2),
Understanding of Conjunctions (V3), Vocabulary Size (V4), Depth of Vocabu-
lary Knowledge (V5), Knowledge of Metaphors (V6), Sentence Comprehen-
sion (V7), Use of Verb Inflection (V8), and Use of Agreement (V9). A researcher
could question, for example, whether the nine test scores are best described (or explained) by a single underlying factor or by multiple distinct factors.
Advantages of SEM
The example in Figure 10.3 largely deals with the way one defines and measures
the theoretical variables (cf. CFA) and as such is considered part of the measure-
ment model. One of the advantages of SEM is that one can test the fit of the
hypothesized model against one’s data, and one can also compare and test the dif-
ference in fit between the two competing models described later in this chapter.
There are at least two other advantages to using SEM in these kinds of analy-
ses. First, researchers are more or less forced to come up with hypotheses about
relationships between their measurements (observed scores) and underlying con-
structs or latent variables. Most hypotheses in L2 research involve variables that
are not directly observable, such as language proficiency, working memory capac-
ity, speaking proficiency, and so on. However, in the actual empirical investigation
researchers want to test the tenability of their claims about these latent underlying
variables. Putting forward a measurement model makes this part of studies more
explicit and thus more open for empirical scrutiny and discussion. In some cases
theoretically relevant variables can be measured more directly, such as age or
parental education. In such cases, the observed and latent variables coincide.
FIGURE 10.3 Two competing models: a one-factor model (left: an L2 General linguistic factor measured by V1–V9) and a three-factor model (right: a Metacognitive, a Lexical-semantic, and a Morpho-syntactic factor measured by V1–V3, V4–V7, and V8–V9, respectively); each observed variable has its own measurement error (e1–e9)

Another advantage of SEM pertains to the more substantive analyses in the structural part of the model. Once one has modeled the collected data in a well-fitting measurement model, one can test substantive hypotheses with latent variables that
are so-called error-free. From Figure 10.3 one can see that the latent variables are
determined by the covariance of the different measured variables (V1–V9 in the
left panel or V1–V3, V4–V7 and V8–V9, respectively, in the right panel) and thus
that the idiosyncrasies of the measurements, including measurement error (e1–e9),
are partialed out (excluded). This way an analysis of the relations of latent variables
in the structural model, not being attenuated by measurement error, can provide
a clearer picture of what these relations are (see Mueller & Hancock, 2008, for
an example). In the structural part of our three-factor model, the researcher can
investigate whether the three factors simply covary as depicted in Figure 10.3 or
show more specific relations. For example: Is metacognitive knowledge the
result of lexical-semantic and morpho-syntactic proficiency? To test such a hypoth-
esis the relationship between the three factors should be modeled as regressions
(with one-directional arrows) in which metacognitive knowledge is the dependent
variable and lexical-semantic and morpho-syntactic proficiency are the indepen-
dent variables (analogous to Figure 10.2; see also Jeon, Chapter 7 in this volume).
Alternatively, one could also claim that the three factors are unrelated. This would
lead to a model without any connections between the three factors, or—in other
words—covariances of 0. Comparison of the fit of the various models to the avail-
able data as described later in this chapter will suggest which model is most plausible.
The previous example is—for practical reasons—kept simple, but numerous
multiple regression models with single as well as multiple dependent variables
in all kinds of different configurations can be analyzed if there are good substan-
tive reasons to do so (see Tseng & Schmitt, 2008; Schoonen et al., 2003; Gu,
2014). One could say that SEM elegantly combines factor-analytic procedures
with regression-analytic ones (and many more, see Hancock & Schoonen, 2015,
for examples in the L2 field; in addition, Rovine & Molenaar, 2003, show all
kinds of variance-analytic applications of SEM). However, this flexibility requires
substantial sample sizes, data that meet certain requirements, and a clear plan for the analyses, because the number of possibilities is sometimes overwhelming. In the next section, we will go into more detail as we discuss SEM
analyses step by step. First, we will focus on general principles and considerations
at the successive stages in SEM analyses. Second, we will have a closer look at
what an analysis looks like in two of the available packages for SEM analyses (see
the next section): LISREL, being one of the earlier and well-developed packages,
and AMOS, being part of the IBM SPSS family of packages.
Data Preparation
The data for the SEM analysis have to meet certain requirements for a straightfor-
ward analysis. For the procedures to work well and for the testing and parameter
estimation to be reliable, the continuous variables should be multivariate normally
distributed. Among other things (see Kline, 2010), this means that the individual
variables are univariate normally distributed. So, initial screening of the data is relevant for a valid interpretation of the outcomes of a SEM analysis. This includes checks on skewness and kurtosis of variables; outliers, too, can affect an analysis
in a detrimental way. Bivariate plots for pairs of variables give a first impression of
possible violations of a multivariate normal distribution. For an overview of mul-
tivariate assumptions and data preparation, see Jeon (Chapter 7 in this volume). If
data violate assumptions for SEM, especially multivariate normality, the researcher
can resort to other estimation methods within the SEM framework or apply cor-
rections to the outcome statistic (χ²) and the standard errors for the estimated
parameters (Satorra-Bentler’s scaled version). See West, Finch, and Curran (1995)
or Finney and DiStefano (2013) for an extensive discussion about the assumptions
in SEM and possible alternatives in case these assumptions are violated.
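To make the screening step concrete, the following sketch (Python, purely for illustration; any statistics package reports these values) computes sample skewness and excess kurtosis and shows how a strongly skewed variable would be flagged:

```python
import math
import random

def skewness(xs):
    """Sample skewness (third standardized moment)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def excess_kurtosis(xs):
    """Sample excess kurtosis (fourth standardized moment minus 3)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * s2 ** 2) - 3.0

random.seed(1)
normal = [random.gauss(0, 1) for _ in range(5000)]
skewed = [math.exp(x) for x in normal]   # log-normal: strongly right-skewed

assert abs(skewness(normal)) < 0.2       # roughly symmetric, as expected
assert skewness(skewed) > 1.0            # would be flagged during screening
assert excess_kurtosis(skewed) > 1.0     # heavy-tailed as well
```

In practice one would inspect these values for every observed variable, alongside bivariate plots, before fitting a model.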
In L2 research, as in other empirical domains, data sets are seldom complete.
There are several ways to deal with missing data, such as listwise deletion of cases
with missing data or estimation of a missing score on the basis of available scores.
Listwise deletion avoids controversial imputation of estimated scores. This approach,
however, is advisable only in cases where (a) data are assumed to be missing com-
pletely at random and where (b) the sample is large enough to endure the resulting
loss of statistical power. Imputation of missing values can be a good alternative, but
has its drawbacks as well. For example, replacing the missing score by the sample
mean will reduce the score variance, an important source of information in model-
ing. Fortunately, there are more advanced procedures for dealing with missing data.
Most software packages for SEM have their own, often very sophisticated, provisions for handling missing data, so it might be wise to consider their options (Kline, 2010; for a more thorough discussion see Enders, 2013). Working with incomplete
data implies that one works with the raw data (including missing value codes), and
not with just a correlation or covariance matrix as input data. However, using a
correlation or covariance matrix as the input data for an analysis is a viable option
if one wants to replicate analyses from the literature and only a covariance matrix
or a correlation matrix (preferably with corresponding means and standard devia-
tions) is available (see the next section and Discussion Question 8).
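The trade-off between listwise deletion and simple mean imputation can be illustrated with a toy example (hypothetical scores, for illustration only): deletion costs cases, while mean imputation keeps the cases but shrinks the variance.

```python
def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

scores = [4, 5, None, 7, 8, None, 10]                 # two cases missing
complete = [x for x in scores if x is not None]       # listwise deletion
mean = sum(complete) / len(complete)
imputed = [mean if x is None else x for x in scores]  # mean imputation

assert len(complete) == 5                      # two cases are lost
assert variance(imputed) < variance(complete)  # imputation shrinks variance
```

This is why the more advanced, model-based procedures built into SEM packages are usually preferable to either simple strategy.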
Designing a Model
After preparing the data, the most exciting part of the analysis begins: design-
ing the model. This process should be guided by theoretical considerations and
expectations, and can best be split into two stages (Mueller & Hancock, 2008).
The first stage involves testing the measurement model, which helps us deter-
mine whether the presumed latent variables are measured by the observed test
scores in the expected way. At this stage no constraints are implemented regard-
ing the relationships among the latent variables, so that any misfit of the model
is due to the way the latent and observed variables were presumed to be related
in the model.
Latent variables, being unobserved, do not have a scale of their own. To solve
this, one can either standardize the latent variable by fixing its variance at 1 (cf.
z-values) or equate the scale to that of one of the observed variables, a so-called
reference variable. In the latter case the regression weight for the observed vari-
able on the latent variable is fixed at a value of 1. Both solutions are equivalent.
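That the two scaling solutions are indeed equivalent can be verified numerically. The sketch below (made-up loadings and error variances, purely for illustration) builds the model-implied covariance matrix for a single factor with three indicators under both identifications:

```python
def implied_cov(loadings, factor_var, error_vars):
    """Model-implied covariance matrix of a one-factor model:
    factor_var * (outer product of loadings) plus diagonal error variances."""
    n = len(loadings)
    return [[factor_var * loadings[i] * loadings[j]
             + (error_vars[i] if i == j else 0.0)
             for j in range(n)] for i in range(n)]

errs = [0.5, 0.4, 0.6]
# Identification 1: standardize the latent variable (variance fixed at 1).
sigma1 = implied_cov([0.8, 0.7, 0.9], 1.0, errs)
# Identification 2: reference variable (first loading fixed at 1, variance free).
sigma2 = implied_cov([1.0, 0.7 / 0.8, 0.9 / 0.8], 0.8 ** 2, errs)

same = all(abs(sigma1[i][j] - sigma2[i][j]) < 1e-12
           for i in range(3) for j in range(3))
assert same  # both identifications imply the same covariance matrix
```

Because the implied covariance matrices are identical, the two choices fit the data equally well; they differ only in how the estimates are scaled.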
If the fit of the measurement model is satisfactory (that is, the model fits well)
and all observed measures can—to a reasonable extent—be explained by their
underlying variables, one can move on to the second stage: modeling the relation-
ships among the latent variables. However, if the measurement model does not
fit satisfactorily, the relations between the measured variables and the underlying variables need to be reconsidered. A variable might not be related to the underlying variable(s) in the expected way, or a variable may show only a weak relation to the underlying variable(s). Validity and/or reliability issues could be involved if
a measured variable does not fit the hypothesized relations.
At the second stage, when the structural model is developed, one can test the
substantive hypotheses about the theoretical constructs, either as a single model
or as competing models that can be compared to select the best model. There are
often many possibilities for modeling relationships between variables, especially in
complex data sets. Therefore it is wise to make a plan for the analyses beforehand
to avoid getting side-tracked or running the risk of “overfitting” (i.e., continuously adjusting the model to the data). There is a thin line between testing models
and exploring for new ones. One easily enters the phase of explorations in which
test statistics lose their original interpretation and outcomes require replication.
The building blocks of a model are its parameters, which basically consist of variances and covariances (and, derived from these, correlations and regression weights). When modeling a
parameter, a researcher has three options. The first option is to fix a parameter at
a certain value; for example, a covariance can be set at 0 when it is hypothesized
that there is no covariance between two variables and the parameter does not
need to be estimated, or a variance can be set at 1 when one wants to standardize
a latent variable. If one wants to equate a latent variable’s scale to that of a refer-
ence variable, the regression (“factor loading”) of that particular observed vari-
able on the latent variable can be set at 1 to achieve that. As a second option, the
researcher can model a parameter to be “free” and the program will estimate the
value of the parameter such that it fits the data best. This may be the case when,
for example, it is assumed that there is a relationship between latent variables (e.g.,
Metacognitive knowledge, Lexical-semantic knowledge, and Morpho-syntactic
knowledge in the earlier example), and we want an estimate of the size of the
covariance. In such cases, the covariance parameter will be modeled as a free
parameter. A third way in which a parameter can be modeled is to constrain it to
be equal to another parameter. One can postulate that covariances, regressions,
and/or variances are equal. These options for modeling parameters apply to the
structural and measurement part of a model alike. For example, in a test develop-
ment project a researcher could be interested in the question of whether tests
A and B are parallel in a psychometric sense. This—among other things—means
that the error variance in A and B and the regressions for A and B on the latent
variable are equal to each other, respectively (cf. Bollen, 1989; see Schoonen,
Vergeer, & Eiting, 1997 for an application).
Evaluating model fit is not a simple yes/no matter, because there are multiple ways of assessing the fit of a model: a statistical way and many descriptive ways. The
analysis gives a chi-square (or related) statistic with a corresponding p-value
and degrees of freedom (df ). In conventional null hypothesis testing, research-
ers usually want to reject the null hypothesis (e.g., p < .05). However in SEM
analyses, most of the time one does not want to reject the model. This raises
the question of whether p-values simply greater than .05 suffice. This issue is
further complicated by the fact that the chi-square in SEM analyses is sensi-
tive not only to sample size, but also to the number of parameters that had to
be estimated. Most researchers use the chi-square statistic as a more descrip-
tive indicator of model fit than as a serious statistical significance test. A ratio
of less than 2 for χ² / df is considered a good fit (Kline, 2010; Ullman, 2007).
The degrees of freedom are derived from the number of observed variables in
the input and the number of parameters estimated in the model, and as such
they are also a good check on the model specification. One should be able to
forecast the degrees of freedom for one’s model in a SEM analysis. If the data
set under investigation consists of m variables, the covariance matrix consists of
m (m + 1) / 2 elements. From this number, the number of parameters has to be
subtracted to get the degrees of freedom. Of course, two parameters set to be
equal count as a single estimated parameter. Predicting the degrees of freedom
of one’s model before actually running the analysis is thus a check of the correct
implementation of the model.
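This bookkeeping can be checked in a few lines of code. The parameter counts below are a reconstruction of the example models used later in this chapter (assuming the latent variances are fixed at 1), and they reproduce the reported degrees of freedom:

```python
def dof(m, n_free_params):
    """Degrees of freedom of a covariance-structure model: unique elements
    in the covariance matrix minus free parameters. Two parameters
    constrained to be equal count as one."""
    return m * (m + 1) // 2 - n_free_params

# Nine observed variables -> 9 * 10 / 2 = 45 unique (co)variances.
# One-factor model: 9 loadings + 9 error variances (factor variance fixed).
assert dof(9, 18) == 27
# Three-factor model: the same, plus 3 factor covariances.
assert dof(9, 21) == 24
# Adding one correlated error frees one further parameter.
assert dof(9, 22) == 23
```

Forecasting these numbers before running the analysis, as suggested above, is a cheap safeguard against misspecified syntax.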
In addition to a chi-square value, a SEM analysis will provide the researcher
with many more descriptive fit indices. Some are based on the differences (residu-
als) between the input covariance matrix and the reproduced covariance matrix
(e.g., standardized root mean square residual, or SRMR). Other indices take the
number of estimated parameters into account as well; the more parsimonious
the model is (i.e., the fewer estimated parameters), the better (e.g., the root mean
square error of approximation, or RMSEA). Others are based on a comparison
between the fit of the tested model and a basic or “null” model that assumes the
variables to be unrelated (e.g., the nonnormed fit index, or NNFI, also known
as the Tucker-Lewis index, and the comparative fit index, or CFI). Different fit
indices weight different aspects of the model (sample size, number of parameters,
residuals, etc.) differently (see Kline, 2010). For most of these fit indices both
lenient and strict cutoff criteria can be found in the literature (Hu & Bentler,
1999). As a rule of thumb, the SRMR should be lower than .08, the RMSEA
lower than .06, and the CFI higher than .95 (Hu & Bentler, 1999). As with
determining the number of factors in EFA or the number of clusters in a cluster
analysis (see Loewen & Gonulal, Chapter 9 in this volume, and Staples & Biber,
Chapter 11 in this volume), multiple fit indices should be taken into account to
avoid overprioritizing one particular criterion.
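Two of these indices can be computed directly from the chi-square values a program reports. The sketch below reproduces the CFI of the one-factor model fitted later in this chapter; the chapter does not report the sample size, so the RMSEA line uses a hypothetical n = 300, which happens to be consistent with the reported value:

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: improvement of the target model (M) over
    the independence ("null") baseline model (B)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, 0.0)
    return 1.0 - d_m / max(d_m, d_b)

def rmsea(chi2, df, n):
    """Root mean square error of approximation for a sample of size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# One-factor model vs. its independence model (df = 36), values from Text Box 2:
assert round(cfi(1177.53, 27, 3764.64, 36), 2) == 0.69
# With the hypothetical n = 300, the formula matches the reported RMSEA:
assert round(rmsea(1177.53, 27, 300), 2) == 0.38
```

Both values fall far short of the cutoffs given above (CFI > .95, RMSEA < .06), anticipating the conclusion drawn from the sample analysis.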
A third (additional) evaluation of a model consists of the inspection of the
model parameters themselves and the residuals. It could well be the case that,
generally speaking, a complex model fits the data well, but that at the same time
some “local” misfit exists. Therefore, a check of the residuals and of the meaning-
fulness of individual parameter estimates is advisable. Eyeballing the standardized
residuals (i.e., the standardized differences between the observed covariances of
the input variables and the reproduced covariances) may show outlying residuals
that indicate local misspecifications. In a similar vein, parameter estimates that are
illogical (such as a negative variance or a correlation out of the –1 to 1 range)
could flag a local misfit as well.
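A simple automated screen for such illogical estimates might look as follows (illustrative only; the estimate names and values are made up):

```python
def sanity_check(estimates):
    """Flag parameter estimates that are illogical and may signal local
    misfit: negative variances or correlations outside [-1, 1]."""
    problems = []
    for name, kind, value in estimates:
        if kind == "variance" and value < 0:
            problems.append(name)
        elif kind == "correlation" and not -1.0 <= value <= 1.0:
            problems.append(name)
    return problems

ests = [("error variance e5", "variance", -0.12),      # a Heywood case
        ("factor correlation F1-F2", "correlation", 0.65)]
assert sanity_check(ests) == ["error variance e5"]
```

Any flagged estimate warrants going back to the model specification rather than simply reporting the global fit indices.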
Pitfalls
One of the risks of using SEM is that researchers endlessly tweak a model,
helped by the so-called modification indices that indicate how the chi-square
will change if a certain fixed parameter is set free (Lagrange Multiplier test) or
if a free parameter is set fixed (Wald test). It is very tempting to attune a model according to these indices and thereby strive for more acceptable fit statistics. However, this is also a risky enterprise, because researchers are often
inclined to include relationships that are not theoretically supported, and after
a number of modifications the significance testing can no longer be seen as
real hypothesis testing and p-values become meaningless. The researcher might
end up with a hybrid model that most likely will not be replicable. If analyses
cannot be replicated, the study “might as well be sent to the Journal of Irreproduc-
ible Results or to its successor, The Annals of Improbable Research,” according to
Boomsma (2000, p. 464).
A more interesting and useful approach is to compare two competing mod-
els, preferably representing two stances in a theoretical debate. A comparison of
the fit of the two models could point to the model and the theoretical stance
that deserves our support. Consider, for example, the unitary, holistic view of
language proficiency versus the componential view mentioned earlier. A SEM
analysis of test scores could show that a multiple-factor model fits the data much
better than a one-factor model, and that multiple latent variables (components)
should be distinguished, favoring the componential view. Models that are hierarchically nested (i.e., the parameters of one model, A, form a subset of the parameters of the other, B) can be compared statistically with the chi-square difference test. The difference between the two models’ chi-squares is itself a chi-square, with degrees of freedom equal to the difference in the two models’ dfs (Δχ² = χ²A – χ²B; Δdf = dfA – dfB).
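The test is easy to carry out by hand or in a few lines of code. The values below anticipate the one-factor and three-factor models fitted later in this chapter; the .001 critical value for a chi-square with 3 df is about 16.27.

```python
def chi2_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Chi-square difference test for two hierarchically nested models
    (the restricted model A vs. the less restrictive model B)."""
    return chi2_restricted - chi2_free, df_restricted - df_free

# One-factor model (1177.53, df = 27) vs. three-factor model (120.28, df = 24):
d_chi2, d_df = chi2_difference(1177.53, 27, 120.28, 24)
assert d_df == 3
assert round(d_chi2, 2) == 1057.25
# Far beyond the .001 critical value for 3 df (about 16.27), so the
# less restrictive three-factor model fits significantly better.
assert d_chi2 > 16.27
```
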
In all cases, it is considered best practice to report the steps taken in the devel-
opment of the ultimate model, which parameters were set to be fixed at a certain
value, which ones were freely estimated, and which ones were constrained to be
equal to another parameter (Mueller & Hancock, 2008). A model’s replicability
is one of the points that is stressed by Boomsma (2000), quoting Steiger’s (1990)
adage: “An ounce of replication is worth a ton of inferential statistics” (p. 176).
In the command lines, the equals sign (=) can be read as “is determined by.”
The pre-final line in Text Box 1 will result in a path diagram that depicts the
hypothesized model and as such provides a nice check on the specification of
the model. By default the program will provide ML estimates. However, data
requirements such as multivariate normality need to be met to get trustworthy
estimates (Kline, 2010). The estimation procedure can be changed from ML to,
for example, GLS by adding an extra SIMPLIS command line, Method of Estimation: General Least Squares, just above or under Path diagram in Text Box 1, or by selecting Output > Simplis outputs. This leads us to options for the method of
estimation and other output features. Of course, there are many more options for
analyses and kinds of output LISREL can produce than can be demonstrated here
(see Jöreskog & Sörbom, 1996–2001 for more detailed descriptions).
The analysis is run by clicking the Run LISREL button in the top bar. If there
are no serious misspecifications or syntactical errors, the model will show the path
diagram with the estimates. One can switch to the output file with all the details
by means of the Window button. The LISREL output file that results from the
analysis echoes the command lines and the covariance matrix for reference. The
most important part of the outcomes consists of the parameter estimates with
their standard errors and the indices for model fit. In this example, fit indices as
reported in Text Box 2 indicate that the model should be rejected and does not
fit the data very well. None of the aforementioned fit indices that are reported
for the one-factor model comes close to the recommended cutoff for good fit.
Degrees of Freedom = 27
Minimum Fit Function Chi-Square = 1177.53 (P = 0.0)
(. . .)
Root Mean Square Error of Approximation (RMSEA) = 0.38
90 Percent Confidence Interval for RMSEA = (0.36 ; 0.40)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.00
(. . .)
Chi-Square for Independence Model with 36 Degrees of
Freedom = 3764.64
(. . .)
Normed Fit Index (NFI) = 0.69
Non-Normed Fit Index (NNFI) = 0.59
Parsimony Normed Fit Index (PNFI) = 0.52
Comparative Fit Index (CFI) = 0.69
Incremental Fit Index (IFI) = 0.69
(. . .)
In a similar way one can build a three-factor model; that is, one has to
replace the last six lines of the setup as represented in Text Box 1 and intro-
duce three latent variables (instead of one): Metacognition, Lexical-Semantic,
and Morpho-Syntactic Knowledge (see Text Box 3). Working with the LIS-
REL menu, one can add and rename labels for latent variables via Setup >
Variables as illustrated earlier, and then redesign the model accordingly in
the upper panel (see Figure 10.7). This model specification can be fitted to
the data by clicking the Run LISREL button in the top bar. The results show
that a three-factor model is far more realistic and that it fits the data much better, although still not well. The fit indices (see Text Box 4) come
close to the required level for good fit. Statistically speaking, the model has to
be rejected (χ² = 120.28, df = 24), but it constitutes an enormous improve-
ment compared to the first model (χ² = 1,177.53, df = 27). At the “cost” of
three extra estimated parameters (these are the covariances between the latent
variables), the reduction in chi-square is remarkable and statistically signifi-
cant (Δχ² = 1,057.25, Δdf = 3, p < .001), which means that the less restrictive
three-factor model is preferred. The RMSEA, which reduced from .38 to .12,
however, indicates that the model fit is still not satisfactory. The normed fit index (NFI) and the CFI both show a noticeable increase (from .69 to .97) and both are satisfactory. The SRMR dropped from .22 to .041, which is in the
range of acceptable models.
Degrees of Freedom = 24
Minimum Fit Function Chi-Square = 120.28 (P = 0.00)
( . . . )
Root Mean Square Error of Approximation (RMSEA) = 0.12
90 Percent Confidence Interval for RMSEA = (0.097 ; 0.14)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.00
( . . . )
Chi-Square for Independence Model with 36 Degrees of
Freedom = 3764.64
( . . . )
Normed Fit Index (NFI) = 0.97
Non-Normed Fit Index (NNFI) = 0.96
Parsimony Normed Fit Index (PNFI) = 0.65
Comparative Fit Index (CFI) = 0.97
Incremental Fit Index (IFI) = 0.97
( . . . )
Root Mean Square Residual (RMR) = 1.56
Standardized RMR = 0.041
( . . . )
It depends on the research context whether the researcher can defend additional theoretically supported model improvements, or whether he or she enters the phase of exploration.
For the sake of demonstration, let us assume that all test scores but two are derived from separate test administrations. The exceptions are V5 and V7, which are subtest scores derived from one and the same test. As a consequence, distur-
bances during that test will affect both scores. In other words, there might be
so-called correlated error. This phenomenon can be modeled by allowing covari-
ance between the two residuals concerned (e5 and e7); in other words, add the line
Let error covariance between V5 and V7 be free in the model specification. A final
analysis shows that this extra free parameter in the model substantially improves fit (χ² = 71.16, df = 23, RMSEA = .08, NFI = .98, CFI = .99, SRMR = .034). Not all
indices are completely satisfactory for this model (χ² / df > 2, RMSEA = .08) but
if there are no more plausible parameters to add, the researcher might want to stop
here and inspect the parameter estimates. When the parameter estimates are logi-
cal and within the normal ranges (for example, no negative estimates of variance),
then the researcher can start the substantive interpretation. In this simple model, the important result is that the nine observed variables are explained to a large extent by the three presumed latent variables. The coefficients of determination (R²) range from .62 to .93, which is reasonably good (see Text Box 5). From a theoretical
point of view the correlations between the latent variables are interesting: How
high are they? Are they different from 0 and—at the other end—sufficiently dif-
ferent from 1? In this case, LISREL reports .31 (.05), .65 (.04), and .63 (.04), with the corresponding standard errors in parentheses for use in confidence intervals and/or significance testing. When the standard errors are taken into account, it can be concluded that the estimates are (statistically) different from both 0 and 1. In this example, the focus was on the latent variables underlying the nine observed variables. In a next step, or when addressing different research questions, one could investigate
whether claims about “causal” relations between the three latent variables of the
kind illustrated in Figure 10.2 can be maintained. One may want to test whether
metacognitive knowledge is the result of lexical-semantic and morphosyntactic
knowledge. To address that question the regression of Metacognitive Knowledge
on the Lexical-Semantic and the Morphosyntactic factors should be specified.
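The reasoning about standard errors above can be made concrete with Wald-type statistics. The following sketch uses the reported estimates and standard errors under the usual normal-theory approximation; it is an illustration, not LISREL output:

```python
# Wald-type checks for the latent correlations reported above, given as
# (estimate, standard error) pairs. A normal-theory sketch; near the
# boundary of 1 this approximation is rough.

def wald_z(estimate, se, null_value):
    """z statistic for H0: parameter == null_value."""
    return (estimate - null_value) / se

def ci95(estimate, se):
    """Approximate 95% confidence interval."""
    return (estimate - 1.96 * se, estimate + 1.96 * se)

for est, se in [(0.31, 0.05), (0.65, 0.04), (0.63, 0.04)]:
    z0, z1 = wald_z(est, se, 0.0), wald_z(est, se, 1.0)
    lo, hi = ci95(est, se)
    print(f"r = {est}: z(vs 0) = {z0:.1f}, z(vs 1) = {z1:.1f}, "
          f"95% CI = ({lo:.2f}, {hi:.2f})")
```

The interval for the .65 correlation, for instance, runs from roughly .57 to .73, comfortably away from both 0 and 1.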
Degrees of Freedom = 23
Minimum Fit Function Chi-Square = 71.16 (P = 0.00)
(. . .)
Root Mean Square Error of Approximation (RMSEA) = 0.080
90 Percent Confidence Interval for RMSEA = (0.060 ; 0.10)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.0079
(. . .)
Chi-Square for Independence Model with 36 Degrees of
Freedom = 3764.63
(. . .)
Normed Fit Index (NFI) = 0.98
Non-Normed Fit Index (NNFI) = 0.98
Parsimony Normed Fit Index (PNFI) = 0.63
Comparative Fit Index (CFI) = 0.99
Incremental Fit Index (IFI) = 0.99
(. . .)
Root Mean Square Residual (RMR) = 1.32
Standardized RMR = 0.034
(. . .)
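Because the model with the free error covariance is nested in the model without it, the two chi-squares reported above can be compared with a likelihood-ratio difference test. A small sketch, using only the Python standard library (for one degree of freedom, the upper-tail chi-square probability has the closed form erfc(√(x/2))):

```python
# Chi-square difference test for the nested models above: without the free
# error covariance (chi2 = 120.28, df = 24) versus with it
# (chi2 = 71.16, df = 23).
import math

def chi2_sf_df1(x):
    """Upper-tail probability of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

delta_chi2 = 120.28 - 71.16   # difference in chi-square
delta_df = 24 - 23            # difference in degrees of freedom
p = chi2_sf_df1(delta_chi2)
print(f"delta chi2 = {delta_chi2:.2f}, df = {delta_df}, p = {p:.2e}")
```

Note that with Satorra-Bentler scaled chi-squares, such as those Gu (2014) reports, a simple difference of this kind is not appropriate; a scaled difference test is needed.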
The same analyses can be done in AMOS by drawing the required model
with the tools provided in the program. The opening screen of AMOS (Graph-
ics) consists of three parts, with the left-most panel showing a toolbox for model
drawing. Hovering the cursor over an icon in the toolbox shows its function. From this panel one can select the tools needed for drawing the model: circles and boxes for latent and measured variables, respectively; single- and double-headed arrows; a tool to add measured variables to a latent variable; and an eraser to delete parts of a model. Once a model is designed, one can import the data by clicking the corresponding toolbar button, selecting Filename, and then browsing the computer for the right data file (see Figure 10.8). By default this is an SPSS file, but other formats can be read as
well. All variables in the model need to be named, and the measured variables in
the model need to be linked to variables in the data file. Double-clicking circles
lets you key in names for latent variables. Note that the “errors” need to be named
as well, for example E1 through E9, because they are treated as latent variables in
AMOS. Desired features of the analysis or its outcomes, such as ML estimation and standardized parameter estimates, can be set in the Analysis Properties menu, which you can access from the toolbar. If the model is fully designed, the data and variables
234 Rob Schoonen
are included, and the features for the analysis are set, the Calculate button can
be clicked. The two top buttons in the middle panel now allow the researcher
to toggle between a representation of the model as designed (i.e., input) and a
representation of the model with parameters (i.e., output). However, the details of the analysis, such as fit indices, standard errors, and possible warnings, are provided in text form. Clicking View Text provides access to the text file, with a table of contents (navigation tree) at the left and the corresponding results at the right. Figure 10.9 shows the fit indices for our model with three factors and correlated error. The chi-square is identical to that of the LISREL analysis (71.16), as are the fit indices. In AMOS, fit indices are reported alongside those of an independence model and a saturated model. The model of interest is the Default model, labeled this way because we did not enter a name for it.
This has been only a brief introduction to the possibilities of AMOS and LISREL. Readers who wish to embark on SEM analyses are best advised to familiarize themselves with the software manual, which is usually embedded in the package under Help, or to consult more extensive introductions aimed at a particular package (see Byrne, 1998, 2010).
FIGURE 10.9 Output file three-factor model with correlated error in AMOS
In Text Box 6 we briefly present parts of a recent study that uses SEM in
various ways. Here we focus on the underlying structure of the TOEFL iBT that
Gu investigated as part of her doctoral dissertation. In the dissertation and the
article (Gu, 2014), a multigroup analysis was conducted to investigate whether
the underlying structure holds for two different groups, and whether level of
performance was related to studying abroad.
Background
Gu (2014) investigated the structure of scores on the Internet-based Test of English as a Foreign Language (TOEFL iBT). This study combines several
Research Questions
1) Is the factorial structure of academic language ability the same for stu-
dents who have studied abroad and students who have not done so (a
study-abroad group versus a home-country group)?
2) Do the two groups differ in their scores on the underlying factors (i.e.,
latent variables) of academic English?
3) Is there a relationship between length of study abroad and the level on
the underlying factors?
Here we focus on Research Question 1.
Method
The data consisted of the test scores and questionnaire responses of 1,000 and 370 test takers, respectively. The subsample that answered the questionnaire was split into two groups: (a) test takers who had never lived in an English-speaking environment (n = 124) and (b) test takers who had lived in such an environment (n = 246). Data for the present analysis were based on the test scores of 1,000 candidates for listening, reading, writing, and speaking. From the questionnaire data, Gu derived information about exposure to the English language and instruction.
Using the Mplus SEM package (Muthén & Muthén, 2010), Gu explicitly reports checks of relevant assumptions such as normality. Since some score distributions deviated from normality, Gu opted for an adjusted estimation of the chi-square, the derived indices, and the standard errors of the parameters (the Satorra-Bentler correction). The scale for each latent variable was determined by using a reference variable and fixing its loading on the latent variable to 1.
Results
Gu postulated three plausible models for the structure of the four skills.
The fit of these models and the comparison thereof was used to choose the
best model. Model 1 follows the scoring procedure of TOEFL-iBT and previ-
ous research. It consists of four factors representing the four skills and one
higher-order, overarching factor (“Language Ability”) that is supposed to
capture the correlations between the four skills. Model 2 is a straightforward
four-factor model with intercorrelated factors, one for each skill. Model 3
Structural Equation Modeling 237
consists of two factors: “Speaking” on the one hand and “Reading, Writing,
and Listening” on the other. This latter model is based on previous research,
but is theoretically speaking less transparent (see Gu’s Figure 4, reproduced
below).
Model fit was evaluated in several ways, as it should be: overall fit (chi-square test, CFI, RMSEA, SRMR), evaluation of parameter estimates, and parsimony for equally well-fitting models. The SEM analyses showed reasonable fit for all three models, with Model 3 fitting somewhat less well.
[Gu's Figure 4, reproduced: path diagram of Model 3, with indicators L1, L2, L6, R3, and W2 loading on a combined Listening/Reading/Writing factor and S1, S2, S5, and S6 loading on a Speaking factor, each indicator with its own residual term. Reported fit: estimation = MLM; N = 1,000; χ²(118) = 530.73; CFI = .96; RMSEA = .06; SRMR = .04.]
In Sum
SEM is a flexible approach to data analysis, especially for larger data sets that represent more complex relationships. The possibilities for applying SEM are enormous, but the substantive interpretation of models and parameter estimates depends heavily on carefully conducted analyses that take into account data requirements and the risk of overfitting the model.
Further Reading
There are many different introductions and advanced volumes dealing with SEM.
A good starting point could be the manual of the software package that one
wants to use. The manual can provide a quick introduction to the theoretical
considerations, many of which are only touched upon here. Byrne (1998, 2006,
2010, 2012) wrote different introductions for different software packages (Mplus,
LISREL, EQS, AMOS). More general introductions include Raykov and Mar-
coulides (2006), Kline (2010), Mueller & Hancock (2008) and Ullman (2007).
These volumes also cover some of the more advanced applications, such as multi-
group analysis in which models are fitted simultaneously in two (or more) groups
(for example, boys and girls, L1 and L2 speakers, or study-abroad and study-home
as in Gu’s study), or latent growth modeling in which different curves of develop-
ment can be modeled and related to predictor variables. Hancock and Mueller
(2013) provide in their edited volume what they call a “second course,” that is, the
contributions take the applications a step further and deal with topics like missing
data, categorical data, power analysis, and so forth.
There is also a journal dedicated to structural equation modeling that pub-
lishes applications from all fields, discusses methodological issues, and has a
“teacher’s corner” that presents brief instructional articles on SEM-related issues:
Structural Equation Modeling: A Multidisciplinary Journal (ISSN 1070–5511 [Print],
1532–8007 [Online]).
There are also a number of introductions and applications in the field of applied
linguistics and language assessment; see Hancock and Schoonen (2015), Kunnan
(1998), Schoonen (2005), In’nami and Koizumi (2011, 2012), and Ockey (2014).
Discussion Questions
1. Select a study that uses SEM and read the abstract, introduction, and research
questions. On the basis of your reading, draw the model you expect the research-
ers to test. In what respect does your model diverge from the model actually
tested? To what extent can you understand the differences between your model
and the author’s? Are there any unexpected differences and are these motivated
(a priori or post hoc)? How logical are the unexpected differences?
2. Select a study that uses SEM and that postulates correlated error. Are these
parameters well explained in terms of the measurement procedures?
3. Select two SEM studies. What criteria do they use for model fit? Do they use criteria from different families of fit indices? Are there any other differences between the two studies? If you were to apply the criteria from one study to the other, would that affect the model selection (and conclusions) in the other study? How so?
4. It is claimed that the correlations between latent variables are not attenuated
by measurement error. Can you corroborate that on the basis of the data in
Text Box 1? What is the average correlation between the observed variables
for Metacognitive Knowledge (V1–V3) and observed variables for Morpho-
syntactic Knowledge (V8–V9)? How does that compare to the .65 reported
for the correlation between the latent variables?
5. Using the data set made available along with this chapter (http://oak.ucc.nau.
edu/ldp3/AQMSLR.html), explore whether another structural model for the
three latent variables in the sample analysis is plausible (e.g., Metacognitive
Knowledge as the result of the two latent linguistic variables). How plausible is
a model with Metacognitive Knowledge independent of the two latent linguis-
tic variables? Try to model these “hypotheses” and fit the models to the data.
6. How could you test whether the two latent linguistic variables coincide? In
other words, test a two-factor model with a metacognitive factor (V1–V3)
and a linguistic factor (V4–V9). How does this model compare to the
one-factor model? To the three-factor model?
7. SEM and factor analysis have a lot in common. What similarities and dif-
ferences between the two approaches can you think of ? When would one
approach be more appropriate or informative than the other?
8. Gu (2014) provides the correlation matrix of the measured variables involved in the models, as well as descriptive statistics. By doing so, the author allows you to replicate her analysis (consult the AMOS manual for importing a matrix). You can start a LISREL analysis with the setup provided in Text Box 1, and then continue by adjusting it. Choose your own title, define the observed variables (L1–W2), insert “correlation matrix” and replace the matrix with Gu's matrix, change the sample size, define your latent variables, and specify the relations (see also Text Box 3). As you probably know, correlations are standardized covariances, and the standardization is based on the standard deviations of the two variables involved (see Kline, 2010). LISREL can derive the covariances from the correlations on the basis of the standard deviations. So add another command, just above or below the correlation part, that starts with “Standard deviations” and then, on the next line, list all the standard deviations. Now replicate Models 2 and 3 from Gu's study (i.e., the correlated four- and two-factor models).1 What do you find? There will be small differences due to slightly different algorithms, but the overall outcome should be highly similar. The difference in chi-square is also due to a correction Gu applied to account for the slightly nonnormal data she had. It is beyond the scope of this chapter to go into the details.
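The conversion from correlations to covariances described in question 8 can be sketched in a few lines: each covariance is the correlation multiplied by the two standard deviations. The matrix and standard deviations below are made up for illustration and are not Gu's values:

```python
# Deriving a covariance matrix from a correlation matrix and standard
# deviations: cov[i][j] = r[i][j] * sd[i] * sd[j]. The 3 x 3 inputs here
# are invented for illustration (not Gu's data).

def corr_to_cov(corr, sds):
    n = len(sds)
    return [[corr[i][j] * sds[i] * sds[j] for j in range(n)]
            for i in range(n)]

corr = [[1.0, 0.5, 0.3],
        [0.5, 1.0, 0.4],
        [0.3, 0.4, 1.0]]
sds = [2.0, 3.0, 1.5]

cov = corr_to_cov(corr, sds)
for row in cov:
    print([round(v, 2) for v in row])  # diagonal holds the variances
```

With Gu's matrix and standard deviations substituted in, this is the arithmetic LISREL performs when the “Standard deviations” command is supplied.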
Note
1. If you work with LISREL's student version, then you are restricted to 16 observed variables, whereas Gu (2014) has 17. You could either delete the first variable, L1 for Listening, or resort to the 15-day trial version of LISREL. If you delete L1, your results will of course differ, as will the degrees of freedom. Can you predict df?
Acknowledgment
The author wishes to thank Jan Hulstijn, Camille Welie, Luke Plonsky, and two anonymous
reviewers for their helpful comments. All remaining errors are the author’s.
References
Arbuckle, J. L. (2012). IBM® SPSS® AMOS™ 21 User’s Guide. Chicago: IBM Software
Group.
Bentler, P.M. (2006). EQS 6 Structural Equations Program Manual. Encino, CA: Multivariate
Software.
Bollen, K. A. (1989). Structural equations with latent variables. New York: John Wiley & Sons.
Boomsma, A. (2000). Reporting analyses of covariance structures. Structural Equation Mod-
eling: A Multidisciplinary Journal, 7(3), 461–483.
Byrne, B. M. (1998). Structural equation modeling with LISREL, PRELIS, and SIMPLIS: Basic
concepts, applications, and programming. Mahwah, NJ: Lawrence Erlbaum.
Byrne, B. M. (2006). Structural equation modeling with EQS: Basic concepts, applications, and
programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Byrne, B. M. (2010). Structural equation modeling with AMOS: Basic concepts, applications, and
programming (2nd ed). New York: Taylor & Francis.
Byrne, B. M. (2012). Structural equation modeling with Mplus: Basic concepts, applications, and
programming. New York: Taylor & Francis.
Enders, C. K. (2013). Analyzing structural equation models with missing data. In G. R.
Hancock & R. O. Mueller (Eds.), Structural equation modeling. A second course (2nd ed.,
pp. 493–519). Charlotte, NC: Information Age Publishing.
Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation
modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling. A sec-
ond course (2nd ed., pp. 439–492). Charlotte, NC: Information Age Publishing.
Fox, J. (2006). Structural equation modeling with the sem package in R. Structural Equation
Modeling, 13(3), 465–486.
Gu, L. (2014). At the interface between language testing and second language acquisition:
Language ability and context of learning. Language Testing, 31(1), 111–133.
Hancock, G. R., & Mueller, R. O. (Eds.) (2013). Structural equation modeling. A second course
(2nd ed.). Charlotte, NC: Information Age Publishing.
Hancock, G. R., & Schoonen, R. (2015). Structural equation modeling: Possibilities for
language learning researchers. Language Learning, 65: Suppl 1, 158–182.
Hu, L., & Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure anal-
ysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1),
1–55.
In’nami, Y., & Koizumi, R. (2011). Structural equation modeling in language testing and
learning research: A review. Language Assessment Quarterly, 8(3), 250–276.
In’nami, Y., & Koizumi, R. (2012). Factor structure of the revised TOEIC® test: A
multiple-sample analysis. Language Testing, 29(1), 131–152.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: Structural equation modeling with the SIM-
PLIS command language. Chicago: Scientific Software International.
Jöreskog, K. G., & Sörbom, D. (1996–2001). LISREL 8: User’s Reference Guide (2nd ed.).
Lincolnwood, IL: Scientific Software International.
Kline, R. B. (2010). Principles and practice of structural equation modeling (3rd ed.). New York:
The Guilford Press.
Kunnan, A. J. (1998). An introduction to structural equation modeling for language assess-
ment research. Language Testing, 15(3), 295–332.
Mueller, R. O., & Hancock, G. R. (2008). Best practices in structural equation modeling.
In J. Osborne (Ed.). Best practices in quantitative methods (pp. 488–508). Thousand Oaks,
CA: Sage.
Muthén, L. K., & Muthén, B. O. (2010). Mplus user’s guide. Statistical analysis with latent vari-
ables (6th ed.). Los Angeles: Muthén & Muthén.
Ockey, G. J. (2014). Exploratory factor analysis and structural equation modeling. In A. J.
Kunnan (Ed.), The companion to language assessment. Vol. III: Evaluation, Methodology,
and Interdisciplinary Themes (pp. 1224–1244, Part 10, Chapter 73). Malden, MA: John
Wiley & Sons.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodologi-
cal synthesis and call for reform. Modern Language Journal, 98, 450–470.
Raykov, T., & Marcoulides, G. A. (2006). A first course in structural equation modeling (2nd ed.).
Mahwah, NJ: Erlbaum.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Rovine, M. J., & Molenaar, P.C.M. (2003). Estimating analysis of variance models as struc-
tural equation models. In B. H. Pugesek, A. Tomer, & A. von Eye (Eds.), Structural equa-
tion modeling: Applications in ecological and evolutionary biology (pp. 235–280). Cambridge:
Cambridge University Press.
Schoonen, R. (2005). Generalizability of writing scores. An application of structural equa-
tion modeling. Language Testing, 22 (1), 1–30.
Schoonen, R., Van Gelderen, A., De Glopper, K., Hulstijn, J., Simis, A., Snellings, P., &
Stevenson, M. (2003). First language and second language writing: the role of linguistic
fluency, linguistic knowledge and metacognitive knowledge. Language Learning, 53(1),
165–202.
Schoonen, R., Van Gelderen, A., Stoel, R., Hulstijn, J., & De Glopper, K. (2011). Model-
ing the development of L1 and EFL writing proficiency of secondary-school students.
Language Learning, 61, 31–79.
Schoonen, R., Vergeer, M., & Eiting, M. (1997). The assessment of writing ability: Expert
readers versus lay readers. Language Testing, 14(2), 157–184.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation
approach. Multivariate Behavioral Research, 25(2), 173–180.
Tseng, W.-T., & Schmitt, N. (2008). Toward a model of motivated vocabulary learning:
A structural equation modeling approach. Language Learning, 58(2), 357–400.
Ullman, J. B. (2006). Structural equation modeling: Reviewing the basics and moving for-
ward. Journal of Personality Assessment, 87(1), 35–50.
Ullman, J. B. (2007). Structural equation modeling. In B. G. Tabachnick & L. S. Fidell (Eds.),
Using multivariate statistics (5th ed., pp. 676–780). Boston: Pearson/Allyn and Bacon.
West, S. G., Finch, J. F., & Curran, P. J. (1995). Structural equation models with nonnormal
variables. Problems and remedies. In R. H. Hoyle (Ed.), Structural equation modeling. Con-
cepts, issues, and applications (pp. 56–75). Thousand Oaks, CA: Sage.
11
CLUSTER ANALYSIS
Shelley Staples and Douglas Biber
Conceptual Motivation
Research in applied linguistics typically involves comparisons among groups of
speakers. Those groups can be defined in terms of many different types of cat-
egorical variables, such as students from different first language (L1) backgrounds,
or students in a treatment group versus a control group. Those groups are then
usually compared with respect to quantitative (dependent) variables, such as per-
formance scores on a language test. It is often the case, though, that there is
considerable variation within these groups. For example, while there might be
significant differences in language test scores between a treatment group and a
control group, there will also often be considerable variation among students
within each of those groups. Cluster analysis can be useful for situations like
this, because it provides a bottom-up way to identify new groups that are better
defined with respect to target variables.
Cluster analysis is a multivariate exploratory procedure that is used to group
cases (e.g., participants or texts). Cluster analysis is useful in studies where there is
extensive variation among the individual cases within predefined categories. For
example, many researchers compare students across proficiency level categories,
defined by their performance on a test or holistic ratings. But a researcher might
later discover that there is extensive variation among the students within those
categories with respect to their use of linguistic features or with respect to attitu-
dinal or motivational variables. Cluster analysis provides a complementary way to
group students based directly on such variables. So, for example, cluster analysis
could be used to identify groups of students with positive attitudes and intrinsic
motivations; a group with positive attitudes and extrinsic motivations; a group
with negative attitudes and intrinsic motivations; and so on. Those new categories could then be described and compared with respect to a range of other linguistic
244 Shelley Staples and Douglas Biber
division of two clusters of L2 speakers: one cluster that primarily used formulaic
language to achieve fluency and the other that used a variety of other strategies,
including filled pauses, repetitions, and discourse markers. Jarvis, Grant, Bikowski, and Ferris (2003) and Friginal, Li, and Weigle (2014) are also innovative in their exploration of multiple linguistic profiles of high-scoring L2 writers.
Cluster analysis can also be used to investigate the linguistic development of
learners over time by determining how linguistic features cluster within texts
produced by learners at various points in time. Gries and Stoll (2009), for exam-
ple, focus on clustering individual performances by a single speaker based on
changes in one linguistic feature—mean length of utterance (MLU)—over time.
By identifying performances that cluster together, clear developmental stages can
be identified in the data. This method could also be applied to longitudinal studies
of L2 development and to additional variables (e.g., development of the lexicon).
Two other areas within L2 research where cluster analysis has been applied
are L2 assessment and language policy and planning. Eckes (2012) used cluster
analysis to determine rater types based on their behavior in rating a high-stakes
German foreign language test. Leung and Uchikoshi (2012) investigated the language planning and policy profiles (e.g., language use in the home) of parents of bilingual Cantonese- and English-speaking children in relation to the children's proficiency in each language.
Other studies outside the field of L2 research point to different ways in which
linguistic variables can be used to cluster texts (oral and written). First, studies of
register variation have been very fruitful in identifying text types, which are based
on the clustering of texts that are similar in their use of linguistic variables. For
example, Biber (1989), Biber and Finegan (1989), Biber (1995), and Biber (2008) have all investigated a wide range of lexico-grammatical features to determine clusters of texts that are similar in their linguistic characteristics, and then examined those groupings in relation to established text categories (e.g., scientific writing,
face-to-face conversation). Csomay (2002) used cluster analysis to identify different functional episode types within classroom discourse, that is, sections of text clustered on the basis of their similar linguistic features; four episode types were identified. Gries, Newman, and Shaul (2011) provide an example of how texts can be
grouped by their use of frequent lexical strings (i.e., n-grams or lexical bundles).
Text-linguistic applications of cluster analysis may also be useful to L2 researchers,
as they reveal information about the linguistic features used in particular domains
of language use. Such findings, similar to those for factor analysis, allow ESP and
EAP researchers, teachers, and materials developers to understand more about the
linguistic nature of particular registers of a language. This same approach could be
used in L2 research, clustering texts of learner production that are similar in their
linguistic characteristics, and then considering the relation of those clusters to a
priori categories such as task features or different proficiency levels. We explore
a study of this type in the next section, which documents the process used to
perform cluster analysis.
each of the predictor variables (e.g., test scores, Likert scale items) will be added
as columns. There may be other variables not included in the analysis (e.g., pro-
ficiency level) that will be included in the data set as columns but not added to
the HCA. The first step is to select Analyze > Classify > Hierarchical Cluster
Analysis, as shown in Figure 11.1.
of clusters. The Cluster Membership feature produces output that identifies the
cluster of each case. At this stage of the analysis, we are trying to determine the
optimal number of clusters, so you should choose None under Cluster Membership.
Cluster membership will be identified later (see Step 9) using the Save function.
Click Continue when finished.
clustering, and Ward’s method. We will use Ward’s method, but here provide a
short explanation of each of the other options. Based on a review of the literature,
Ward’s method is the most commonly used measure within HCA (see, e.g., Eckes,
2012; Gries et al., 2011; Leung & Uchikoshi, 2012).
The simplest method is the nearest neighbor (also known as single linkage)
method. In this method, cases are joined to existing clusters if at least one of the
members of the existing cluster is of the same level of similarity as the case under
consideration for inclusion (Aldenderfer & Blashfield, 1984, p. 38). The major
advantage of this method is that the results will not be affected by data transfor-
mations. The major disadvantage of this method is that it tends to form chains
of linkage within the data such that toward the end of the clustering, one large
cluster may eventually be formed with individual cases being added one by one.
Visual examination of the data is also not very helpful (Aldenderfer & Blashfield,
1984, pp. 39–40).
The furthest neighbor or complete linkage method indicates that the new case
is added to an existing cluster if it is within a certain level of similarity to all
members of the cluster (Aldenderfer & Blashfield, 1984, p. 40). This method
tends to produce the opposite of the single linkage, namely very tight clusters
with high within-group similarity. However, relatively similar objects may stay
in different clusters for a long time, creating the opposite problem from that of
single linkage.
The between-groups linkage or average linkage method was developed to find a
compromise between the two extremes of the single and complete linkage meth-
ods. It calculates the average of all possible distances between all pairs of cases in
Cluster A and all pairs of cases in Cluster B and combines the two clusters if a
given level of similarity is achieved (Aldenderfer & Blashfield, 1984, pp. 40–41;
Norušis, 2011, p. 373).
While between-groups linkage uses pairs of cases, within-groups linkage adds an
additional consideration of the average measure of all possible pairs of cases in a
resulting cluster (Norušis, 2011, p. 373).
The centroid method uses the distance between the centroid for the cases in
Cluster A and the centroid for cases in Cluster B to measure dissimilarity. The
distance between two clusters is the sum of distances between cluster means for
all of the variables. When a new cluster is formed, the new centroid is a weighted
combination of the two clusters that have been merged (Norušis, 2011, p. 373).
Median clustering is similar to the centroid method but there is no weighting of
the combination of centroids when clusters are merged (Norušis, 2011, p. 373).
Finally, Ward’s method measures dissimilarity between clusters in relation to
the “loss of information” or increase in the error sum of squares by joining two
clusters (Aldenderfer & Blashfield, 1984, p. 43). In practice, the choice of similar-
ity measure usually has only minor consequences for applications in applied lin-
guistics. As noted earlier,Ward’s method is most commonly used, and we illustrate
its application next.
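For readers who want to see the logic behind Ward's criterion rather than the software dialogs, it can be sketched in a few lines of Python. This is an illustrative toy implementation under invented data (real analyses should use SPSS, R, or a comparable package): at each stage, the two clusters whose union gives the smallest increase in the error sum of squares are merged.

```python
# Minimal agglomerative clustering with Ward's criterion: merge the pair
# of clusters whose union yields the smallest increase in within-cluster
# sum of squares. Illustrative only; O(n^3) and not for real data sets.

def sse(points):
    """Within-cluster sum of squared deviations from the centroid."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))

def ward_cluster(points, n_clusters):
    clusters = [[p] for p in points]          # start: every case is a cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                increase = (sse(clusters[i] + clusters[j])
                            - sse(clusters[i]) - sse(clusters[j]))
                if best is None or increase < best[0]:
                    best = (increase, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters

# Two well-separated groups of invented 2-D cases
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
for c in ward_cluster(data, 2):
    print(sorted(c))
```

The increase computed at each merge is exactly the quantity Ward's method minimizes, and its running total corresponds to the coefficient SPSS reports in the agglomeration schedule.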
Cluster Analysis 253
variables. As Figure 11.10 shows, there are a number of options for transforming
variables. Z scores are one common method of standardization. In our case study,
the variables have already been transformed (using z scores) for the factor analysis,
and thus we do not need to standardize.
Click Continue to exit this menu, and then OK to run the HCA function.
For this, we examine the agglomeration schedule in the output. The agglom-
eration schedule generally displays the cases or clusters combined at each stage,
the distances between the clusters being combined (the coefficients column, our
main focus), and the next stage at which the cluster joins another cluster. Note
that when using Ward’s method, the coefficient is actually the within-cluster sum
of squares at that step. That is why the values may be much larger than those
found for other measures. Figure 11.12 shows a truncated version of the agglom-
eration schedule from the SPSS output. Note that the total number of stages cor-
responds to one less than the number of cases in the data set.
The agglomeration schedule shows the step-by-step output for clustering
cases. As noted, the procedure begins with each case representing a separate clus-
ter. At Stage 1, two of these cases (Case 887 and Case 894) are clustered together.
The resulting within-cluster sum of squares is .035. Neither of the two cases has been previously clustered, so the “stage cluster first appears” value is 0 for both clusters.
FIGURE 11.12 Truncated agglomeration schedule for 947 cases in the data set
TABLE 11.1 Reformatted fusion coefficients for final six clusters formed
[Graph: distance between fusion coefficients (y-axis, 0 to 60,000) plotted against the number of clusters (x-axis, 1 to 6).]
The graph in Figure 11.13 can be interpreted much like a scree plot used in
factor analysis, in that we are looking for the number of clusters where the dif-
ference in coefficients starts to flatten out. As discussed earlier, this flattening out
indicates that not much new information is gained by adding more clusters. In
the present study, this flattening out occurs at the point at which three clusters
are created. However, this measure is only one indication of the optimal num-
ber of clusters. The next step is thus to investigate the information gained by a
four-cluster solution and the information lost by a two-cluster solution, to deter-
mine the optimal number of clusters for interpretation.
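The “flattening out” heuristic can be sketched numerically: take the last few fusion coefficients from the linkage matrix and look for the largest jump between successive values. A toy example with deterministic, made-up data containing three well-separated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: three well-separated groups of ten cases each,
# scored on four dimensions, with a little deterministic jitter.
X = np.array([[c + 0.01 * j] * 4 for c in (-3.0, 0.0, 3.0) for j in range(10)])

Z = linkage(X, method="ward")

# The last six merge distances correspond to moving from seven clusters
# down to one; the largest jump between successive coefficients marks
# the merge at which distinct groups start being forced together.
coef = Z[-6:, 2]
diffs = np.diff(coef)
n_clusters = len(coef) - int(np.argmax(diffs))
print(n_clusters)  # 3 for this toy data
```

This is only a rough automation of the visual inspection described above; as the chapter stresses, the jump in coefficients is one piece of evidence, not a decision rule.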
columns called CLU4_1, CLU3_1, and CLU2_1. Each column presents data for
a different cluster solution, providing the cluster membership for each of the cases
in that solution. For example, the column CLU4_1 identifies cluster membership
for the four-cluster solution.
260 Shelley Staples and Douglas Biber
Notice the case highlighted by the arrow in Figure 11.16. We can see that,
depending on the cluster solution, a particular case may fall into different clus-
ter memberships. Reading left to right, in the four-cluster solution (CLU4_1),
the case was placed into Cluster 4; in the three-cluster solution (CLU3_1), it
was grouped into Cluster 3; and in the two-cluster solution (CLU2_1) it is in
Cluster 1.
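The membership columns that SPSS saves can be reproduced by cutting the clustering tree at different numbers of clusters. A sketch with made-up scores, using SciPy’s `fcluster` to mirror the CLU4_1, CLU3_1, and CLU2_1 variables:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 4))  # hypothetical dimension scores

Z = linkage(X, method="ward")

# One membership column per cluster solution, mirroring SPSS's
# CLU4_1, CLU3_1, and CLU2_1 variables.
members = pd.DataFrame(
    {f"CLU{k}_1": fcluster(Z, t=k, criterion="maxclust") for k in (4, 3, 2)}
)
print(members.head())
```

Because all three solutions come from cutting the same tree, reading a row left to right shows how a given case’s membership changes as clusters are merged, just as in the SPSS output discussed above.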
Note that SPSS automatically adds a label for each of these three additional
variables, and all three are labeled “Ward Method” (since we used that method for
all three solutions). However, this labeling will be confusing in our output, so we
recommend renaming the labels to reflect the new variable names. We relabeled
our variables “2-cluster solution,” “3-cluster solution,” and “4-cluster solution”
(see Figure 11.17).
convenient way to achieve these goals is to run a one-way ANOVA for each of
the cluster solutions.
In SPSS, select Analyze > Compare Means > One-Way ANOVA. To
analyze the linguistic characteristics of the clusters in the two-cluster solution,
choose the four factors for the Dependent List, and choose 2-cluster solution as the
independent categorical variable (the Factor in the ANOVA). This will allow us
to compare the mean scores of the four factor scores for each of the two clusters
(see Figure 11.18).
Under Options, select Descriptives so we can see the mean differences in the
factor scores (the dependent variables) according to the cluster categories (see
Figure 11.19).
Click Continue, then OK.
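The same comparison can be run outside SPSS. A minimal sketch using `scipy.stats.f_oneway` with made-up Factor 1 scores for the texts in each of two hypothetical clusters:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Hypothetical Factor 1 scores, grouped by two-cluster membership.
factor1_cluster1 = rng.normal(0.5, 1.0, size=40)
factor1_cluster2 = rng.normal(-0.5, 1.0, size=40)

# One-way ANOVA with cluster membership as the grouping factor,
# paralleling SPSS's Analyze > Compare Means > One-Way ANOVA.
F, p = f_oneway(factor1_cluster1, factor1_cluster2)
print(F, p)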
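The same comparison can be run outside SPSS. A minimal sketch using `scipy.stats.f_oneway` with made-up Factor 1 scores for the texts in each of two hypothetical clusters:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Hypothetical Factor 1 scores, grouped by two-cluster membership.
factor1_cluster1 = rng.normal(0.5, 1.0, size=40)
factor1_cluster2 = rng.normal(-0.5, 1.0, size=40)

# One-way ANOVA with cluster membership as the grouping factor,
# paralleling SPSS's Analyze > Compare Means > One-Way ANOVA.
F, p = f_oneway(factor1_cluster1, factor1_cluster2)
print(F, p)
```

For each of the four factors, one such ANOVA is run per cluster solution; with only two groups, the F statistic is equivalent to the square of an independent-samples t statistic.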
In Table 11.2, we see that there are significant mean differences for all four of
the factors in the two-cluster solution.
The descriptive statistics and means plots (not shown) also indicate that Fac-
tors 1 and 2 are significantly higher for Cluster 1 than for Cluster 2 while Factors
3 and 4 are significantly higher for Cluster 2 than for Cluster 1. The interpreta-
tion of this trend can be found in Case Study 2, which provides a summary of the
linguistic findings from the study conducted for this analysis.
The same procedure is followed for the three- and four-cluster solutions. The
only change in the procedure is to select the variable 3-cluster solution for the
three-cluster solution and the variable 4-cluster solution for the four-cluster solution.
For the three-cluster solution, we again find that the mean differences in Fac-
tor scores are significantly different, as shown in Table 11.3. The table also shows
that the mean scores are different for each of the three clusters, except for Factor 3,
for which Clusters 1 and 2 show similar scores. The specific details and interpre-
tation of these trends based on the linguistic variables in the particular factors is
discussed in Case Study 2. However, we can see that the three-cluster solution
differentiates the cases further than found in the two-cluster solution.
TABLE 11.2 Means and standard deviations for the two-cluster solution
Factor Cluster N M SD
TABLE 11.3 Means and standard deviations for the three-cluster solution
Factor Cluster N M SD
Similarly, for the four-cluster solution, the mean differences in Factor scores
are significantly different for the four clusters, as shown in Table 11.4. That table
also shows that the mean scores are different for each of the four clusters. The
specific details and interpretation of these trends based on the linguistic variables
within the particular factors is discussed in Case Study 2. However, we can see
that the four-cluster solution differentiates the cases further than found in the
three-cluster solution.
TABLE 11.4 Means and standard deviations for the four-cluster solution
Factor Cluster N M SD
that this process is not straightforward but relies on the interpretation of the
researcher. To do this, we can use any of the a priori categorical variables available
in the data set (the outside criterion variables), to see how they correspond to the
new categories determined by the cluster analysis. Thus, in the present study, we
can investigate the correspondence between cluster membership and the original
categorical variables of task type and proficiency score level.
We will first look at the relation between cluster membership and task type
(independent or integrated). Using the Crosstabs function, you should select
Analyze > Descriptive Statistics > Crosstabs and select 2-cluster solution as the row and
task type as the column.
The resulting output for the two-cluster solution shows us that the indepen-
dent tasks (ind) are fairly evenly split between the two clusters while Cluster 2
contains predominantly integrated tasks (int) (see Figure 11.21).
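The same crosstabulation can be reproduced outside SPSS with `pandas.crosstab`. A small sketch with made-up memberships and task labels for eight texts:

```python
import pandas as pd

# Hypothetical cluster memberships and task types for eight texts.
df = pd.DataFrame({
    "cluster2": [1, 1, 1, 2, 2, 2, 1, 2],
    "task":     ["ind", "ind", "int", "int", "int", "int", "ind", "int"],
})

# Equivalent of SPSS's Crosstabs: rows = cluster, columns = task type,
# with row/column totals added as margins.
table = pd.crosstab(df["cluster2"], df["task"], margins=True)
print(table)
```

Reading across a row shows how the texts in one cluster distribute over the task types, which is exactly the comparison made in Figures 11.21 through 11.23.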
In the three-cluster solution (Figure 11.22), independent tasks are still divided
up in the same way between the first two clusters. However, the integrated tasks
that had been grouped on the first cluster (in the two-cluster solution) have now
been split, with 145 of those texts now comprising the new third cluster.
The four-cluster solution (see Figure 11.23) mostly affects the composition of
the first cluster in the three-cluster solution. The integrated tasks on that cluster
are now split, so that 229 of those texts comprise the new Cluster 3. In the result-
ing solution, there are now two clusters consisting mostly of two different types of
independent task texts, and two clusters consisting mostly of two different types
of integrated task texts.
FIGURE 11.21 Cluster membership by task type for the two-cluster solution
FIGURE 11.22 Cluster membership by task type for the three-cluster solution
FIGURE 11.23 Cluster membership by task type for the four-cluster solution
The same techniques can be used to investigate the relationship between clus-
ter membership and proficiency level (measured in terms of TOEFL iBT score
levels: 1.0–5.0 with .5 increments).
Figure 11.24 shows that there is no clear relationship between score level and
cluster membership for the two-cluster solution. Both the lower scorers (1.0–1.5)
and the higher scorers (3.5–5.0) are grouped primarily in Cluster 1 (in fact, every
score level falls predominantly into Cluster 1). The low-to-middle score levels
show a somewhat greater proportion of membership in Cluster 2, but the difference
is not very meaningful. The same lack of pattern can be seen for the three- and
four-cluster solutions, as Figures 11.25 and 11.26 show.
In the final analysis, you should keep in mind that HCA is an exploratory
technique. Plotting the fusion coefficients is the first step to determining the
number of clusters that you will select for interpretation. However, the goal of the
analysis is to uncover groups and patterns that had not been previously anticipated
(rather than hypothesis testing). For this reason, it is important to investigate a
range of cluster solutions, choosing the one that is most informative. Two types
of descriptive information are especially useful for this purpose: investigating the
composition of each cluster (i.e., the cases that have been grouped into the clus-
ters), and investigating the mean scores of the predictor variables for each cluster
(in this example study, the means of the linguistic dimension scores).
In some cases, the composition of two clusters might appear to be very similar
with respect to external criteria, but the cluster analysis shows that they are dis-
tinct groups in terms of the predictor variables. For example, in the four-cluster
solution examined earlier, Cluster 2 and Cluster 3 are fairly similar in their com-
position based on outside criterion variables—see Figure 11.23 and Figure 11.26.
However, it turns out that these two clusters are distinct in terms of their perfor-
mance on particular predictor variables: Cluster 2 had quite low scores on Factor 1
while Cluster 3 had more moderate scores (see Table 11.4). Thus, the four-cluster
solution might be the most informative one for our exploratory purposes, even
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Total
2-cluster solution 1 38 27 58 70 106 111 98 88 89 685
2 2 7 29 36 83 38 30 27 10 262
Total 40 34 87 106 189 149 128 115 99 947
FIGURE 11.24 Cluster membership by score level for the two-cluster solution
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Total
3-cluster solution 1 32 23 45 58 50 94 71 67 60 540
2 2 7 29 36 83 38 30 27 10 262
3 6 4 13 12 16 17 27 21 29 145
Total 40 34 87 106 189 149 128 115 99 947
FIGURE 11.25 Cluster membership by score level for the three-cluster solution
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Total
4-cluster solution 1 29 22 32 31 47 45 28 26 23 283
2 2 7 29 36 83 38 30 27 10 262
3 3 1 13 27 43 49 43 41 37 257
4 6 4 13 12 16 17 27 21 29 145
Total 40 34 87 106 189 149 128 115 99 947
FIGURE 11.26 Cluster membership by score level for the four-cluster solution
SAMPLE STUDY 1
Yamamori, K., Isoda, T., Hiromori, T., & Oxford, R. (2003). Using cluster analysis to
uncover L2 learner differences in strategy use, will to learn, and achievement over
time. International Review of Applied Linguistics, 41, 381–409.
Background
Yamamori et al. (2003) investigate the strategies and motivational profiles of
groups of learners in relation to their language achievement. As they indicate
in their study, the motivation to investigate achievement in this way stems
from previous research suggesting “there can be more than one route to
success in L2 learning” (p. 382).
Method
The data in this study consisted of survey and achievement test scores from
81 Japanese beginning learners of English as a foreign language, all in seventh
grade. A total of nine measures (three measures, each collected at the end of
three consecutive terms) were used in a cluster analysis to group participants.
The three measures included (1) a strategy inventory consisting
of five Likert scale survey items (e.g., “I use the dictionary”); (2) a measure of
the will to learn captured by four Likert scale survey items (e.g., “I want to be
good at English”); and (3) end-of-term achievement test scores.
SAMPLE STUDY 2
Examining linguistic profiles of L2 writing based on task type and proficiency
level
Background
Previous research on L2 writing has examined the relationship of profi-
ciency level and task type with linguistic characteristics used by L2 writers
using ANOVA and mixed factorial models, among other statistical analyses
(e.g., Biber & Gray, 2013; Cumming et al., 2006; Way, Joiner, & Seaman,
2000). While relationships have been shown between proficiency level and
linguistic features as well as task type and linguistic features, a great deal
of variation in the use of linguistic features within proficiency level in par-
ticular has been noted (e.g., Biber & Gray, 2013). In addition, Jarvis et al.
(2003) show that high proficiency learners may have multiple linguistic
profiles. Thus, cluster analysis was identified as a useful approach to explore
variability in the use of linguistic features in relation to proficiency level and
task type.
Method
This study uses the same data described in the “Procedures for Conducting
Cluster Analysis.” We examined data from 947 responses to writing prompts
on the TOEFL iBT. This data had previously been analyzed for relationships
between linguistic features, task type (independent vs. integrated), and
score level on the iBT (see Biber & Gray, 2013; Biber, Gray, & Staples, 2014).
Although relationships were found between both task type and proficiency
level and linguistic features, it was also revealed that there was variation in
the use of linguistic features across these two domains. Because linguistic
features are known to co-occur and correlate statistically with each other, a
wide range of linguistic features (e.g., personal pronouns, dependent clause
types) were first subjected to a factor analysis to identify underlying dimen-
sions of language use (see Biber & Gray, 2013 for a description of the linguis-
tic features included in the factor analysis).
Four dimensions of language use were identified from the factor analy-
sis: (1) literate versus oral responses (e.g., higher use of nouns vs. higher
use of verbs); (2) information source: text vs. personal experience (e.g.,
third-person pronouns vs. first- and second-person pronouns); (3) abstract
opinion vs. concrete description/summary (e.g., nominalizations vs. concrete
nouns); and (4) personal narration (e.g., higher use of past-tense verbs).
The standardized dimension scores for each of these four dimensions were
used to cluster the texts.
Tools and Resources
Crawley, M. J. (2007). Tree models. In The R book. Chichester, UK: John Wiley &
Sons, Ltd.
Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis (4th ed.). Chichester, UK: John
Wiley & Sons, Ltd.
Gries, S. Th. (2006). Exploring variability within and between corpora: Some methodo-
logical considerations. Corpora, 1(2), 109–151.
Hair, J. F., & Black, W. C. (2000). Cluster analysis. In L. G. Grimm & P. R. Yarnold, Reading
and understanding more multivariate statistics. Washington, DC: American Psychological
Association.
Johnson, R. A. & Wichern, D. W. (2007). Applied multivariate statistical analysis (6th ed.).
Chapter 12: Clustering, distance methods, and ordination. Upper Saddle River, NJ:
Pearson Education.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster
analysis. New York: John Wiley & Sons.
Lorr, M. (1983). Cluster analysis for the social sciences. San Francisco, CA: Jossey-Bass.
Further Reading
Gayle, G. (1984). Effective second-language teaching styles. The Canadian Modern Lan-
guage Review, 40(4), 525–541.
Hayes, E. (1989). Hispanic adults and ESL programs: Barriers to participation. TESOL
Quarterly, 23(1), 47–63.
Hill, D. (1992). Cluster analysis and the interlanguage lexicon. Edinburgh Working Papers in
Applied Linguistics, 3, 67–77.
Huang, H. T. (2010). How does second language vocabulary grow over time? A multi-methodological
study of incremental vocabulary knowledge development. Unpublished dissertation. Univer-
sity of Hawai’i, Manoa, HI.
Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness
and judgments of language learner proficiency in oral English. The Modern Language
Journal, 94(4), 554–566.
Lee, J. (2012). The implications of choosing a type of quantitative analysis in interlan-
guage research. Linguistic Research, 29(1), 157–172.
Philp, J. (2009). Pathways to proficiency: Learning experiences and attainment in implicit
and explicit knowledge of English as a Second Language. In R. Ellis, S. Loewen, C.
Elder, R. Erlam, J. Philp, & H. Reinders, Implicit and explicit knowledge in second language
learning, testing, and teaching (pp. 194–215). Tonawanda, NY: Multilingual Matters.
Ranta, L. (2002). The role of learners’ language analytic ability in the communica-
tive classroom. In P. Robinson, Individual differences and instructed language learning
(pp. 159–180). Philadelphia: John Benjamins.
Rysiewicz, J. (2008). Cognitive profiles of (un)successful FL learners: A cluster analytical
study. The Modern Language Journal, 92(1), 87–99.
Shochi, T., Rillard, A., Auberge, V., & Erickson, D. (2009). Intercultural perception of
English, French, and Japanese social affective prosody. In S. Hancil, The role of prosody
in affective speech (pp. 31–60). New York: Peter Lang.
Uchikoshi, U., & Marinova-Todd, S. (2012). Language proficiency and early literacy skills
of Cantonese-speaking English language learners in the U.S. and Canada. Reading and
Writing: An Interdisciplinary Journal, 25, 2107–2129.
Yashima, T. & Zenuk-Nishide, L. (2008). The impact of learning contexts on proficiency,
attitudes, and L2 communication: Creating an imagined international community.
System, 36, 566–585.
Discussion Questions
1. We have emphasized the importance of researcher expertise in making sense
of cluster analytic output and results. Is cluster analysis unique in this regard?
Why? Why not?
2. This chapter has shown that the process of carrying out a cluster analysis
often involves the use of other statistical analyses. Explain in your own words
how the following analyses might be used in conjunction with cluster analy-
sis: ANOVA, data transformation, cross-tabs (or chi-square), factor analysis,
discriminant function analysis, correlation. Now examine a few of the cluster
analytic studies listed under Further Reading. Which analyses did they use
along with their cluster analysis? To what ends?
3. Other than the example studies described in this chapter, what types of
research questions or situations can you think of in which cluster analysis
might be a useful approach?
4. Cluster analysis is often contrasted with both factor analysis (see Loewen &
Gonulal, Chapter 9 in this volume) and discriminant function analysis (see
Norris, Chapter 13 in this volume). In what ways are these two procedures
similar to cluster analysis? In what ways are they different?
5. The authors of Sample Study 1 recommend different types of strategy
instruction based on the four distinct learner profiles or clusters obtained
through their analysis. What kinds of interventions do you think might be
most effective with each group? Can you think of other cases where the
results of a cluster analysis could inform L2 pedagogy, assessment, or policy?
Notes
1. Another approach involves combining HCA and K-means clustering. First, HCA is
used on a smaller sample of the data, to determine the optimal number of clusters, and
then the researcher runs a K-means analysis on the full data set, specifying that number
of clusters.
2. Cluster analysis can also be used to group together variables instead of cases; examples
of this type include Kang, Rubin, & Pickering (2010) and Lee (2012).
3. In SAS, it is also possible to produce goodness of fit measures, which can be used in the
process of deciding on a “stopping” point for cluster solutions.
References
Aldenderfer, M. S. & Blashfield, R. K. (1984). Cluster analysis. Thousand Oaks, CA: Sage.
Biber, D. (1989). A typology of English texts. Linguistics, 27, 3–43.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Chapter 9: Reg-
isters and text types in English and Somali. Cambridge: Cambridge University Press.
Biber, D. (2008). Corpus-based analyses of discourse: Dimensions of variation in con-
versation. In V. K. Bhatia, J. Flowerdew, & R.H. Jones. Advances in discourse studies
(pp. 100–114). New York: Routledge.
Biber, D., & Finegan, E. (1989). Styles of stance in English: Lexical and grammatical mark-
ing of evidentiality and affect. Text, 9, 93–124.
Biber, D., & Gray, B. (2013). Discourse characteristics of writing and speaking responses on the
TOEFL iBT. Princeton, NJ: Educational Testing Service.
Biber, D., Gray, B., & Staples, S. (2014, advance access). Predicting patterns of grammatical
complexity across textual task types and proficiency levels. Applied Linguistics.
Csizer, K., & Dörnyei, Z. (2005). Language learners’ motivational profiles and their moti-
vated learning behavior. Language Learning, 55(4), 613–659.
Introduction
The use of Rasch measurement in second language (L2) research has grown sig-
nificantly in the past decade, in particular in the area of language testing (see, e.g.,
McNamara & Knoch, 2012). The current chapter introduces the basic concepts
of Rasch analysis. It will start by providing the conceptual motivation for using
techniques from the Rasch family of models and then provide a guide on how to
use four different Rasch models: the simple Rasch model, the rating scale model,
the partial credit model and the many-facet Rasch model. Readers will learn how
to choose the most appropriate model and how to interpret key output tables
from Rasch analyses. At the end of the chapter, we will describe some of the tools
and resources available as well as further readings on the topic of Rasch analysis.
Background
Why Rasch?
The Rasch family of models, a subset of a larger group of models known as item
response theory (IRT), is becoming more popular as a way of analyzing data col-
lected in L2 research. Rasch analysis found its way into L2 research through its
gradual adoption by language testers (McNamara & Knoch 2012) and has since
spread into other areas of the field. For language testers in particular, this approach
to measurement has provided a powerful new way of generalizing from a person’s
performance on a test to statements of underlying ability.
There are several reasons why Rasch analysis is appealing to researchers
involved in L2 research. For language testers, for example, Rasch provides a more
powerful way of analyzing test data than can be achieved by using more traditional
276 Ute Knoch and Tim McNamara
techniques such as those provided by classical test theory (CTT) (see, e.g.,
Eckes, 2011; Wright, 1992). Both sets of techniques (CTT and IRT) are used to
analyze test data to gain a thorough understanding of the performance of test
items, the ability of test takers, and the performance of the measurement instru-
ment as a whole. The data used to analyze the test or instrument commonly come
from a specific population of learners from a certain context. In the case of a CTT
analysis, results might differ if the test or instrument is administered to a different
group of learners and therefore need to be interpreted differently. However, IRT
and Rasch analyses take this sample dependency into account. The models enable
an estimate of test takers’ underlying ability on the basis of their performance on
a particular set of items by making allowance for the difficulty of items and how
well they are matched with the candidates’ ability. The crucial element here is
how items are related to candidate ability, which is not the case in CTT. This dif-
ference between CTT and Rasch (and all IRT models) has been compared to the
difference between descriptive and inferential statistics (e.g., McNamara, 1996) as
the results from a Rasch analysis can be generalized beyond the sample.
Another benefit lies in the fact that the Rasch model can be applied to a wide
variety of data types. While the simple Rasch model could only be used to analyze
dichotomously scored items (e.g., multiple-choice items), extensions of this model
developed in the late 1970s and early 1980s could also handle data from polytomous items,
semantic differential scales, rating scales, as well as data scored by human raters (e.g.,
in the assessment of speaking and writing). When writing or speaking assessment
data is analyzed using the Rasch model, the system can provide powerful estimates
of rater quality, which has been very helpful for language assessors, in particular since
the increased interest in collecting performance data following the communicative
movement in the early 1980s. A further data type that lends itself to a Rasch analy-
sis is that of questionnaires that are usually analyzed using more traditional meth-
ods (including reporting descriptive statistics and making use of factor analyses, see
Loewen & Gonulal, Chapter 9 in this volume). As we will see in this chapter, Rasch
analysis offers a powerful new way of analyzing such data. Further, for L2 researchers
interested in learner progress or development, Rasch analysis also makes it possible
to define the ability of learners on a single ability scale that links all tasks and learn-
ers. In this way, progress can be shown and different preexisting scales can be linked.
In sum, Rasch analysis offers a powerful, comprehensive way to analyze a
variety of data types and can be used to answer a variety of questions posed in L2
research. Rasch analysis is also rather forgiving in its data requirements and can
handle missing data relatively well, which is a major advantage in the real world
of research.
Likert scale questionnaires). As is the case with the rating scale model, the partial
credit model assumes that all raters are applying the scale in the same way. This
problem was addressed in a further development, the many-facet Rasch model.
TABLE 12.1 Data type, response formats, Rasch models, and programs
The data design for a Rasch analysis offers more flexibility than designs
required for a CTT analysis: Crossed, nested, and mixed designs as well as missing
data can be accommodated (see, e.g., Schumacker, 1999 for a detailed discussion).
A further useful discussion of the data requirements for a Rasch analysis in terms
of sample size can be found in Linacre (1994).
TABLE 12.2 Data input format for analyses not involving multiple raters
If each learner’s performance has been rated by more than one rater, the data
should be set out as in the example in Table 12.3. Here, the ratings for each rater
are listed in separate rows.
TABLE 12.3 Data input format for analyses involving multiple raters
Opening Winsteps
1. Select the Winsteps icon on your desktop.
2. Close the smaller Winsteps Welcome window.
3. Copy the person labels under the red heading “Person Label Variables” and
copy the item variables under the heading “Item Response Variables.” Your
window should now look something like this:
Rasch Analysis 283
4. Locate and double-click your Winsteps input file (usually a .txt file).
5. Press the Enter key twice.
Once a data file has been created, this can be read into Facets. Click the Facets
icon on the computer to start the program. Then, select Files > Specification
File Name? and choose the Facets input file from the location where it was saved.
Then click Open and OK.
Note: M = mean; S = one standard deviation from the mean; T = two standard deviations from the mean
The items are ordered from the most difficult item at the top (Item 9) to the
easiest at the bottom (Item 10). Candidates with more reading ability are located
near the top of the figure while less able test takers are shown near the bottom.
As the test takers and the reading test items are pictured on the same scale, the
logit scale, it is now possible to make direct comparisons. A test taker placed at
the same logit value as an item has a 50% chance of answering that item cor-
rectly. Test takers mapped higher than an item have a higher than 50% chance of
answering the item correctly. Those mapped lower have less chance of answering
the item correctly (see Wright and Linacre, 1991 for an exact logit to probability
conversion table). The logit scale has the further advantage that it is an interval
scale. Therefore, not only does it tell us that one item is harder than another or
that one candidate is more able than another, but it also gives us a measurement
of how much that difference is.
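The 50% relationship and the logit-to-probability conversion can be made concrete with the dichotomous Rasch model’s response function (a sketch; the ability and difficulty values are made up):

```python
import math

def p_correct(ability, difficulty):
    """Rasch probability of a correct response; both values in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(p_correct(0.0, 0.0))             # person at the item's level: 0.5
print(round(p_correct(1.0, 0.0), 2))   # one logit above the item: 0.73
print(round(p_correct(-1.0, 0.0), 2))  # one logit below the item: 0.27
```

Because only the difference between ability and difficulty enters the formula, persons and items can be placed on the same logit scale, which is exactly what the Wright map displays.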
Apart from descriptive observations about our measurement instrument
(including which students are the most and least able, and which items are the
most and least difficult), the Wright map can provide us with information about
(1) item coverage (i.e., whether there are sufficient items to match the different
ability levels of our students), (2) each individual student’s probability of success
on certain items, and (3) whether the overall difficulty of the items matches the
ability of the test takers and vice versa.
As we will see in Sample Study 1 (Malabonga, Kenyon, Carlo, August, & Lou-
guit, 2008), the authors used the Wright map to guide their evaluation of item
coverage across difficulty levels. Following the pilot administration of the cognate
awareness measure (CAT), they added a group of easier items to more adequately
match the students’ ability.
Person Statistics
The output from a Rasch analysis also provides us with estimates of person ability
and person fit as exemplified for our reading data set in Table 12.4.
The table lists all the test takers in order of ability (Catherine is the most able
and Tim the least able student). It also provides us with their raw score (total
score), the number of items they attempted (total count), their position on the
logit scale (measure), and the standard error associated with this measure (Model
S.E.). The standard error for our data set is large because it is based on a very small
sample (for both items and test takers).
A further feature of a Rasch analysis (which cannot be found in the output of
a CTT analysis) is fit statistics. Rasch analysis is based on a probabilistic model.
It proceeds by comparing expected and observed responses of test takers. Once
complete, the best estimates of person ability (as can be seen in Table 12.4) and
item difficulty (Table 12.5) are displayed. The extent to which the prediction and
observation match is shown in the fit statistics. For both test takers and items,
three types of fit can be found: (1) appropriate fit (the pattern identified by the
program is within a normal range, i.e., as expected), (2) misfit (the pattern does
not correspond to the expected pattern in that it is less predictable), and (3) over-
fit (the patterns found by the program are too predictable).
Appropriate fit values (expressed in Table 12.4 as MNSQ [mean-square])
generally range from 0.8 to 1.3 (McNamara, 1996). (These values can also be
expressed in terms of the normal distribution as z-statistics, where the acceptable
range is +2 to –2). Fit can be calculated in two ways: using all the data, including
outliers (Outfit); or using trimmed data, with the outliers removed (Infit). Infit is
usually preferred. Person fit provides us with the ability to examine whether the
ability of a learner can be defined in the same terms as the ability of others in
the group. If a person is identified as misfitting, it means that his or her ability
has not been captured well by our instrument. For an accessible description of
the differences between the different measures of fit, please refer to McNamara
(1996) or Green (2013); Eckes (2011) provides a discussion of fit in a many-facet
Rasch analysis.
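The distinction between outfit and infit can be sketched with the standard mean-square formulas for dichotomous data (the ability and difficulty values below are made up; real programs such as Winsteps estimate them from the responses):

```python
import numpy as np

def fit_mnsq(responses, abilities, difficulties):
    """Outfit and infit mean-squares per item for dichotomous Rasch data."""
    theta = np.asarray(abilities, dtype=float)[:, None]
    b = np.asarray(difficulties, dtype=float)[None, :]
    p = 1.0 / (1.0 + np.exp(-(theta - b)))        # expected scores
    var = p * (1.0 - p)                           # response variance
    resid2 = (np.asarray(responses, dtype=float) - p) ** 2
    outfit = (resid2 / var).mean(axis=0)          # unweighted: outlier-sensitive
    infit = resid2.sum(axis=0) / var.sum(axis=0)  # variance-weighted
    return outfit, infit

# Two persons (abilities +1 and -1 logits), two items (both at 0 logits).
# The responses match expectation exactly, so both statistics fall below
# the 0.8 lower bound: the data are too predictable (overfit).
outfit, infit = fit_mnsq([[1, 1], [0, 0]], [1.0, -1.0], [0.0, 0.0])
print(outfit.round(2), infit.round(2))
```

In this symmetric toy case the two statistics coincide; they diverge when a response set contains surprising answers far from a person’s level, which inflate outfit more than infit.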
Item Statistics
A further piece of output from a Rasch analysis is a table that provides estimates
of item difficulty and item fit, as can be seen in Table 12.5 for our data set. These
indices mirror the data in Table 12.4 but for items rather than test takers.
The items in this table are arranged according to their position on the
logit scale (measure column), which indicates the degree of difficulty of each
item. In the case of our data set, Item 9 is the most challenging and Item 10
the easiest. We can also see how many test takers answered each item correctly
(total score) and how many attempted each item (total count). As was the case
with the person statistics reported in Table 12.4, the item statistics table also
indicates the standard error relating to each item measure (Model S.E.). These
are unusually high because of the small sample size of this data set. There are
again two columns of fit statistics reported. As was the case with the person
statistics, these can be categorized into three groups: (1) those displaying very high positive values, which are therefore misfitting; (2) those in the middle range, showing appropriate fit values; and (3) those with very low values, which are therefore categorized as overfitting. Misfitting items are ones where the patterns of responses from the test takers do not follow predictions: some able students may have answered the item incorrectly even though they were predicted to answer it correctly, or some less able test takers may have answered it correctly. These items do not add much to our measurement instrument, as they create unwanted noise, and should be revised or discarded.
Item overfit is less of a concern. A detailed discussion of item fit statistics can
be found in McNamara (1996).
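The link between an item's logit measure and the scores test takers obtain can be illustrated with a small sketch. The difficulties and abilities below are hypothetical, chosen only to echo the pattern just described (Item 9 hardest, Item 10 easiest): the model-implied expected total score falls as the item's logit difficulty rises.

```python
import math

def p_correct(theta, b):
    """Dichotomous Rasch model: P(correct) for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical logit measures (higher = harder) and person abilities.
difficulties = {"Item 9": 1.8, "Item 5": 0.2, "Item 10": -2.1}
abilities = [-1.0, -0.3, 0.4, 1.1]

# Model-implied "total score" per item: the expected number of correct answers.
expected_total = {item: sum(p_correct(th, b) for th in abilities)
                  for item, b in difficulties.items()}

# Sorting by expected total puts the hardest item first and the easiest last.
ranked = sorted(expected_total, key=expected_total.get)
```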
SAMPLE STUDY 1
Malabonga, V., Kenyon, D. M., Carlo, M., August, D., & Louguit, M. (2008). Development of a cognate awareness measure for Spanish-speaking English language learners. Language Testing, 25(4), 495–519.
Results
(1) Pilot administration: The results of the Winsteps analysis following the pilot administration showed that the items were not perfectly matched to the children’s ability in that the mean difficulty of the
items was well above the mean ability of the test takers, although
there was an even spread of cognate and noncognate items along the
logit scale. The analysis also identified three items as misfitting. Item difficulties barely varied whether analyzed in the whole data set or in separate sections of cognates and noncognates. Following the analysis,
the authors deleted the misfitting items and added some easier items to
the test.
(2) Operational administration of CAT (fourth grade): The results from the
Winsteps analysis showed an even spread of the two item types along
the scale. The Wright map showed that the items were much better
matched to the children’s ability than the trial instrument but that the
mean difficulty of items was still higher than the mean ability of stu-
dents. The authors argued that this is acceptable as the CAT is designed
for both fourth and fifth graders. The overall fit of the items was accept-
able with 96% of the items fitting the Rasch model.
(3) Operational administration of CAT (fifth grade, one year later): The
Rasch analysis showed that the children’s knowledge of English vocabu-
lary and in particular cognates had increased. The Rasch map showed
that the mean ability of the children was slightly higher than the mean
difficulty level of the test items (reversing the situation in the previous
year). The findings showed that the CAT is of appropriate difficulty for
both fourth and fifth graders: 90.4% of the items fit the Rasch model
and items that were identified as misfitting were usually among the
most difficult.
An item analyzed with the rating scale or partial credit model has more than one possible score point.
An example of such an item can be seen in Table 12.6, which is an extract from
the data collected on a listening test with a number of testlets that attracted more
than one score point. Item 3 in this table has five score points, ranging from 0 to
4. We can see the number (and percentage) of test takers who scored each of these
points in the data count column. A powerful feature of a Rasch analysis is that it
also provides us with information about the average ability of the students at each
score point (i.e., the average location or measure of the students with a certain
score on this item). We expect students who score lower to have less ability and
that the average ability level advances with each score point. In our analysis, this
was generally the case (students at score point 0 for Item 3 were of the lowest
average ability, –1.10) and this ability slowly increased as the score point increased.
However, the one student achieving the highest score point, 4, was no more able
than those achieving 3. This is probably an artifact of the artificially small data
set we are using here. Items in which the average ability does not increase with
increasing score points might need revision.
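The average measure at each score point is simply the mean ability of the test takers who obtained that score. A minimal sketch, with hypothetical logit measures chosen to reproduce the kind of reversal just described:

```python
from collections import defaultdict

def average_measure_by_score(scores, measures):
    """Mean person measure (logit) at each observed score point; in a
    well-functioning item these should rise monotonically."""
    by_score = defaultdict(list)
    for s, m in zip(scores, measures):
        by_score[s].append(m)
    return {s: sum(ms) / len(ms) for s, ms in sorted(by_score.items())}

# Hypothetical data for one 0-4 testlet; the lone score-4 student sits below
# the score-3 students -- the kind of reversal a tiny sample can produce.
scores = [0, 0, 1, 2, 2, 3, 3, 4]
measures = [-1.3, -0.9, -0.4, 0.1, 0.3, 0.8, 1.1, 0.9]
averages = average_measure_by_score(scores, measures)
```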
The information in Table 12.6 is available for each individual item and for the
entire data set, as can be seen in Table 12.7. Here, not only the average ability of the test takers at each score point is shown, but also the Andrich thresholds (also known as step difficulties, tau, or delta). These indicate the point on the ability scale where it is equally
likely that someone of this ability would achieve either of the adjacent score
points. This information can be used in the process of rating scale development
and/or revision as it provides useful information about the width of different scale
categories that can be used when descriptors are refined or revised.
TABLE 12.6 Sample item measurement report for partial credit data

The information in Table 12.6 can also be represented visually as shown in Figure 12.2 (these are known as category characteristic curves). The x-axis indicates the average measure (ability) while the y-axis shows the probability of a response. It can be seen that as the candidate ability increases, the score increases. At the lowest end of ability, as the average measure increases, a score of 0 becomes less and less probable. A score of 1 is only likely at a very narrow band of ability, while a score of 2 is matched to a much broader band of ability. The higher peak of score 2 also indicates that this score is the most probable. The category characteristic curves show visually whether any of the rating scale categories are wider than others or whether any of them are never the most probable. This information might lead a test developer to revise the wording of the descriptors, for example, or to collapse (or expand) scale categories.
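The Andrich thresholds can be made concrete with a small partial-credit-model sketch (hypothetical thresholds for a 0–3 item). Threshold k is defined so that, at exactly that ability, score points k − 1 and k are equally probable:

```python
import math

def pcm_category_probs(theta, thresholds):
    """Partial credit model category probabilities for one item. The k-th
    Andrich threshold is the ability at which score points k-1 and k are
    equally likely."""
    logits, run = [0.0], 0.0
    for delta in thresholds:
        run += theta - delta
        logits.append(run)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical thresholds for a 0-3 item; evaluating at the second threshold
# (0.2) makes score points 1 and 2 equiprobable.
probs = pcm_category_probs(0.2, [-1.5, 0.2, 1.8])
```

Sweeping theta over a range of abilities and plotting each category's probability reproduces the category characteristic curves of Figure 12.2.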
There are two Rasch models that can be used to analyze data in partial credit
or rating scale format when only one rater/marker is involved. The rating scale
model can be used if all items have the same structure and number of score points,
while a partial credit model can be used in all other instances. We will now look
at a special case of rating scale analyses: questionnaires.
Questionnaire Analysis
One of the data types commonly used in L2 research comes from questionnaires. However, as mentioned earlier in this chapter, researchers often do not use
a Rasch analysis to analyze this type of data even though it offers more powerful
tools than other traditional analysis techniques. In this section, we briefly explain
what a Rasch analysis can offer to researchers administering questionnaires.
Imagine we are using a questionnaire to measure a certain construct such as
motivation to learn languages, L2 anxiety, or willingness to communicate (see for
example Sample Study 2). We use a questionnaire with Likert scale items and
administer this to a group of learners. A Rasch analysis can provide us with some
powerful information about our measure. The Wright map can show how well our items are able to tap into the construct (as a whole) or whether some of them are easier to endorse than others (i.e., whether respondents are more likely to select “strongly agree” or “agree” for certain items than for others). The fit statistics can show us whether any of the items are misfitting (i.e., not measuring the overall construct in line with the other items) or whether our items form a unidimensional measurement of the construct. Dimensionality can be established by examining the residuals of the data with a principal components analysis to determine whether there is a common factor that explains the residuals (and points to a multidimensional underlying latent measure; see Loewen & Gonulal, Chapter 9 in this volume) or whether the residuals are just random noise (Linacre, 1998). We can also gather information about the different step difficulties for each item (in the case of Likert scale questions, this can indicate the distance between different response categories—for example, whether the step between “strongly disagree” and “disagree” is much wider than that between two other adjacent categories). Finally, we can examine the category characteristic curves to see whether any of the Likert scale categories are not providing useful information for our measurement (e.g., it might be possible that the category “neutral” is subsumed under other categories). For a detailed
account of using Rasch analysis to analyze questionnaires, refer to Bond and Fox
(2007). Sample Study 2 is an example of how L2 researchers using questionnaires
can make use of Rasch techniques to investigate the quality of their instruments.
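The residual-based dimensionality check can be sketched as follows. This assumes numpy and uses random data purely for illustration; the logic (no dominant component in the residuals means they are essentially noise, while a large first eigenvalue hints at a secondary dimension) follows Linacre (1998), though the exact cut-off varies across sources.

```python
import numpy as np

def residual_pca_eigenvalues(observed, expected):
    """Eigenvalues of the item-by-item correlation matrix of Rasch residuals.

    Random-noise residuals give no dominant component; a large first
    eigenvalue suggests a secondary dimension in the instrument."""
    resid = observed - expected              # persons x items residual matrix
    corr = np.corrcoef(resid, rowvar=False)  # item x item residual correlations
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# Purely illustrative: "observed" responses vs. a flat model expectation.
rng = np.random.default_rng(0)
observed = rng.random((50, 8))
expected = np.full((50, 8), 0.5)
eigenvalues = residual_pca_eigenvalues(observed, expected)
# The eigenvalues sum to the number of items (the trace of the correlation matrix).
```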
SAMPLE STUDY 2
Weaver, C. (2005). Using the Rasch model to develop a measure of second lan-
guage learners’ willingness to communicate within a language classroom. Journal
of Applied Measurement, 6(4), 396–415.
Background
This study set out to investigate the psychometric properties of a questionnaire designed for an L2 research project on willingness to communicate (WTC).
Rasch Analysis 295
Research Questions
1. How does the rating scale model differ from the partial credit model in
reflecting students’ responses to the WTC questionnaire?
2. How well do the questionnaire items define a useful range of students’
willingness to speak and write English inside the language classroom?
3. How well do the writing and speaking items perform to create a unidi-
mensional measure of students’ WTC in English?
4. How well does the questionnaire’s four-point Likert scale reflect meas-
ureable differences in students’ WTC in English?
Methods
A total of 500 students (232 first year and 268 second year university stu-
dents in an English as a foreign language environment) completed a 34-item
questionnaire designed to measure WTC in both speaking and writing.
Each item was designed in a four-point Likert scale format: 1. Definitely not
willing; 2. Probably not willing; 3. Probably willing; and 4. Definitely willing.
To answer the first research question, the author compared the results of the
analyses using the rating scale model and the partial credit model to evaluate
whether the item structure differed for the different items or whether they
could equally all be modeled together. To answer the second question, the
item fit statistics and the item difficulty of the questionnaire were scrutinized.
To answer the third question, the author investigated the unidimensionality
of the questionnaire by examining the residuals with a principal components
analysis. To answer the fourth question, Weaver undertook a variety of analy-
ses focused on rating scale functioning as outlined by Linacre (1999).
Results
The comparison of the rating scale and partial credit model analyses showed
that the category thresholds were largely consistent across the two models.
Therefore the use of the more parsimonious model, the rating scale model,
is supported. The questionnaire was also found to define a useful range of
students’ WTC. The two groups of items focusing on the respondents’ WTC
in speaking and writing could be distinguished by the analysis of the residu-
als, but Weaver was also able to show that they worked together to form
the larger construct of willingness to communicate. Finally, the monotoni-
cally increasing step difficulties of the four-point scale showed that the scale
worked well to define students’ WTC.
296 Ute Knoch and Tim McNamara
Total  Total  Obsvd  Fair-M                 Infit        Outfit       Estim.  Corr.  Exact Agree.
Score  Count  Avg.   Avg.   Measure  S.E.   MnSq  ZStd   MnSq  ZStd   Discrm  PtBis  Obs %  Exp %  N   Rater
 333    68    4.90   5.09    –.48    .25    1.11    .7   1.14    .8    .86    .48    44.1   56.8   1   1
 317    64    4.95   5.11    –.59    .26    1.23   1.3   1.23   1.2    .72    .41    45.3   56.5   2   2
 428    88    4.86   4.97     .11    .22     .98    .0    .96   –.1   1.03    .50    45.0   59.0   3   3
 434    88    4.93   4.93     .30    .22     .78  –1.5    .79  –1.3   1.25    .50    53.4   58.1   4   4
 384    76    5.05   5.10    –.51    .25    1.04    .2   1.01    .0    .96    .43    53.9   58.4   5   5
 528   104    5.04   4.94     .23    .21     .94   –.4    .98    .0   1.05    .45    59.5   58.3   6   6
 573   112    5.12   4.99    –.01    .20     .75  –2.1    .70  –2.3   1.31    .52    56.7   57.8   7   7
 441    88    5.01   4.94     .28    .22    1.08    .5   1.10    .6    .91    .32    44.3   57.4   8   8
 277    56    4.95   4.97     .11    .27    1.08    .4   1.12    .6    .89    .15    41.7   56.6   9   9
 280    56    4.92   4.88     .55    .29    1.20   1.0   1.23   1.1    .80    .35    40.4   55.4  10  10
 399.5  80.0  4.97   4.99     .00    .24    1.02    .0   1.03    .1           .41    Mean (Count: 10)
  95.4  18.3   .07    .08     .37    .03     .15   1.1    .17   1.1           .11    S.D. (Population)
 100.6  19.3   .08    .08     .39    .03     .16   1.1    .18   1.1           .11    S.D. (Sample)
raters. Raters with very high fit statistics (usually infit mean-square values above 1.3) are considered to be misfitting. That means that their rating patterns do not fall within the range that the program predicts. This usually points to raters
who are rating inconsistently. These raters are not adding meaningful informa-
tion to the measurement of these students and should therefore be required to
undergo standardization training. Raters with very low infit mean-square values
(usually with infit mean-square values below 0.8) are rating more predictably than
the program predicts. This could point to raters who are overusing certain band
levels on the rating scale and therefore not displaying the kind of expected varia-
tion across test takers. A detailed, accessible discussion of the influence of different raters on raw scores can be found in McNamara (1996, Chapter 5).
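The screening logic described above reduces to a simple classification by infit mean-square. The rater values below are hypothetical, loosely echoing a Facets rater measurement report:

```python
def classify_raters(infit_mnsq, low=0.8, high=1.3):
    """Group raters by infit mean-square: misfitting (inconsistent), fitting,
    or overfitting (too predictable, e.g., overusing central band levels)."""
    out = {"misfit": [], "fit": [], "overfit": []}
    for rater, mnsq in infit_mnsq.items():
        if mnsq > high:
            out["misfit"].append(rater)
        elif mnsq < low:
            out["overfit"].append(rater)
        else:
            out["fit"].append(rater)
    return out

# Hypothetical infit mean-squares for five raters.
report = {"R1": 1.11, "R2": 1.42, "R4": 0.78, "R7": 0.75, "R8": 1.08}
groups = classify_raters(report)
# R2 would be referred for standardization training; R4 and R7 rate too predictably.
```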
Facets also produces a reliability index as part of the rater measurement report.
It is important to note that this is not interpreted in the same manner as tradi-
tional rater reliability indices. The Rasch reliability index on the rater measure-
ment report is interpreted in the opposite way, in that low reliability indices are
desirable. These indicate that the raters are rating with very similar degrees of severity.
The many-facet Rasch model also reports a score for each test taker that
takes into account the different facets in an analysis. For example, if the analysis
has identified that a test taker was rated by harsh raters or encountered difficult
tasks, this is accounted for in the “Fair-M Average.”
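The idea behind the fair average can be sketched with a many-facet rating scale model: the expected rating is recomputed with the rater severity term set to zero (average severity). All logit values below are hypothetical, and this illustrates the principle rather than the Facets computation, which works on the estimated measures themselves:

```python
import math

def expected_score(theta, difficulty, severity, thresholds):
    """Expected rating under a many-facet rating scale model, where the logit
    for step k is theta - difficulty - severity - tau_k."""
    logits, run = [0.0], 0.0
    for tau in thresholds:
        run += theta - difficulty - severity - tau
        logits.append(run)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return sum(k * e / total for k, e in enumerate(exps))

thresholds = [-1.0, 0.0, 1.0]   # hypothetical steps for a 0-3 rating scale
theta, task_difficulty = 0.5, 0.2

observed_like = expected_score(theta, task_difficulty, severity=0.8, thresholds=thresholds)
fair_like = expected_score(theta, task_difficulty, severity=0.0, thresholds=thresholds)
# A harsh rater (positive severity) depresses the expected rating; the "fair"
# value removes that effect, as the Fair-M Average does on the raw-score scale.
```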
Finally, the many-facet Rasch model also makes it possible to model interactions between different facets. For example, it is possible to explore whether particular raters have certain patterns of interaction with certain rating criteria (e.g.,
always rating more harshly than expected when assessing content) or whether the
background of the students influences the rater’s assessment. This is called a bias
analysis and was one aspect investigated by the authors in Sample Study 3 (Elder,
Knoch, Barkhuizen, & von Randow, 2005).
SAMPLE STUDY 3
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback
to enhance rater training: Does it work? Language Assessment Quarterly, 2(3),
175–196.
Background
This study set out to investigate whether providing raters with detailed indi-
vidualized feedback on their rating performance is effective. The purpose of
the feedback was to enhance the reliability of scores on a writing assessment
for undergraduate students in an English-medium university.
Research questions
1. Does individualized feedback reduce interrater differences in overall
severity?
2. Does individualized feedback make individual raters more internally
consistent in their judgments?
3. Does individualized feedback reduce individual biases in relation to the
scoring of particular categories on the rating scale?
Method
Eight experienced writing raters rated 50 writing samples each. The research-
ers then used a many-facet Rasch analysis to generate individualized feed-
back profiles, which included feedback on the raters’ (1) relative severity in
relation to the group of raters, (2) overall consistency, and (3) patterns of
bias in relation to particular rating scale categories. The raters then rated a
further 60 writing samples. A subsequent Rasch analysis was undertaken to
investigate whether the feedback helped raters to rate more consistently,
reduce any patterns of harshness or leniency, and reduce individual biases in
relation to scale criteria.
Results
The results showed that some raters were able to successfully take on the
feedback in their subsequent ratings but that there was large variation
among raters in terms of their receptiveness. The raters varied less in terms of
their severity in the post-feedback rating round, but this was at the expense
of the overall discrimination power of the test. The authors therefore argued
that the costs of implementing this rather labor-intensive feedback may outweigh the benefits.
Conclusion
Rasch analysis has enormous potential to be used in L2 research. It has a number
of strengths: Its estimates of the characteristics of subjects relevant to the research
are likely to be robust and stable, as they factor in the quality of the data on which
they are based; it allows the linking of separate measurement instruments (e.g.,
tests) so that “before” and “after” testing is not subject to the idiosyncrasies of
the tests used in each case, and test familiarity effects are avoided; and it allows
detailed analysis of the impact of the quality of judges or raters and other aspects
of the data-gathering setting on measures used in the research. The examples
presented in this chapter show some of the range of research questions that can be addressed.
For a full list of possible Rasch analyses software, please refer to the Rasch
Measurement Analysis Software Directory (http://www.rasch.org/software.htm).
The different options are listed in a helpful table that outlines where they can be
obtained, whether they are free, and which models they support. Many of the programs offer free trial or student versions.
Other Resources
Further useful information about Rasch analysis and answers to questions can be
obtained by joining a discussion list. The two most well-known listservs are the
Mathilda Bay Club (http://www2.wu-wien.ac.at/marketing/mbc/mbc.html)
and the Rasch listserv hosted by the Australian Council of Educational Research
(http://mailinglist.acer.edu.au/mailman/listinfo/rasch). A Facebook group aimed at Rasch measurement is also available (http://www.facebook.com/groups/
raschmeasurement). For up-to-date research articles using Rasch measurement,
we recommend the Rasch Measurement Transactions (http://www.rasch.org/
rmt/contents.htm), which is the official newsletter of the Rasch Measurement
Special Interest Group (http://www.raschsig.org/). The Institute of Objective
Measurement offers a useful website that summarizes Rasch-friendly journals (i.e.,
journals publishing research using Rasch analysis), upcoming conferences, book
titles and much more information relating to Rasch analysis (http://rasch.org/).
Further Reading
• Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement
in the human sciences. An accessible, detailed introduction to the Rasch model.
The book does not use examples from L2 research but covers a broad range
of issues useful for practitioners and researchers in our field.
• McNamara, T. (1996). Measuring second language performance. This book was
the first introduction of the Rasch model to L2 research. It is a very detailed
and accessible step-by-step guide on how to interpret the different aspects of
a Rasch analysis, although the main focus is on the many-facet model. This
book is now out of print, but a scanned copy can be obtained free of charge
on Tim McNamara’s website (http://languages-linguistics.unimelb.edu.au/
academic-staff/tim-mcnamara).
• Green, R. (2013). Statistical analyses for language testers. This book includes
screenshots and step-by-step instructions on how to conduct and interpret a
Rasch analysis. This book is very accessible to complete beginners and could
be used as a starting block for further reading about Rasch.
• Eckes, T. (2011). Introduction to many-facet Rasch measurement. This volume
focuses entirely on the many-facet Rasch model. It provides detailed infor-
mation on how to interpret the output tables and also covers some more
advanced topics.
Discussion Questions
1. Choose a number of L2 research studies that have made use of Rasch analysis.
a. Is it clear which Rasch model was used in the analysis?
b. Are the research questions clearly stated and answerable?
c. Are the analyses described in a clear, replicable manner?
d. Are the results presented clearly?
2. In some research designs, subjects have to be tested before and after treat-
ment. It is not usually advisable to use the same test again, because of test
familiarity effects. How does Rasch analysis help get around this problem?
3. One of the differences between CTT and Rasch analysis is that the latter
factors in the quality of the data used to estimate person characteristics, item
difficulties, rater qualities, and so on. In what aspects of the output is there
evidence of this feature of Rasch analysis?
4. Read the sample data set, which can be downloaded from this book’s companion website (http://oak.ucc.nau.edu/ldp3/AQMSLR.html), into Winsteps, using the procedures described in the chapter.
a. What information can you learn from the Wright map?
b. Is the spread of test items well suited to the test takers?
c. Are there any items that are misfitting or overfitting?
d. Is there any information given by the Rasch analysis that you could not
readily learn from an analysis using classical test theory?
Notes
1. A discussion of the two methods can be found in Linacre (1999).
References
Andrich, D. (1978a). A general form of Rasch’s extended logistic model for partial credit
scoring. Applied Measurement in Education, 4, 363–378.
Andrich, D. (1978b). A rating scale formulation for ordered response categories. Psy-
chometrika, 43, 561–573.
Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. New York: Routledge.
Cunningham, T. H., & Graham, C. R. (2000). Increasing native English vocabulary rec-
ognition through Spanish: Cognate transfer from foreign to first language. Journal of
Educational Psychology, 92, 37–49.
Eckes, T. (2011). Introduction to many-facet Rasch measurement. Frankfurt: Peter Lang.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to
enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175–196.
Green, R. (2013). Statistical analyses for language testers. New York: Palgrave.
Iramaneerat, C., Smith, E. V., & Smith, R. M. (2008). An introduction to Rasch measurement. In J. Osborne (Ed.), Best practices in quantitative methods. Thousand Oaks, CA: Sage.
upon such findings by creating a means for estimating future group member-
ship as efficiently and reliably as possible on the basis of a set of measurable
phenomena.
Discrim is a unique application of the multivariate ANOVA (MANOVA)
family, falling within the general linear model approach to inferential statistics
(see Plonsky, Chapter 1 in this volume). It essentially turns MANOVA around
and treats the independent variable (a single grouping factor of some kind) as
the criterion or dependent variable, and the dependent variables (a set of interval
measures of various phenomena) as the predictor or independent variables. The
terminology can therefore become somewhat confusing when moving between
MANOVA and Discrim, so care must be taken to keep in mind precisely what is
being referred to by labels like “independent” or “dependent” variable. Discrim
is most meaningful when applied to naturally occurring groups that are mutually
exclusive and exhaustive of the phenomenon of interest (e.g., individuals that
belong to only one global proficiency level, or texts that can only be classified a
priori as one type of genre or another). Where groups are artificially or arbitrarily
created (e.g., by separating cases above and below the median value of a given
measure), Discrim is typically less effective and/or more difficult to interpret (and
alternative techniques that search for groups, rather than investigate predictability of group membership, may be more appropriate, such as factor analysis; see Loewen & Gonulal, Chapter 9 in this volume).
Where Discrim is used to distinguish cases belonging to a grouping factor
with only two levels, such as first-language (L1) versus second-language (L2)
speakers, it is mathematically identical to MANOVA, with the added benefit of
identifying proportions of cases accurately and inaccurately classified as one or the
other group as well as which variables are best able to do so. However, Discrim is
potentially much more interesting when analyzing membership of cases in more
than two groups. Here, Discrim works by combining the measured variables into
functions (similar to factors in factor analysis), which are essentially new latent
(unobservable) variables based on linear combinations of observable phenomena.
Functions are created mathematically by weighting the contribution of each pre-
dictor variable (based on its correlation with the grouping variable) in different
ways, and then looking at which combination of weighted values is the most dis-
criminatory between the different groups. With multiple groups (i.e., more than
two), there may be multiple ways of weighting and combining measured variables
in order to distinguish between different pairings/sets of groups: It is possible, even
likely, that different combinations of measures are more discriminatory between
certain groups than other groups. Thus, in the CEFR example earlier, it may
be that one function that more heavily weights certain measures (perhaps basic
syntax and pronunciation) is better at discriminating between lower proficiency
levels (A1, A2) whereas another function, emphasizing other measures (say, lexi-
cal variety, morphological accuracy, and fluency), may be better at discriminating
Discriminant Analysis 307
among higher proficiency levels (B2, C1, C2). Luckily, as the math involved can
become extremely complicated, with multiple measures and multiple groups (see
brief math demonstration in Tabachnick & Fidell, 2013), statistical software appli-
cations like SPSS automatically do all of the math for us, so the only real challenge
in applying Discrim is to make sure that it is set up and interpreted correctly.
Discrim applications also provide very useful tables and figures in the output that
help the researcher focus directly on the most important findings.
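The mechanics of combining weighted measures into group-separating functions can be sketched with a minimal hand-rolled linear discriminant classifier (numpy only; equal priors assumed; simulated data for three hypothetical proficiency groups). In practice one would rely on SPSS, R, or a library implementation rather than this sketch:

```python
import numpy as np

def lda_fit(X, y):
    """Minimal linear discriminant classifier: class means plus pooled
    within-group covariance (equal priors assumed)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                 for c in classes) / (len(y) - len(classes))
    return classes, means, np.linalg.inv(pooled)

def lda_predict(model, X):
    """Assign each case to the group whose linear score function is highest."""
    classes, means, inv = model
    scores = np.column_stack([
        X @ inv @ means[c] - 0.5 * means[c] @ inv @ means[c] for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]

# Three hypothetical proficiency groups measured on two predictors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in ([0, 0], [2, 0], [1, 2])])
y = np.repeat(["low", "mid", "high"], 30)

model = lda_fit(X, y)
accuracy = float((lda_predict(model, X) == y).mean())
```

The proportion of cases correctly reclassified (here, `accuracy`) is the kind of figure a Discrim output table reports for each group.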
Note that Discrim is also similar to logistic regression (LR) and cluster analysis
(CA). A major difference with LR is that Discrim adopts stricter assumptions
regarding the normality of variables in data sets and within the population of
interest, while LR makes no distributional assumptions at all about predictor
variables or the linearity of relationships with criterion variables. LR is therefore
much more flexible, but also less powerful, depending on the qualities of the data.
A major difference with CA is that, in Discrim, the number and definition of
groups into which membership is being predicted is known a priori, whereas in
CA the number of predictable “clusters” or groups is not known until the analysis
is completed.
Though one of the lesser utilized statistical approaches within applied linguis-
tics research, Discrim has featured sporadically across multiple domains of inquiry.
For example, in language testing research, Discrim has been used to investigate
the elements of rating scales and rubrics that distinguish between hierarchical
levels of oral test performance (e.g., Fulcher, 1996; Norris, 1996), to examine the
accuracy of criterion-referenced testing and pass-fail decisions (e.g., Robinson &
Ross, 1996), and to explore test-method effects (e.g., Zheng, Cheng, & Klinger,
2007). In L2 composition research, questions regarding which features of writing
performance (e.g., syntactic, lexical, discoursal) are best able to distinguish among
holistically rated higher or lower compositions have been addressed through
Discrim (e.g., Ferris, 1994; Homburg, 1984; Oh, 2006; Perkins, 1983). Reading
researchers have employed Discrim to investigate effects of lexical transfer and
reading processes on comprehension (e.g., Koda, 1989; Nassaji, 2003).
Perhaps the most frequent application of Discrim within applied linguistics
research has come from corpus linguistics. Within this broad domain, research-
ers have utilized Discrim to investigate writing quality (e.g., Crossley & McNa-
mara, 2009; McNamara, Crossley, & McCarthy, 2010), genre identification (e.g.,
Martín Martín, 2003), register variation (e.g., Biber, 2003), the deployment of
specific grammatical phenomena in language use (such as particle placement,
Gries, 2003), and L2 learner production (e.g., Collentine, 2004; Collentine &
Collentine, 2013), to name a few examples. Discrim has also featured from time to
time in research on L2 interactional strategies (e.g., Rost & Ross, 1991), anxiety
effects on learning (e.g., Ganschow & Sparks, 1996), mother tongue maintenance
(e.g., Okamura-Bichard, 1985), language impairment (e.g., Gutiérrez-Clellen &
Simon-Cereijido, 2006), motivation and personality relations with proficiency
308 John M. Norris
(e.g., Brown, Robson, & Rosenkjar, 2001), and the effectiveness of self-instruction
(e.g., Jones, 1998).
Variables
Discrim begins with the identification of grouping and predictor variables.
A grouping variable typically takes the form of a categorical factor of some kind,
often the causal variable that has been operationalized in a study and already ana-
lyzed within MANOVA as an independent variable. Grouping variables can have
two or more levels, and each case in the analysis must belong to one and only
one level of the grouping variable. For example, grouping variables might be text
types (argumentative, narrative, etc.), experimental groups (explicit, implicit, con-
trol, etc.), proficiency levels (low, medium, high, etc.), and so on. In our example
study, Davis (2012) identified three groups of foreign language (FL) programs
based on the extent to which they self-reported low, medium, or high degrees of
using and learning from outcomes assessment. Note that, although the programs
rated their degrees of use on a rating scale, the nature of the scale categories
was essentially categorical: respondents self-identified their programs according
to how much they utilized or learned from doing outcomes assessment, from low
to high (i.e., the distinctions between the levels of the grouping variable are by
no means arbitrary). “Level of learning from assessment use,” then, is the grouping
variable for this study, and membership in one of these three levels is what we will
try to predict on the basis of a set of measures.
Predictor variables consist of interval-scale measures of whatever phenomena
will be used in attempting to classify individual cases according to their mem-
bership in the levels of the grouping variable. Predictor variables typically come
from the measures that have been operationalized and analyzed in MANOVA as
dependent variables; the purpose of Discrim is to investigate how these measures predict membership in the levels of the grouping variable.
Assumptions
Prior to initiating Discrim, the standard assumptions for multivariate statistical
analyses should be checked in the full data set. Essentially, the assumptions for
Discrim are the same as they are for MANOVA: independence of observations
on each variable, univariate and multivariate normality, homogeneity of variance
and covariance, and no evidence of multicollinearity. As these assumptions are
discussed elsewhere (e.g., Jeon, Chapter 7 in this volume; Tabachnick & Fidell,
2013), I will not describe them in detail here. Suffice it to state that violation of
these assumptions will affect the quality and reliability of the Discrim analysis, so
they should be taken seriously. Where violations are encountered, steps should be
taken to select alternative appropriate analyses or adjust data such that the analysis
is not threatened. However, I will make mention of three assumptions that are
particularly important for Discrim in that they may have undue effect on the
outcomes of the analysis: (a) Outliers may exert considerable influence, especially
at lower sample sizes, so care should be taken to inspect (graphically) the distribu-
tions of cases on each predictor measure for each level of the grouping variable.
Where identified, serious outliers should be eliminated or their scores adjusted.
(b) Sufficient sample size is also critical for interpreting a Discrim analysis; as a
310 John M. Norris
rule of thumb, at a minimum the smallest group sample size must contain more
cases than the total number of predictor variables to even begin to trust the solu-
tion from the Discrim analysis; more typically, a criterion of 20 cases per predictor
is adopted to avoid problems of overfitting the model with too many predictors.
Finally, (c) multicollinearity should be considered carefully prior to entering pre-
dictor variables into the analysis; high correlations (a typical criterion is r > .70)
between predictor variables reduce the power of the analysis and may confuse the
determination of discriminant functions due to superfluous variables. Where high
correlations are identified, a single marker variable should be selected and other
correlated variables eliminated from the analysis (note that the issue of determin-
ing the magnitude of correlation that should be considered “too much” overlap
is far from resolved in black-and-white terms; see Tabachnick & Fidell, 2013, for
discussion).
In our example study, no severe violations of assumptions were found on the
data set of 90 FL program survey respondents. In particular, significant outliers
were not identified for any of the predictor measures within any of the three
groups, and none of the measures correlated with any other measure at higher
than r = .60. The smallest group sample size, n = 20 for the “low” assessment use
level, was higher than the total number of predictors (n = 9).
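These screening steps can be sketched in code. The following is a minimal Python sketch (not part of the chapter's SPSS workflow; the function name and return keys are illustrative assumptions) that checks the pairwise-correlation criterion and the two sample-size rules of thumb on a predictor matrix:

```python
import numpy as np

def screen_predictors(X, groups, r_max=0.70, cases_per_predictor=20):
    """Screen predictors against the Discrim assumptions discussed above.

    X      : (n_cases, n_predictors) array of interval-scale measures
    groups : (n_cases,) array of group labels
    """
    n_cases, n_pred = X.shape
    corr = np.corrcoef(X, rowvar=False)
    pairs = [(i, j, float(corr[i, j]))
             for i in range(n_pred) for j in range(i + 1, n_pred)]
    smallest = min(int(np.sum(groups == g)) for g in np.unique(groups))
    return {
        "max_abs_r": max(abs(r) for _, _, r in pairs),
        # Pairs exceeding the multicollinearity criterion (e.g., r > .70)
        "collinear_pairs": [p for p in pairs if abs(p[2]) > r_max],
        # Minimum requirement: smallest group exceeds the predictor count
        "min_rule_ok": smallest > n_pred,
        # Stricter rule of thumb: roughly 20 cases per predictor overall
        "strict_rule_ok": n_cases >= cases_per_predictor * n_pred,
    }
```

Applied to data shaped like the example study (90 cases, 9 predictors, groups of 20, 28, and 42), the minimum sample-size rule would pass, although the stricter 20-cases-per-predictor criterion would not.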
In the SPSS dialog, we select the measures that will be used to predict group membership and move them into the Independents window (see Figure 13.3).
Figure 13.3).
In SPSS we are also prompted at this point to choose a particular approach to
entering the data in the statistical analysis (also shown in Figure 13.3). Note that it
is possible to analyze the predictor measures in a particular stepwise or sequential
order (e.g., the most important or statistically strongest first, followed by others
after that one is factored out); however, we need a theoretical/logical reason to do
so. In this case, in the absence of any particular reason to look at the effects of one
measure first and others in a particular order, we are going to treat all measures
equally (and that is typically the case for Discrim in L2 research). That means we
will select Enter independents together to enter them all at once, with no particular
order. This approach is also known as Direct Discriminant Analysis.
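Outside of SPSS, a direct discriminant analysis of this kind can be approximated with scikit-learn's LinearDiscriminantAnalysis, which likewise enters all predictors at once. The sketch below uses simulated stand-in data (the survey data themselves are not reproduced here, and the group-shift construction is an illustrative assumption):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative stand-in for the 90-program survey data: X holds nine
# assessment-capacity measures, y the Processuse level (1, 2, 3).
rng = np.random.default_rng(42)
y = np.repeat([1, 2, 3], [20, 28, 42])
X = rng.normal(size=(90, 9)) + y[:, None] * 0.8  # group-shifted scores

# "Enter independents together": all predictors enter at once (direct Discrim)
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# With three groups, at most two discriminant functions are extracted
scores = lda.transform(X)
print(scores.shape)  # (90, 2)
# Proportion of between-group variance carried by each function
print(lda.explained_variance_ratio_)
```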
The next step is to select the statistics that we want to calculate for the Dis-
crim analysis. In SPSS, we click on the Statistics button and a new window pops
open to display a variety of possibilities (see Figure 13.4). What we choose here
depends somewhat on the nature of the data and our approach to analyzing it
(see discussion in Huberty & Olejnik, 2006), but a basic approach will suffice
for most situations. Several options here are quite useful. First, selecting Means
and Univariate ANOVAs will give us the descriptive statistics and ANOVAs for
each group (1, 2, 3) on each of the nine measures. Second, selecting Box’s M will
give us a test of the homogeneity of variance-covariance, which is helpful in
determining whether the MANOVA inferential test is trustworthy or not. Third,
selecting Fisher’s under Function Coefficient will provide us with the newly cal-
culated average value for each group on each measure based on the newly created
the same number of cases within the smallest group as we have predictor variables,
so we will not conduct a cross-validation on this data set. Also, if we are really
interested in how each specific case was classified (accurately or not), then we can
select Casewise results, which will produce a table with each of the individual cases
and the group that it was predicted to be in (based on the combined measures), as
well as the group that it was actually in (based on the grouping variable). Finally,
under Plots we can also ask for graphs that help us to conceptualize the overall
analysis. In our case we will select Combined-groups, which will show us on a single
graph how well the functions differentiate each group.
At this point, our Discrim is ready to be processed, so all we have to do is
click on OK. One additional note is in order at this point, prior to examining
the output of results: It is common practice to run multiple Discrim procedures
by varying the number of predictor variables entered into the analysis. While this
approach may increase Type I error overall, it is also quite useful for determining
how well different subsets of variables combine to predict group classifications.
I will not explore this approach in the current chapter, but I refer readers to
several of the resources in Further Reading for examples of L2 studies that have
done so.
Interpreting Output
Most statistical software applications, SPSS included, provide copious output for
Discrim in the form of tables and figures. Fundamentally, we are interested in
finding answers to questions such as: (a) How well was the set of measures, as a
whole, able to predict group membership? (b) Were some groups more highly
predictable than others? and (c) Which individual measures seem to be the best
predictors of group membership? While only portions of the output are particu-
larly useful for interpreting the main findings, it is helpful to understand the basics
of what all is calculated. The first substantive table produced in SPSS output is
called “Group Statistics.” This table shows the means and standard deviations for
each of the three groups (1, 2, 3) on each of the nine measures in our example
data set. We can skim through it and compare values on a given measure between
the three groups, and we should begin to see which measures may be the best at
predicting differences across the groups.
The next table shows the individual ANOVA results for each of the predictor
variables; in other words, it reports whether each measure on its own showed sta-
tistically significant differences across the levels of the grouping variable. Table 13.1
shows the output for our sample data. Here we see whether there is an overall
significant effect across the three groups (1, 2, and 3 on Processuse) for each of
the measures. Clearly, the answer is yes, as indicated by the very small p values in
the final column. Note that we do not know where the differences might be yet
(between which pairs of 1, 2, and 3), just that there is an overall significant effect
for each measure. Measures that do not show a significant effect here will not
contribute to the predictions later on in the analysis. Note here that Wilks’ lambda
and the F value are inversely related: the smaller the lambda, the higher the
F, and the greater the effect of the given measure. At this point, then, we should
be able to identify which of the measures is likely to be the strongest predictor of
group differences (i.e., the measure with the smallest lambda).
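A rough equivalent of this univariate screening table can be computed outside SPSS. The sketch below (illustrative, using scipy) runs a one-way ANOVA per predictor and also computes the univariate Wilks' lambda as within-group SS divided by total SS, so the inverse lambda-F relationship can be seen directly:

```python
import numpy as np
from scipy.stats import f_oneway

def univariate_screen(X, groups):
    """One-way ANOVA for each predictor across the grouping variable.

    Returns (F, p, lam), where lam is the univariate Wilks' lambda
    (within-group SS / total SS); smaller lambda goes with larger F.
    """
    labels = np.unique(groups)
    F, p, lam = [], [], []
    for col in X.T:
        parts = [col[groups == g] for g in labels]
        f_val, p_val = f_oneway(*parts)
        ss_within = sum(((part - part.mean()) ** 2).sum() for part in parts)
        ss_total = ((col - col.mean()) ** 2).sum()
        F.append(f_val); p.append(p_val); lam.append(ss_within / ss_total)
    return np.array(F), np.array(p), np.array(lam)
```

Because F is a monotone decreasing function of the univariate lambda, the predictor with the smallest lambda is always the one with the largest F.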
The next several tables of output help us to review the assumptions of the
multivariate family of analyses. A bivariate correlations table is provided, showing
the relationships between each pair of the nine predictor variables. In our exam-
ple data set, we find that the correlations are all positive and range from r = .14
to r = .60; thus, although there are obviously some overlapping relationships here,
none is strong enough to suggest multicollinearity, and hence all of the measures
can be included safely in the Discrim analysis. The next two tables show the Box's
M tests of equality of covariance matrices. The log determinants show a measure
of variability for the predictor measures combined in each group—here, we hope
that they are relatively similar, meaning that variability is not radically different on
the combined measures between each group. In Table 13.2, we see that the vari-
ability is indeed quite similar for our sample data. Box’s M then tests the signifi-
cance of that comparison of variability between the groups on the nine measures.
For our data, the test is not significant ( p = .248), and given the relatively similar
values in the log determinants output, it is safe to assume that covariance is not
overly heterogeneous.
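For readers working outside SPSS, Box's M can be computed by hand from the group covariance matrices. The sketch below implements the standard chi-square approximation (a simplification; SPSS reports an F approximation, so p values may differ slightly):

```python
import numpy as np
from scipy.stats import chi2

def box_m(X, groups):
    """Box's M test of equality of covariance matrices, with the standard
    chi-square approximation: per-group log determinants, the pooled log
    determinant, the M statistic, and an approximate p value."""
    labels = np.unique(groups)
    k, nvar = len(labels), X.shape[1]
    ns = np.array([np.sum(groups == g) for g in labels])
    covs = [np.cov(X[groups == g], rowvar=False) for g in labels]
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (ns.sum() - k)
    logdets = [np.linalg.slogdet(S)[1] for S in covs]
    pooled_logdet = np.linalg.slogdet(pooled)[1]
    # M compares the pooled log determinant with the group log determinants
    M = (ns.sum() - k) * pooled_logdet - sum(
        (n - 1) * ld for n, ld in zip(ns, logdets))
    # Correction factor for the chi-square approximation
    c = ((2 * nvar**2 + 3 * nvar - 1) / (6 * (nvar + 1) * (k - 1))
         * (np.sum(1.0 / (ns - 1)) - 1.0 / (ns.sum() - k)))
    df = nvar * (nvar + 1) * (k - 1) / 2
    return {"log_dets": logdets, "pooled_log_det": pooled_logdet,
            "M": float(M), "df": df, "p": float(chi2.sf(M * (1 - c), df))}
```

As in the SPSS output, similar per-group log determinants and a nonsignificant p value indicate that covariance heterogeneity is not a serious concern.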
With the subsequent output, the findings specific to Discrim begin in ear-
nest. First, we encounter information regarding the calculated discriminant func-
tions themselves. A function is a combination of measures used to predict group
membership (conceptually kind of like a factor in factor analysis, see Loewen &
Gonulal, Chapter 9 in this volume; and similar to a cluster in cluster analysis, see
Staples & Biber, Chapter 11 in this volume). Discrim tries to find the best linear
combination of measures to distinguish among groups. Table 13.3 shows that in
our example data set, since there are three groups, Discrim only tries to make
two distinctions (hence, two functions). Each function predicts a certain amount
of the total variance that can be accounted for in the three groups. For our data,
TABLE 13.2 Box's M output for testing homogeneity of covariance across three groups

Log Determinants

Processuse              Rank   Log Determinant
1                          9          –16.303
2                          9          –16.258
3                          9          –14.130
Pooled within-groups       9          –13.900

The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Test Results

Box's M          118.774
F    Approx.       1.097
     df1          90
     df2          11,575.999
     Sig.           .248

Tests null hypothesis of equal population covariance matrices.
TABLE 13.3 Eigenvalues and Wilks' lambda output for the discriminant functions

Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.441a       91.0             91.0          .768
2           .142a        9.0            100.0          .352

a. First two canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1 through 2           .359            85.059       18   .000
2                     .876            10.994        8   .202
the first function is doing the lion’s share of prediction (91%), with a correspond-
ingly large eigenvalue (a measure of the variance attributable to the function,
but not very interpretable in practical terms). The canonical correlation shows
that the first function is also quite highly correlated with the grouping variable
(Processuse).
Wilks’ lambda is a significance test of the overall ability of the functions to
identify differences between groups. Here again, the smaller the lambda, the
greater the predictive power. The lambda value actually shows the proportion of variance that the model cannot explain (so, with the combined Functions 1 and 2, a lambda of .359 means that around 64% of the variance is explained). Note that the first lambda test is of both
functions combined. The second test is for Function 2 alone, after the variance
attributable to Function 1 has been factored out. Here we see that Function 2
on its own does not significantly distinguish between the three groups. However,
it does add some additional discrimination (most likely, it discriminates between
two groups but not all three).
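The quantities in these two tables are linked by standard identities: each function's canonical correlation is the square root of eig / (1 + eig), and the overall Wilks' lambda is the product of 1 / (1 + eig) across functions. Plugging in the eigenvalues from the output above reproduces the reported values (the table's .352 simply reflects the unrounded eigenvalue):

```python
import math

# Eigenvalues for the two functions, from the output above
eigs = [1.441, 0.142]

# Canonical correlation of each function: r = sqrt(eig / (1 + eig))
canon = [math.sqrt(e / (1 + e)) for e in eigs]

# Overall Wilks' lambda for both functions: product of 1 / (1 + eig)
wilks = 1.0
for e in eigs:
    wilks *= 1 / (1 + e)

print([round(r, 3) for r in canon])  # [0.768, 0.353]
print(round(wilks, 3))               # 0.359
print(round(1 - wilks, 2))           # 0.64 -> variance accounted for
```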
The next several tables provide indications of the extent to which each pre-
dictor variable is related to each of the discriminant functions. Standardized
canonical function coefficients are just standardized values (like z-scores) for each
measure that are used to calculate the overall discriminant function. These are not
very interpretable at face value, but we can already start to see which measure
is contributing the most to each function. So, the larger the magnitude of the
coefficient, the more influence or weight that measure has on the function. The
structure matrix table (Table 13.4) is somewhat more interpretable. It shows the correlation between each measure and the two functions that have been created by the analysis (similar to 'loadings' in factor analysis). The * next to the correlations indicates the function that the particular measure is most highly correlated with.

TABLE 13.4 Relationship output for individual predictor variables and functions

Structure Matrix

            Function
              1        2
ActCondA    .883*    –.333
ColA        .697*     .361
ComA        .626*     .072
LeadA       .592*    –.040
InfraA      .499*    –.393
ProgSupA    .350*     .137
InstSupA    .299*     .152
CulEthoA    .527      .603*
InstGovA    .369     –.402*

*. Largest absolute correlation between each variable and any discriminant function

In Table 13.4, it is clear that most of the measures in our example data set
correlate with Function 1, but only about half are highly correlated (i.e., above
.50). For Function 2, only two measures correlate more highly with it than with Function 1. Also, if we
needed to pick a single measure to represent each function, it would be the mea-
sure that correlates most highly with it—we can refer to these as marker variables.
ActCondA would be a very good marker variable for the first function (very
highly correlated), while CulEthoA is a moderately strong marker variable for
Function 2.
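A structure matrix of this kind is just the set of correlations between each predictor and the case-level discriminant scores. A minimal sketch (using scikit-learn; the data here are simulated, not the chapter's survey data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def structure_matrix(X, y):
    """Correlations between each predictor and each discriminant function's
    scores, i.e., the kind of structure matrix shown in Table 13.4."""
    scores = LinearDiscriminantAnalysis().fit(X, y).transform(X)
    mat = np.empty((X.shape[1], scores.shape[1]))
    for i in range(X.shape[1]):
        for j in range(scores.shape[1]):
            mat[i, j] = np.corrcoef(X[:, i], scores[:, j])[0, 1]
    return mat  # row = predictor, column = function
```

The row with the largest absolute correlation in a given column identifies that function's marker variable.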
The final information that we receive about the functions themselves is a
table showing the mean values calculated for each group (Groups 1, 2, 3 from our
Processuse grouping variable) on each of the functions created by the analysis.
The values do not have any particularly interpretable meaning on their own;
however, if we compare the groups with each other, we can see how far apart they
are in terms of the functions. For Function 1 in our example data, the difference
between Groups 1 and 3 is 2.851 points, whereas for Function 2, the difference
between 1 and 2 (because they are the most different of the three groups) is only
.974. So, clearly Function 1 is much better at distinguishing between the groups.
The values from this table are represented graphically in Figure 13.6.
The third part of the Discrim output shows classification statistics in several
different tabular and graphic formats, arguably the most useful aspect of the analy-
sis. A first table (“Prior probabilities,” not shown here) just reminds us of the prob-
ability level that was used to estimate group membership. Recall for our example
data that we asked for probabilities based on sample size, so the table shows us that
sample size was used for estimating the size of each group. The next table shows
an overall average value calculated for each group on each measure, based on the
new “scale” of the discriminant functions created by the analysis (see Table 13.5).
[Figure 13.6 image: individual cases (circles, one color per Processuse level 1, 2, 3) and group centroids (squares), plotted with Function 1 on the horizontal axis (−4 to 4) and Function 2 on the vertical axis (−4 to 4)]
FIGURE 13.6 Two-dimensional output for three group average values on two discrimi-
nant functions
We can compare between the three groups on each measure to see which are the
farthest apart—the measures that have widely differing group values are the ones
that will be the best predictors of group membership. Here, again, we see that
ActCondA has very different values for each group, so it will definitely be the best
predictor. Others are clearly discriminating between two groups but not all three
(e.g., ColA between Groups 2 and 3, but not between 1 and 2), and still others
discriminate very little across groups (e.g., InstSupA).
A variety of figures are also provided in the output, depending on what we
have requested in the setup of the analysis. For our example data, we requested a
plot of the group centroids and individual case values for each function, as shown
in Figure 13.6. This curious figure provides a two-dimensional representation of
the ways in which the analysis is able to separate each group and each case. Note
that cases are individual circles (color-coded for each level of Processuse). The
squares (centroids) are essentially an average value for each group on the nine
predictor measures combined. We can read the graph in two ways: (a) look for
distance between each centroid from left to right (here we see a lot of distance
between Group 3 and the others, based on information from Function 1); and
(b) look for distance between each group from top to bottom (here, we do not
see much distance, although 1 and 2 are separated more from each other, but not
much from Group 3, based on Function 2). In essence, then, this figure is showing
that the analysis is pretty highly capable of separating groups by Function 1 (i.e.,
according to a certain set of predictor variables), and then marginally capable of
additional separation by Function 2 (i.e., according to another set of predictor
variables).
Finally, at the very end of the classification output is the information we are
probably most interested in. As shown in Table 13.6 for our example data, Dis-
crim estimates the numbers and percentages of individual cases whose group
membership has been correctly predicted by the combined functions. In other
words, using the combined information from the two functions (i.e., nine mea-
sures of assessment capacity in these data), the analysis was able to correctly pre-
dict the group membership of 73% of the cases (not bad!). We can also see that
TABLE 13.6 Classification Resultsa

                         Predicted Group Membership
            Processuse   1       2       3       Total
Original    Count   1    15       5       0       20
                    2     6      15       7       28
                    3     1       5      36       42
            %       1    75.0    25.0     .0     100.0
                    2    21.4    53.6    25.0    100.0
                    3     2.4    11.9    85.7    100.0

a. 73.3% of original grouped cases correctly classified.
the predictions were quite a bit higher for group 3 (i.e., the high assessment
use group), not bad for Group 1 (the low assessment use group), and quite a
bit weaker for Group 2 (the mid-assessment use group). Given that we were
interested in predicting three levels of the grouping variable, chance would suggest approximately 33% correct classification for each group; thus, while 53% is not very
accurate (half correct, half incorrect), it is actually quite a bit higher than chance
in this analysis.
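The classification table and its accuracy figures are straightforward to reproduce from case-level predictions. The sketch below rebuilds the reported cell counts and computes per-group and overall accuracy:

```python
import numpy as np

def classification_summary(actual, predicted, labels=(1, 2, 3)):
    """Cross-tabulate actual vs. predicted group membership, as in the
    classification results table above, plus per-group and overall accuracy."""
    table = np.array([[np.sum((actual == a) & (predicted == b))
                       for b in labels] for a in labels])
    per_group = table.diagonal() / table.sum(axis=1)
    overall = table.diagonal().sum() / table.sum()
    return table, per_group, overall

# The chapter's counts, rebuilt as case-level labels
actual    = np.repeat([1, 2, 3], [20, 28, 42])
predicted = np.concatenate([
    np.repeat([1, 2, 3], [15, 5, 0]),   # Group 1: 15 of 20 correct
    np.repeat([1, 2, 3], [6, 15, 7]),   # Group 2: 15 of 28 correct
    np.repeat([1, 2, 3], [1, 5, 36]),   # Group 3: 36 of 42 correct
])
table, per_group, overall = classification_summary(actual, predicted)
print(np.round(per_group, 3))  # per-group accuracy: .75, .536, .857
print(round(overall, 3))       # 0.733
```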
Reporting Findings
When reporting the findings of a Discrim, it is important to include sufficient
details regarding the nature of the grouping and predictor variables, how statistical
assumptions were checked (and any adjustments made), the setup of the analysis,
and the essential descriptive and inferential statistical details that will allow readers
to both understand the approach adopted and judge the findings on their own.
Following is a basic example of a report based on our example data set.
Results
A discriminant function analysis was conducted to predict the level of
Process use (i.e., learning from and acting on assessment information)
reported by foreign language programs based on several measures of assess-
ment capacity. Low, mid, and high Process use groups were determined by
self-reported scores on a separate survey. Nine predictor variables (mea-
sures of assessment capacity) were included in the analysis: institutional
support, institutional governance, infrastructure, program support, leader-
ship, culture/ethos of assessment, collaboration, communication, and activi-
ties/conditions for assessment. Multivariate assumptions for data quality
were met, and the relatively large sample size (N = 90) as well as sufficient
within-group sample sizes (n = 20, 28, 42) suggested that the analysis would
be robust to some variations in data quality between groups and predic-
tor variables, and despite inequality in group sample sizes. A test of equality of group means indicated statistically significant (p < .05) differences
between the three Process use groups on each of the nine variables, and
bivariate correlations between all pairs of variables ranged from .14 < r <
.61. Given that each variable on its own predicted group differences and
no evidence of multicollinearity was found, all predictor variables were
retained for further analysis. Box’s M ( p = .248) indicated no heterogene-
ity of variance-covariance, hence the subsequent multivariate discriminant
function analysis was inspected.
The analysis identified two discriminant functions, the first accounting
for the large majority (91%) of observable variance across the three Process
use levels. An overall statistically significant effect was found for the com-
bined functions (1 and 2), Wilks’ lambda = .359, χ2(18, N = 90) = 85.059,
p < .001, indicating that the combined predictor variables were able to
account for around 64% of the actual variance in Process use between
the three groups. On its own, the second function did not provide addi-
tional statistically significant predictions, Wilks’ lambda = .876, χ2(9,
N = 90) = 10.994, p = .202. As shown in Table 1, Function 1 was best
represented by the measure of activities/conditions for assessment, which
correlated at .883 with the function; additional strongly correlating mea-
sures included collaboration and communication. Function 2, by contrast,
was best represented by the measure of culture/ethos of assessment, which
correlated moderately (r = .603) with the function; note that institutional
governance correlated negatively and moderately with Function 2, but
positively with Function 1, suggesting a somewhat ambiguous relationship
between this variable and predictions of Process use.
TABLE 1 Structure matrix: correlations between predictor variables and the two discriminant functions

            Function
              1        2
ActCondA    .883*    –.333
ColA        .697*     .361
ComA        .626*     .072
LeadA       .592*    –.040
InfraA      .499*    –.393
ProgSupA    .350*     .137
InstSupA    .299*     .152
CulEthoA    .527      .603*
InstGovA    .369     –.402*

*. Largest absolute correlation between each variable and any discriminant function

Figure 1 shows the individual cases and group centroids (average values for each group) displayed in two dimensions: (a) from left to right, Function 1 clearly distinguishes between all three groups, and much more so between Group 3 (high Process use) and the other two; (b) from top to bottom, Function 2 additionally distinguishes between Groups 1 and 2, but less so between Group 3 and the other two groups.

FIGURE 1 Predicting process use: Cases and group centroids for two discriminant functions
Finally, Table 2 shows the classification results for the discriminant analysis. Overall, the combined Functions 1 and 2 were able to classify 73% of
the cases correctly into the three levels of Process use. Classification accu-
racy was much higher for Group 3 (the highest Process use level), with 86%
of cases predicted correctly, and substantially lower for Group 2, with only
54% of cases predicted correctly.
Note that the emphasis in reporting Discrim findings is on two types of effect
sizes: First, the strength of relationship between individual predictor variables and
each function is represented by correlations. Second, the overall quality of the
model is represented by the percentages of correctly classified cases; correlations
and percentages are both easily interpreted types of effect size. In this sense, Dis-
crim may automatically help researchers to move beyond relatively meaningless
yes/no statistical significance testing that characterizes typical interpretations of
MANOVA by offering a type of follow-up procedure that centers on magnitude
of relationships and patterns in the data.
SAMPLE STUDY
McNamara, D., Crossley, S., & McCarthy, P. (2010). Linguistic features of writing
quality. Written Communication, 27, 57–86.
Background
McNamara, Crossley, and McCarthy (2010) set out to determine whether
automated measures of cohesion and coherence, as well as syntactic com-
plexity, diversity of words, and other characteristics of words, all provided by
the Coh-Metrix tool, would be predictive of L2 English essays rated to be of
high versus low proficiency on a standardized rubric.
Methods
General: N = 120 English L2 argumentative essays written by college
students
Grouping variable: Essays rated on a six-point holistic rating scale for writing
proficiency; grouped into high (ratings of 4, 5, 6) or low (ratings of 1, 2, 3)
proficiency levels.
Predictor variables: Significance tests were used to determine which indi-
vidual variables, among an initial pool of 53 indices extracted automatically
by Coh-Metrix, showed differences between the two groups. Interestingly,
none of the measures of coherence/cohesion predicted group differences;
the single measure with the largest effect size from each category of predictors was then selected for Discrim.
Discriminant Analysis: Direct Discrim was run on a “training” set of n = 80
cases, and subsequently cross-validated on a “test” set of the remaining
n = 40 cases.
Results
On the training set, Discrim classified essays with 65% accuracy (68% for low-rated essays, 62% for high-rated essays). On the test set, Discrim classified essays with 70% accuracy (73% for low-rated essays, 67% for high-rated essays).
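The training/test design used in this study can be sketched as follows (simulated data standing in for the Coh-Metrix indices; the split sizes mirror the study's n = 80 and n = 40):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Simulated stand-in for the 120 essays (the real Coh-Metrix data are not
# reproduced here): a few index scores shifted by rated proficiency group.
rng = np.random.default_rng(7)
y = np.repeat([0, 1], 60)                   # 0 = low-rated, 1 = high-rated
X = rng.normal(size=(120, 5)) + y[:, None]  # higher scores for high-rated

# Train on n = 80, then cross-validate on the held-out n = 40
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=80, stratify=y, random_state=7)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print(round(lda.score(X_tr, y_tr), 2))  # training accuracy
print(round(lda.score(X_te, y_te), 2))  # held-out (cross-validated) accuracy
```

Cross-validating on held-out cases, as McNamara et al. did, guards against the overfitting concern raised earlier in the chapter.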
Further Reading
Examples of L2 Studies Employing Discriminant Analysis
Brown, Robson, & Rosenkjar (2001): Inquired into the relationships of motivation, personality, anxiety, and learning strategies with the L2 proficiency of Japanese learners
of English. Direct Discrim was used to predict learner membership in high, medium,
and low proficiency groups (as determined by a cloze test), and findings indicated that
one set of variables distinguished between low proficiency and the other two groups,
while a second set of variables distinguished between middle and high proficiency
groups, though less accurately.
Collentine (2004): Investigated gain scores on Spanish L2 oral interviews by students in
two distinct learning contexts, study abroad versus regular instruction at home, over
the period of one semester of instruction. Direct and stepwise Discrim were utilized
to indicate which grammatical and lexical features in the oral production best classify
learners in the two learning contexts.
Crossley & McNamara (2009): Compared L1 and L2 written texts on the basis of
numerous measures from the Coh-Metrix computational tool (including, e.g.,
measures of cohesion, text difficulty, and lexical frequency). Multiple Discrim analy-
ses identified the optimal number of variables for achieving maximum distinction
between the two text types; note the use of cross-validation with half of the texts.
Nassaji (2003): Investigated the role of syntactic, semantic, word recognition, and grapho-
phonic processes in determining reading comprehension by adult ESL learners.
Direct Discrim resulted in high levels of classification accuracy between low-skilled
and high-skilled reading groups, most effectively due to lexical-semantic processes but
also attributable to other measured skills. Note the use of cross-validation in this study,
based on two halves of the learner data.
Zheng, Cheng, & Klinger (2007): Examined possible test-method effects of three dif-
ferent item formats (multiple choice, constructed response, and constructed response
with explanations) on the reading comprehension scores of ESL versus non-ESL
examinees. Direct Discrim suggested that while the multiple choice format was able
to distinguish between the two groups significantly, it did so only to a small degree
and it correlated highly with scores on the other two formats, indicating that item
format did not have a substantial effect on overall comprehension score differences
between the two groups.
Discussion Questions
1. First, try to replicate the analyses above with the DISCRIM data set provided
along with this chapter on the book’s companion website (http://oak.ucc.nau.
edu/ldp3/AQMSLR.html). Was your analysis successful? Did you find the same
patterns in the data? Next, access a different data set that has been analyzed
already with MANOVA, and conduct a Discrim analysis using the measured
variables to predict membership in one grouping variable. Compare the two sets
of output from the two multivariate analyses: What research questions can you
answer on the basis of Discrim that could not be answered with MANOVA?
2. Using the same data, conduct several additional Discrim analyses, each one
investigating a different combination of measures (e.g., remove the most pre-
dictive single measure from the first analysis and run Discrim again). In what
ways do the findings differ from one analysis to the next? How do you
Acknowledgments
I would like to thank two anonymous reviewers and Luke Plonsky for construc-
tive feedback on this chapter. I am also indebted to John Davis for making a
portion of his Ph.D. dissertation data available for the purpose of demonstrating
Discrim here. Lastly, I thank my Advanced Statistics students at Georgetown Uni-
versity for their questions and insights related to multivariate statistics, and J. D.
Brown for introducing me to Discrim and other analytic techniques.
References
Biber, D. (2003). Variation among university spoken and written registers: A new multidimensional analysis. In C. Meyer & P. Leistyna (Eds.), Corpus analysis: Language structure
and language use (pp. 47–70). Amsterdam: Rodopi.
Brown, J., Robson, G., & Rosenkjar, P. (2001). Personality, motivation, anxiety, strategies,
and language proficiency of Japanese students. In Z. Dörnyei & R. Schmidt (Eds.),
Motivation and second language acquisition (pp. 361–398). Honolulu: University of Hawai‘i,
Second Language Teaching and Curriculum Center.
Collentine, J. (2004). The effects of learning contexts on morphosyntactic and lexical
development. Studies in Second Language Acquisition, 26, 227–248.
Collentine, J., & Collentine, K. (2013). A corpus approach to studying structural conver-
gence in task-based Spanish L2 interactions. In K. McDonough & A. Mackey (Eds.),
Second language interaction in diverse educational contexts (pp. 167–188). Amsterdam: John
Benjamins.
Hypothesis Testing
In classical hypothesis testing, the null hypothesis that all means are equal is tested
against an alternative that specifies the means are not equal. Using the one-way
ANOVA, researchers first seek to reject the null hypothesis using a preset level of
statistical significance. In many applied research questions, the researcher enter-
tains a hypothesis of mean inequalities, often with a specific hypothesis based on
a one-tailed probability distribution. That is, the researcher hypothesizes not only
that the means differ, but also that they will differ in favor of one of the groups.
This approach is common in the classical experimental group versus control group
contrast, where the null hypothesis predicated on random variation is the bench-
mark against which significant or nonrandom differences are inferred. An alterna-
tive to the null hypothesis–based analysis of variance approach is one grounded
on an informed or theory-driven hypothesis about the ordering of mean scores.
Recent trends in Bayesian data analysis described by Ntzoufras (2009), Kruschke (2011), and Lunn, Jackson, Best, Thomas, and Spiegelhalter (2012), for example,
afford alternatives to null hypothesis testing and are optimal when researchers are
testing hypotheses predicated on grounded theoretical arguments, and in the pres-
ent illustration, for testing framework-driven operationalizations of proficiency.
The conceptual difference between null hypothesis testing and the Bayesian
alternative is that predictions about mean differences are stated a priori in a hier-
archy of differences as motivated by theory-driven claims. Prior research thereby
informs the hypotheses and in Bayesian terms we test an informative hypoth-
esis. In this approach, the null hypothesis is typically superfluous, as the researchers aim to confirm that the predicted order of mean differences is instantiated
in the data. Support for the hierarchically ordered means hypothesis is evident
330 Beth Mackey and Steven J. Ross
only if the predicted order of mean differences is observed. The predicted and
plausible alternative hypotheses must therefore be expressed in advance of the data
analysis, making the subsequent ANOVA confirmatory. In addition to the
advantage of avoiding superfluous null hypothesis testing, the Bayesian approach
avoids the pitfalls of post hoc comparisons of means and the awkward specifi-
cations of planned comparisons across means. The Bayesian approach outlined
in this chapter provides a straightforward system for ordering hypotheses about
mean differences prior to data analysis.
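The logic of weighing an order-constrained (informative) hypothesis against an unconstrained alternative can be illustrated with a small Monte Carlo sketch of the encompassing-prior idea associated with this approach: the Bayes factor of the order hypothesis is the proportion of posterior draws satisfying the constraint (fit) divided by the proportion expected when no constraint is imposed (complexity). This is only a sketch; the posterior means and standard errors below are invented for illustration and do not come from the chapter's data.

```python
import math
import random

# Sketch: Bayes factor for an informative order hypothesis
# H1: mu1 < mu2 < mu3 < mu4 < mu5 versus the unconstrained Ha,
# via the encompassing-prior ratio fit / complexity.

def order_holds(ms):
    """True if the means are in strictly increasing order."""
    return all(a < b for a, b in zip(ms, ms[1:]))

post_mu = [-0.9, -0.4, 0.1, 0.6, 1.1]   # hypothetical posterior means (logits)
post_se = [0.15] * 5                    # hypothetical posterior SDs

random.seed(1)
n_draws = 50_000
hits = sum(
    order_holds([random.gauss(m, s) for m, s in zip(post_mu, post_se)])
    for _ in range(n_draws)
)
fit = hits / n_draws                    # posterior proportion satisfying H1
complexity = 1 / math.factorial(5)      # 1 of 120 equally likely orderings a priori
bf_1a = fit / complexity                # Bayes factor of H1 against Ha
```

With well-separated means, nearly all posterior draws respect the predicted order, so the Bayes factor strongly favors H1 over the unconstrained alternative.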
this level. How well test developers can consistently operationalize the proficiency
framework is fundamental to the construct validity of any framework-based test. In
a five-level framework, the expected hierarchy stipulates that the ordering of means
for any given language test would be: μ1 < μ2 < μ3 < μ4 < μ5. This hierarchy
specifically predicts that the mean of the lowest-level items will be distinctly lower
than that of the next higher level, such that the mean difficulties of the items will
separate into five minimally overlapping distributions along a continuum of dif-
ficulty. As noted earlier, the hierarchical prediction of the ordering of item-difficulty
means provides an explicit test of the validity of the framework for item construction.
To date, a number of frameworks have been proposed in applied linguistics
to describe how assessments can be designed to measure gradations of language
proficiency, and to provide criteria to interpret validity claims. Currently used
frameworks include the Common European Framework of Reference (http://
www.coe.int/t/dg4/linguistic/cadre1_en.asp), the American Council on the
Teaching of Foreign Languages (ACTFL) Proficiency Guidelines (http://www.actfl.org/pub-
lications/guidelines-and-manuals/actfl-proficiency-guidelines-2012), and the
Interagency Language Roundtable Skill Level Descriptions (http://govtilr.org/
Skills/ILRscale1.htm). Most language assessment frameworks are predicated on
functional descriptions of how language is used in a range of contexts represent-
ing varied social and employment-related domains. Test developers strive to sam-
ple specimens of language from those contexts, and to construct items and tasks
that reflect comprehension of propositional content appearing within them, and
in the case of spoken or written language assessments, learners’ ability to speak or
write coherently with fluency and accuracy on tasks representing the functional
domains of interest. A fundamental assumption is that the specimens of language
used for test construction can be accurately arrayed along a continuum of diffi-
culty, and that items written to assess comprehension are matched to the passages
and texts along that ordered continuum of difficulty. Crucial for a validity argu-
ment for frameworks based on subjective classification of language specimens is
the accuracy of the classification system itself.
Test developers select texts and passages and write items at each of the levels
covered by the test. Using this framework, test developers operationalize the scale
by producing test passages targeted at each level. The present study used a sample
of 1,889 test takers drawn from a reading test.
The set of items on the test was initially subjected to a Rasch analysis in order
to estimate the difficulty of each item (see Knoch & McNamara, Chapter 12 in
this volume). As each item has been preclassified by test designers according to
an intended level of difficulty, and extensively checked by moderation panels, the
expectation is that the hierarchy of item difficulty will be corroborated with an
empirical confirmation of the actual item difficulties.
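As a rough illustration of what the logit difficulty scale expresses, an item's difficulty can be approximated by the negative log-odds of its facility (proportion correct). This is not the estimation routine used operationally, which iterates jointly over person abilities and item difficulties; the function name and facility values below are ours, for illustration only.

```python
import math

# Crude first approximation to Rasch item difficulty:
# the negative log-odds of the item's facility (proportion correct).
def approx_difficulty(p_correct):
    return -math.log(p_correct / (1 - p_correct))

# Hypothetical facilities for items written to Levels 1 through 5
facilities = [0.85, 0.70, 0.55, 0.40, 0.25]
logits = [round(approx_difficulty(p), 2) for p in facilities]
# lower facility (a harder item) maps to a higher logit difficulty
```

An item answered correctly by half of the sample sits at 0 logits; harder items rise above 0 and easier items fall below it, which is the ordering the person-item map displays.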
The Rasch item analysis performed on the test generates the observed dif-
ficulty of each item on a logit scale. All of the items on each test are placed on a
relative continuum of facility from easy to difficult. For each group of test takers,
the proficiency of individuals is arrayed in a person-item map showing the dif-
ficulty of test items relative to the ability of test takers. Figure 14.1 shows a sche-
matic array of items from another test form (right) with persons (hashed marks
on left) on the same scale of reference.

[FIGURE 14.1 Person-item map: test takers (hash marks, left column) and items
10001–10070 (right column) arrayed on a common logit scale, with items banded
from Level 0 (easiest) through Level 5 (most difficult).]
Although different methods of setting standards exist, a widely used method
for making proficiency level decisions was based on an analysis of the item pool,
with the cut score for each level set at the person ability estimate whose probability
of a correct response corresponds to a raw score of roughly 70% correct on
the items preidentified as indicators of proficiency at each level. Figure 14.1
illustrates this item-based approach. Test takers able to correctly answer 70% of
the within-category items, as well as 90% of easier items, were deemed to be
proficient at the threshold cut score level. Accordingly, a test taker with an ability
to answer 70% of the Level 3 items in Figure 14.1 would be deemed proficient
at Level 3, but not at Level 4. As noted earlier, the validity of the 70% cut score
decision point is strongly predicated on the homogeneity of items written to the
intended level of difficulty. As the cut score decision point in principle applies to
all languages tested, it presents both a convenient and homogeneous method for
defining common benchmarks for proficiency as well as a formidable validation
challenge. The analyses to follow test a key assumption of the item-based method
of setting standards empirically.
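Under the Rasch model, the 70% criterion has a direct interpretation on the logit scale: a test taker at the cut score sits ln(.7/.3) ≈ 0.85 logits above the mean difficulty of the items at that level. A minimal sketch (the function name is ours, not the test program's):

```python
import math

# Rasch model: p(correct) = 1 / (1 + exp(-(theta - b))).
# Solving for theta at a target probability p gives the ability
# needed to answer items of difficulty b with probability p.
def theta_at_prob(b, p):
    return b + math.log(p / (1 - p))

threshold = theta_at_prob(0.0, 0.70)   # offset for the 70% cut, about 0.85 logits
mastery = theta_at_prob(0.0, 0.90)     # offset for 90% on easier items, about 2.2 logits
```

The same arithmetic underlies the person-item map: a candidate placed 0.85 logits above a band of items is expected to answer about 70% of them correctly.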
Bayesian Informative Hypothesis Testing 333
The difficulty hierarchy predicts that the average logit difficulty of Level 1 items
(μ1) will be systematically less than the mean of items written to measure pro-
ficiency at Level 2 (μ2). The hierarchy expressed in H1 is thus predicted to hold
across languages, as the language assessment framework examined here is written
to be independent of which particular language is described by the scale. The
prediction is that the mean difficulties of test items for the language sampled will
concur with the hierarchy of difficulty. In this sense the intended analyses are
confirmatory: Either H1 is corroborated by the observed data or it is not. Support
for the measurement framework in general will be evidenced through consistent
confirmation of the hierarchy of difficulty.
For a confirmatory model to function, plausible alternatives to the theory-
driven order of hypotheses need to be articulated explicitly. A plausible alter-
native hypothesis, H2, would predict that adjacent categories collapse into the
lower-level categories, suggesting that there is no systematic difference between
particular levels. For instance, foreign language passages chosen to represent Level
2 reading proficiency share characteristics with Level 1 passages, but are selected
such that they contain additional linguistic complexity to make comprehension
slightly more challenging. Similarly, Level 2 items are constructed to entail slightly
more complexity than Level 1 items. If Level 2 items are not in fact any more dif-
ficult than Level 1 items, hypothesis H2 would be supported by the empirical fact
that the logits of difficulty of Levels 1 and 2 will conflate into an indistinguishable
range. Test developers and item writers are well-acquainted with the difficulty of
fine-tuning the linguistic content of items differentiating Levels 1 versus 2 and
Levels 3 versus 4. Thus, on a five-level scale, not only might Level 2 items collapse
down to Level 1, but Level 4 may also conflate into Level 3: μ1 = μ2 < μ3 = μ4 < μ5 (H2).
This is deemed the second most likely outcome, given the item devel-
opment and writing process. Even after considerable moderation by item review-
ers, items at Levels 2 and 4 may be indistinguishable from the next-lower category
of difficulty.
A second plausible alternative hypothesis, H3, predicts that in a five-level hier-
archy adjacent levels collapse into ranges of proficiency at the threshold of the
next higher level. Correspondingly, items written for a Level 2 reading passage
require proficiency nearer to the next higher base level (Level 3) than the level
below it, and Level 4 items are indistinguishable empirically from Level 5 items.
This possibility is less likely than H2, as items at the extremes of the hierarchy are
expected to be relatively easy to write and moderate.
Group codes: 1 = Level 1, 2 = Level 2, 3 = Level 3, 4 = Level 4, 5 = Level 5

Null (H0): μ1 = μ2 = μ3 = μ4 = μ5
Agnostic (Ha): μ1, μ2, μ3, μ4, μ5 (unconstrained)
Hypothesis 1 (H1, predicted): μ1 < μ2 < μ3 < μ4 < μ5
Hypothesis 2 (H2, "collapse down"): μ1 = μ2 < μ3 = μ4 < μ5
Hypothesis 3 (H3, "fold up"): μ1 < μ2 = μ3 < μ4 = μ5
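The competing hypotheses can be made concrete as simple predicate checks on a vector of five estimated mean difficulties. This is only a descriptive illustration, with an arbitrary tolerance standing in for "equal"; it is not the Bayesian model selection the chapter performs, and the tolerance value is ours.

```python
# Predicate versions of the competing hypotheses on a vector of
# five mean difficulties m = [mu1, mu2, mu3, mu4, mu5].
TOL = 0.1  # arbitrary tolerance standing in for "equal"

def h1(m):
    """Predicted hierarchy: strictly increasing means."""
    return all(a < b for a, b in zip(m, m[1:]))

def h2(m):
    """'Collapse down': mu1 = mu2 < mu3 = mu4 < mu5."""
    return (abs(m[0] - m[1]) < TOL and abs(m[2] - m[3]) < TOL
            and m[1] < m[2] and m[3] < m[4])

def h3(m):
    """'Fold up': mu1 < mu2 = mu3 < mu4 = mu5."""
    return (m[0] < m[1] and abs(m[1] - m[2]) < TOL
            and m[2] < m[3] and abs(m[3] - m[4]) < TOL)
```

Stating the hypotheses this explicitly, before looking at the data, is what makes the subsequent analysis confirmatory rather than exploratory.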
Step 5. Open the Comparison of Means software and click on the Data button
to import the text file (see Figure 14.3). A Data Input screen will open. Click
on Browse Data File and navigate to the text file to import your data set into the
Comparison of Means software. Once the data are imported, the program will
automatically validate that your data set is in the expected format. Click OK.
The upper left corner of the Data Input screen will be updated to reflect the
number of groups and the number of observations (items) in each group, as seen
in Figure 14.4.
Step 6. Comparison of Means offers six methods for testing and comparing
means following Kuiper and Hoijtink (2010). These fall into two overall categories,
exploratory and confirmatory, with the methods in each involving hypothesis
testing, model selection, and Bayesian model selection (BMS). These methods allow us to examine each
language test and build an understanding of how a framework-driven hierarchy
of item difficulty is supported by the data. For the present case, we will use only
a confirmatory approach using BMS to investigate the relationship between the
intended levels and the actual outcomes.
The criterion for the confirmatory approach will be a posterior model prob-
ability (PMP) that favors H1, the ordered mean difficulty hypothesis in Table 14.3. The
data are expected to corroborate H1 as the most probable outcome.
TABLE 14.3 Comparison of Means software (exploratory and confirmatory tests) (Kuiper
and Hoijtink, 2010)
Step 8. Add the null hypothesis under “Specify models for confirmatory
methods.” As noted earlier, most Bayesian analyses do not entertain random out-
comes as a viable alternative to a specific hierarchy of means. The null hypoth-
esis is included here for didactic purposes. The first order-restricted hypothesis is
added with the Add button; a field will appear in the large text box on
the right. Our first hypothesis of interest is the preferred hypothesis, where mean
item difficulties are predicted to increase as the level increases. To represent these
relationships in the tool, we need to enter “1 < 2 < 3 < 4 < 5”. To enter the first
hypothesis of interest, type “1” in the first field and use the pull-down menu to
select <; enter “2” into the next field and continue entering the restrictions until
all are captured. See Figure 14.6 for an illustration of this process.
Once the first hypothesis is entered, clicking on the Add button allows each
subsequent hypothesis of interest to be entered in sequence. Each order-restricted
hypothesis can then be viewed by clicking on the H1, H2, and H3 tabs (see Fig-
ure 14.7). We are now ready to run the analysis.
Step 9. Once all hypotheses have been specified and ordered, Comparison of
Means is ready for execution with a press of the Run button (see Figure 14.8).
The Bayesian estimation approach utilizes a Markov-Chain Monte-Carlo
(MCMC) estimation of posterior probabilities based on the observed data. The
MCMC approach starts each variable from a random starting point drawn
from the sample distribution and generates each new value from the value of its
immediate predecessor, until the values of the variable eventually form a marginal
distribution independent of the original starting point (Lunn, Jackson, Best,
Thomas, & Spiegelhalter, 2012). In
the present analysis, the Markov Chain is instantiated through the Gibbs sampler,
which iteratively resamples from the conditional distribution of values derived
from prior estimations until, after many thousands of iterations, it arrives at a final
posterior distribution. The preferred hypotheses in the present study predict that
the data means will follow the ordering of mean item difficulties as specified in
H1. To the extent that any of the alternative models better fit the observed test
5.25 times more likely (42.8/8.14) than the "collapse down" alternative hypoth-
esis, H2. The PMP (see Figure 14.9) estimates H1 to be unambiguously the most
probable with a PMP of .82, given the data. The posterior model probability for
each hypothesis is the Bayes factor for that particular hypothesis divided by the
sum of the Bayes factors for all of the hypotheses tested. The Bayes factor and the
PMP estimates suggest that the ordering of mean difficulties predicted by the
item design specification and construction system is corroborated by the empiri-
cal data based on large samples of test takers.
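The arithmetic linking Bayes factors to posterior model probabilities is easy to verify. The 42.8 and 8.14 values are those reported above; the Bayes factors for the remaining hypotheses are hypothetical fillers added for illustration.

```python
# PMP_i = BF_i / (sum of Bayes factors over all hypotheses tested).
# H1 and H2 values are reported in the chapter; H3 and H0 are
# hypothetical fillers for illustration.
bayes_factors = {"H1": 42.8, "H2": 8.14, "H3": 1.0, "H0": 0.3}

total = sum(bayes_factors.values())
pmp = {h: bf / total for h, bf in bayes_factors.items()}
ratio = bayes_factors["H1"] / bayes_factors["H2"]  # H1 about 5.3 times H2
```

Because the PMPs are normalized Bayes factors, they sum to one across the hypothesis set, which is why each can be read directly as the probability of that hypothesis given the data and the candidate models.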
The Bayesian confirmatory approach provides a
useful diagnostic for identifying languages and modalities in the test development
framework that might be in need of further development and moderation.
Discussion
Confirmatory analyses were conducted to both illustrate the potential for an
ANOVA software program, Comparison of Means (Kuiper and Hoijtink, 2010),
Further Reading
Hoijtink, H., Klugkist, I., & Boelen, P. (2008). Bayesian evaluation of informative hypotheses.
New York: Springer.
Kruschke, J. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. New York:
Academic Press.
Discussion Questions
1. Discuss when a null hypothesis test for an ANOVA design is superfluous and
when a Bayesian approach might be more appropriate.
2. Discuss what makes a hypothesis “informative.”
3. Identify areas of applied linguistics that are amenable to ordered hypotheses.
4. Discuss how meta-analysis results can inform informative hypothesis testing.
5. Sketch out a study on corrective feedback on L2 writing (or another area of
L2 research) that predicts an ordering of different conditions or treatments.
How and why might the approach illustrated in this chapter be used to
examine such a set of predictions?
References
Hoijtink, H. (2012). Informative hypotheses. New York: Chapman & Hall/CRC.
Hoijtink, H., Klugkist, I., & Boelen, P. (2008). Bayesian evaluation of informative hypotheses.
New York: Springer.
Kruschke, J. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. New York:
Academic Press.
Kuiper, R. M., & Hoijtink, H. (2010). Comparisons of means using exploratory and confir-
matory approaches. Psychological Methods, 15(1), 69–86. doi:10.1037/a0018720
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book:
A practical introduction to Bayesian analysis. London: Taylor & Francis.
Ntzoufras, I. (2009). Bayesian modeling using WinBUGS. New York: John Wiley & Sons.