METHODS IN SECOND
LANGUAGE RESEARCH
Birdsong
Second Language Acquisition and the Critical Period Hypotheses (1999)
Ohta
Second Language Acquisition Processes in the Classroom: Learning Japanese (2001)
Major
Foreign Accent: Ontogeny and Phylogeny of Second Language Phonology (2001)
VanPatten
Processing Instruction: Theory, Research, and Commentary (2003)
VanPatten/Williams/Rott/Overstreet
Form-Meaning Connections in Second Language Acquisition (2004)
Bardovi-Harlig/Hartford
Interlanguage Pragmatics: Exploring Institutional Talk (2005)
Dörnyei
The Psychology of the Language Learner: Individual Differences in Second
Language Acquisition (2005)
Long
Problems in SLA (2007)
VanPatten/Williams
Theories in Second Language Acquisition (2007)
Ortega/Byrnes
The Longitudinal Study of Advanced L2 Capacities (2008)
Liceras/Zobl/Goodluck
The Role of Formal Features in Second Language Acquisition (2008)
Philp/Adams/Iwashita
Peer Interaction and Second Language Learning (2013)
VanPatten/Williams
Theories in Second Language Acquisition, Second Edition (2014)
Leow
Explicit Learning in the L2 Classroom (2015)
Dörnyei/Ryan
The Psychology of the Language Learner—Revisited (2015)
Yule
Referential Communication Tasks (1997)
Gass/Mackey
Stimulated Recall Methodology in Second Language Research (2000)
Markee
Conversation Analysis (2000)
Gass/Mackey
Data Elicitation for Second and Foreign Language Research (2007)
Duff
Case Study Research in Applied Linguistics (2007)
McDonough/Trofimovich
Using Priming Methods in Second Language Research (2008)
Dörnyei/Taguchi
Questionnaires in Second Language Research: Construction, Administration, and
Processing, Second Edition (2009)
Bowles
The Think-Aloud Controversy in Second Language Research (2010)
Jiang
Conducting Reaction Time Research for Second Language Studies (2011)
Barkhuizen/Benson/Chik
Narrative Inquiry in Language Teaching and Learning Research (2013)
Jegerski/VanPatten
Research Methods in Second Language Psycholinguistics (2013)
Larson-Hall
A Guide to Doing Statistics in Second Language Research Using SPSS and R,
Second Edition (2015)
Plonsky
Advancing Quantitative Methods in Second Language Research (2015)
Of Related Interest:
Gass
Input, Interaction, and the Second Language Learner (1997)
Gass/Sorace/Selinker
Second Language Learning Data Analysis, Second Edition (1998)
Mackey/Gass
Second Language Research: Methodology and Design (2005)
Edited by
Luke Plonsky
NORTHERN ARIZONA UNIVERSITY
First published 2015
by Routledge
711 Third Avenue, New York, NY 10017
and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2015 Taylor & Francis
The right of Luke Plonsky to be identified as the author of the editorial
material, and of the authors for their individual chapters, has been asserted in
accordance with sections 77 and 78 of the Copyright, Designs and Patents
Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or
utilised in any form or by any electronic, mechanical, or other means, now
known or hereafter invented, including photocopying and recording, or in
any information storage or retrieval system, without permission in writing
from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered
trademarks, and are used only for identification and explanation without
intent to infringe.
Library of Congress Cataloging-in-Publication Data
Plonsky, Luke.
Advancing quantitative methods in second language research / Luke
Plonsky, Northern Arizona University.
pages cm. — (Second Language Acquisition Research Series)
Includes bibliographical references and index.
1. Second language acquisition—Research. 2. Second language
acquisition—Data processing. 3. Language and languages—Study and
teaching—Research. 4. Language acquisition—Research. 5. Language
acquisition—Data processing. 6. Quantitative research. 7. Multilingual
computing. 8. Computational linguistics. I. Title.
P118.2.P65 2015
401'.93—dc23
2014048744
ISBN: 978-0-415-71833-2 (hbk)
ISBN: 978-0-415-71834-9 (pbk)
ISBN: 978-1-315-87090-8 (ebk)
Typeset in Bembo
by Apex CoVantage, LLC
For Pamela
CONTENTS
List of Illustrations xi
Acknowledgments xvii
List of Contributors xix
PART I
Introduction 1
1 Introduction 3
Luke Plonsky
PART II
Enhancing Existing Quantitative Methods 21
PART III
Advanced and Multivariate Methods 129
Index 347
ILLUSTRATIONS
FIGURES
TABLES
Introduction
1
INTRODUCTION
Luke Plonsky
The field has seen a rapid increase in its awareness of methodological issues in the last
decade. Evidence of this movement, which holds that methodological rigor and
transparency are critical to advancing our knowledge of L2 learning and teaching,
is found in meta-analyses (e.g., Norris & Ortega, 2000), methodological syntheses
(e.g., Hashemi & Babaii, 2013; Plonsky & Gass, 2011), methodologically oriented
conferences and symposia (e.g., the Language Learning Currents conference in
2013), and a number of article- and book-length treatments raising method-
ological issues (e.g., Norris, Ross, & Schoonen, in press; Plonsky & Oswald, 2014;
Porte, 2012). This book aims to both contribute to and benefit from the momentum in this area, serving as a catalyst for much additional work seeking to advance
the means by which L2 research is conducted.
Themes
In addition to the general aim of moving forward quantitative L2 research, three
major themes present themselves across the volume. The first and most prevalent
theme is the role of researcher judgment in conducting each of the analyses pre-
sented here. Results based on statistical analyses can obscure the decisions made
throughout the research process that led to those results. As Huff (1954) states in
the now-classic How to Lie with Statistics, “despite its mathematical base, statistics
is as much an art as it is a science” (p. 120). As noted throughout this book, decision points abound in more advanced and multivariate statistics. These procedures
involve multiple steps and are particularly subject to the judgment of individual
researchers. Consequently, researchers must develop and combine not only sub-
stantive but also methodological/statistical expertise in order for the results of
such analyses to maximally inform L2 theory, practice, and future research.
The second theme, transparency, builds naturally on the first. Appropriate deci-
sion making is a necessary but insufficient requisite for the theoretical and/or
practical potential of a study to be realized. Choices made throughout the process
must also be justified in the written report, giving proper consideration to the
strengths and weaknesses resulting from each decision relative to other avail-
able options. Consumers of research can then more adequately and confidently
interpret study results. Of course, the need for transparency applies not only to
methodological procedures but also to the reporting of data (see Larson-Hall &
Plonsky, in press).
The third major theme found throughout this volume is the interrelatedness of
the procedures presented. Statistical techniques are often presented and discussed
in isolation despite great conceptual and statistical commonalities. ANOVA and
multiple regression, for example, are usually considered—and taught—as distinct
statistical techniques. However, ANOVA can be considered a type of regression
with a single, categorical predictor variable; see Cohen’s (1968) introduction to
the general linear model (GLM). The relationship between these procedures
can also be demonstrated statistically: The eta-squared effect size yielded by an
ANOVA will be equal to the R² from a multiple regression based on the same
independent/predictor and dependent/criterion variables. Both indices express
the amount of variance the independent variable accounts for in the dependent
variable. Whenever applicable, the chapters in this volume have drawn attention
to such similarities and shared utility among procedures.
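This equivalence is easy to verify by hand. The sketch below uses Python purely for illustration (its standard library suffices; the scores are invented, not data from this volume). It computes eta-squared from the one-way ANOVA sums of squares and R² from a regression on a dummy-coded group predictor:

```python
# Invented scores for two instructional groups (hypothetical data).
groups = {"traditional": [12, 15, 14, 10, 13], "experimental": [16, 18, 15, 19, 17]}

scores = [s for g in groups.values() for s in g]
grand_mean = sum(scores) / len(scores)

# Eta-squared from ANOVA: between-groups sum of squares over total sum of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
ss_total = sum((s - grand_mean) ** 2 for s in scores)
eta_squared = ss_between / ss_total

# R-squared from a regression of score on a dummy-coded (0/1) group predictor.
x = [code for code, g in enumerate(groups.values()) for _ in g]
mean_x = sum(x) / len(x)
cov = sum((xi - mean_x) * (s - grand_mean) for xi, s in zip(x, scores))
ss_x = sum((xi - mean_x) ** 2 for xi in x)
r_squared = cov ** 2 / (ss_x * ss_total)

print(round(eta_squared, 4), round(r_squared, 4))  # identical values
```

With more than two groups, the same identity holds once the grouping variable is coded into k − 1 dummy predictors.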
Software
One of the challenges in preparing and using a book like this one is choosing the statistical software. Such a decision involves considering accessibility, cost, user friendliness, and consistency across chapters, among other issues. Furthermore, there are numerous options available, each of which possesses a unique set of strengths and weaknesses. IBM’s SPSS, for example, is very user friendly but can be costly. The default settings in SPSS can also lead to users not understanding the choices that the program makes for them (e.g., Mizumoto & Plonsky, in review; Plonsky & Gonulal, in press).

TABLE 1.1 Software used and available for procedures in this book
As shown in Table 1.1, most analyses in this book have been demonstrated using SPSS. To a much lesser extent, Microsoft Excel and R (R Development Core Team, 2014) have also been used along with, in a small number of cases, more specialized packages.
References
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70, 426–443.
Gass, S. (2009). A survey of SLA research. In W. Ritchie & T. Bhatia (Eds.), Handbook of
second language acquisition (pp. 3–28). Bingley, UK: Emerald.
Hashemi, M. R., & Babaii, E. (2013). Mixed methods research: Toward new research designs in applied linguistics. Modern Language Journal, 97, 828–852.
Huff, D. (1954). How to lie with statistics. New York: Norton & Company.
Larson-Hall, J. (2015). A guide to doing statistics in second language research using SPSS and R.
New York: Routledge.
Larson-Hall, J., & Plonsky, L. (in press). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning,
65, Supp. 1, 125–157.
Loewen, S., Lavolette, B., Spino, L. A., Papi, M., Schmidtke, J., Sterling, S., et al. (2014).
Statistical literacy among applied linguists and second language acquisition researchers.
TESOL Quarterly, 48, 360–388.
Mizumoto, A., & Plonsky, L. (in review). R as a lingua franca: Advantages of using R for
quantitative research in applied linguistics. Manuscript under review.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50, 417–528.
Norris, J. M., Ross, S., & Schoonen, R. (Eds.) (in press). Improving and extending quantitative
reasoning in second language research. Malden, MA: Wiley.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodologi-
cal synthesis and call for reform. Modern Language Journal, 98, 450–470.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Gonulal, T. (2015). Methodological reviews of quantitative L2 research: A
review of reviews and a case study of exploratory factor analysis. Language Learning,
65, Supp. 1, 9–35.
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Porte, G. (Ed.) (2012). Replication research in applied linguistics. New York: Cambridge Uni-
versity Press.
R Development Core Team. (2014). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychol-
ogy journals: Guidelines and explanations. American Psychologist, 54, 594–604.
2
WHY BOTHER LEARNING
ADVANCED QUANTITATIVE
METHODS IN L2 RESEARCH?
James Dean Brown
There is so much more to be learned from using follow-up analyses and more still from
thinking about all of your results as one comprehensive picture of what is going
on in your data.
the picture at the same time, and might therefore see relationships between and
among variables (all at once) that you might otherwise have missed or failed to
understand.
Indeed, you will gain an even more comprehensive view of the data and results
for a particular area of research by learning about and applying an advanced
technique called meta-analysis. As Plonsky and Oswald explain (Chapter 6 in this
volume), meta-analysis can be defined narrowly as “a statistical method for cal-
culating the mean and the variance of a collection of effect sizes across studies,
usually correlations (r) or standardized mean differences (d )” or broadly as “not
only these narrower statistical computations, but also the conceptual integration
of the literature and the findings that gives the meta-analysis its substantive meaning” (p. 106). Truly, this advanced form of analysis will give you the much broader
perspective of comparing the results from a number of (sometimes contradictory)
studies in the same area of research.
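In the narrow sense, then, the core of a meta-analysis is a weighted mean and variance of the collected effect sizes. The sketch below uses invented d values and simple sample-size weighting (published meta-analyses more often use inverse-variance weights, but the logic is the same; Python is used only because its standard library suffices):

```python
# Invented (d, n) pairs from five hypothetical primary studies.
studies = [(0.42, 30), (0.61, 24), (0.15, 80), (0.58, 45), (0.33, 60)]

total_n = sum(n for _, n in studies)
mean_d = sum(d * n for d, n in studies) / total_n                 # weighted mean effect
var_d = sum(n * (d - mean_d) ** 2 for d, n in studies) / total_n  # weighted variance

print(round(mean_d, 3), round(var_d, 4))
```

The variance term is what licenses the broader, conceptual step: a large between-study variance signals that the effect is moderated by study features worth integrating substantively.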
between two continuous scales (i.e., interval or ratio) or to predict one of those
scales from the other. However, more advanced statistical analyses offer consider-
ably more flexibility. For instance, multiple regression (see Jeon, Chapter 7 in this
volume) allows you the possibility of predicting one dependent variable from
multiple continuous and/or categorical independent variables. Discriminant func-
tion analysis (see Norris, Chapter 13 in this volume) makes it possible to predict
a categorical variable from multiple continuous variables (or more accurately, to
determine the degree to which the continuous variables correctly classify mem-
bership in the categories). Logistic regression makes it possible to predict a categori-
cal variable such as group membership from categorical or continuous variables,
or both. Loglinear modeling can be applied to purely categorical data to test the
fit of a regression-like equation to the data. For excellent coverage of all of these
forms of analysis, see Tabachnick and Fidell (2013).
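As a concrete illustration of the last point about logistic regression, a bare-bones model can be fit with nothing more than gradient ascent on the log-likelihood. Everything below is invented for illustration and uses only the Python standard library; in practice one would of course use SPSS, R, or a comparable package:

```python
import math

# Invented data: predict a pass/fail outcome (1/0) from a proficiency score.
scores = [35, 42, 48, 50, 55, 60, 64, 70, 78, 85]
passed = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Standardize the predictor so a single learning rate behaves well.
m = sum(scores) / len(scores)
sd = (sum((x - m) ** 2 for x in scores) / len(scores)) ** 0.5
z = [(x - m) / sd for x in scores]

# Gradient ascent on the logistic log-likelihood for intercept b0 and slope b1.
b0 = b1 = 0.0
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(z, passed):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += y - p          # gradient with respect to b0
        g1 += (y - p) * x    # gradient with respect to b1
    b0 += 0.05 * g0
    b1 += 0.05 * g1

def predict(score):
    """Model-predicted probability of a 'pass' for a raw score."""
    return 1 / (1 + math.exp(-(b0 + b1 * (score - m) / sd)))

print(round(predict(65), 2))
```

The fitted slope is positive, so predicted pass probabilities rise with proficiency, which is exactly the kind of categorical-from-continuous prediction described above.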
Other advanced statistical procedures provide the flexibility to look beyond
simple relationships to patterns in relationships. For example, instead of look-
ing at a correlation coefficient or a matrix of simple correlation coefficients, it
is possible to examine patterns in those correlation coefficients by performing
factor analysis, which can reveal subsets of variables in a larger set of variables that
are related within subsets, yet are fairly independent between subsets. The three
types of factor analysis (principal components analysis, factor analysis, and con-
firmatory factor analysis; see Chapter 9 in this volume for Loewen and Gonulal’s
explanation of the differences) can help you understand the underlying pattern
of relationships among your variables, and thereby help you to: (a) determine
which variables are redundant and therefore should be eliminated (as described
earlier); (b) decide which variables or combination of variables to use in sub-
sequent analyses; and (c) item-analyze, improve, and/or validate your measures.
In contrast, cluster analysis is a “multivariate exploratory procedure that is used
to group cases (e.g., participants or texts). Cluster analysis is useful in studies
where there is extensive variation among the individual cases within predefined
categories” (Staples & Biber, Chapter 11 in this volume, p. 243). Also useful is
multiway analysis, which can help you study the associations among three or
more categorical variables (see Tabachnick & Fidell, 2013 for more on multiway
analysis).
Another form of analysis that provides you with considerable flexibility is structural equation modeling (SEM). SEM combines ideas that underlie many of the other forms of analysis discussed
here, but can additionally be used to model theories (a) to investigate if your data
fit them, (b) to compare that fit for several data sets (e.g., for boys and girls), or
(c) to examine changes in fit longitudinally.
With regard to means comparisons, mixed effects models (see Cunnings &
Finlayson, Chapter 8 in this volume), which by definition are models that include
both fixed and random effects, are flexible enough to be used with data that are
normally distributed or that are categorical (i.e., nonnumeric). In addition, mixed
effects models are especially useful when designs are unbalanced (i.e., groups
have different numbers of participants in each) or have missing data. Importantly,
if you are studying learning over time, these models can accommodate repeated
measures in longitudinal studies.
The conceptual difference between null hypothesis testing and the Bayes-
ian alternative is that predictions about mean differences are stated a priori
in a hierarchy of differences as motivated by theory-driven claims. . . . In this
approach, the null hypothesis is typically superfluous, as the researchers aim
to confirm that the predicted order of mean differences are instantiated in
the data. Support for the hierarchically ordered means hypothesis is evident
only if the predicted order of mean differences is observed. The predicted
and plausible alternative hypotheses thus must be expressed in advance of
the data analysis—thus making the subsequent ANOVA confirmatory.
(Mackey & Ross, Chapter 14 in this volume, p. 334)
Clearly, this advanced alternative form of analysis not only provides a means for
examining data hierarchically and with consideration to previous findings and/
or theoretical predictions, but in fact, it also demands that the data be examined
in that way from the outset.
Additional Assumptions
Another disadvantage of the more advanced statistical procedures is that they
tend to require that additional assumptions be met. Where a simple correlation
coefficient will have three assumptions, a multiple regression analysis will have
at least five assumptions, two of which will require the data screening discussed
in the next paragraph. In addition, whereas for univariate statistics a good deal
is known about the robustness of violating assumptions (e.g., it is known that
ANOVA is fairly robust to violations of the assumption of equal variances if
the cell sizes are fairly similar), less is known about such robustness in the more
complex designs of advanced statistical procedures. For a summary of assumptions
underlying univariate and some multivariate statistics, see Brown (1992), or for
multivariate statistics, see the early sections of each of the chapters in Tabachnick
and Fidell (2013).
results, a higher probability of finding significant results if they exist, more power-
ful results, and ultimately to more credible results in your own mind as well as in
the minds of your readers.
Additional Assumptions
Checking the more elaborate assumptions of advanced statistical tests forces you
to slow down at the beginning of your analyses and think about the descriptive
statistics, the shapes of the distributions involved, the reliability of various mea-
surements, the amounts of variance involved and accounted for, the degrees of
redundancy among variables, any univariate or multivariate outliers, and so forth.
Ultimately, all of this taken together with the results of the study can and should
lead to greater understanding of your data and results.
analyses is not all bad. Indeed, it can lead you to exciting places you never thought
you would go.
Conclusion
In writing this chapter, I wrestled with using the word advantages. Perhaps it is bet-
ter to think about the advanced procedures described here as opening up options
rather than as having advantages—but then it occurred to me that people with
those options will have distinct advantages, so I stayed with the idea of advantages.
That is not to say that using advanced statistics, especially multivariate analyses,
for every study will be the best way to go. For example, I once had a student who
hated statistics so much that he set out to write a paper that used only descriptive
statistics and a single t-test, and he did it, writing an elegant, straightforward, and
interesting paper. Simple as it was, he was using exactly the right tools for that
research project.
However, learning new, advanced statistical techniques can help you to stay
interested and up-to-date in your research. Having multiple options can also help
you avoid getting stuck in a statistical rut. For instance, I know of one researcher
in our field who clearly learned multiple regression (probably for her disserta-
tion) and has used that form of analysis repeatedly and almost exclusively across
a number of studies. She is clearly stuck in a statistical rut. She is holding a ham-
mer, so she uses it for everything, including screws. I just wish she would extend
her knowledge to include some other advanced statistical procedures, especially
extensions of regression like factor analysis or SEM.
The bottom line here is that advanced statistics like those covered in this book
can be useful and even exciting to learn, but the harsh reality is that these forms
of analysis will mean nothing without good ideas, solid research designs, reliable
measurement, sound data collection, adequate data screening, careful checking of
assumptions, and comprehensive interpretations that include all facets of the data,
their distributions, and all of the statistics in the study.
Fortunately, you have this book in your hands. I say fortunately because this col-
lection of chapters is a particularly good place for L2 researchers to start expanding
their knowledge of advanced statistical procedures: It covers advanced statistical
techniques; it was written by L2 researchers; it was written for L2 researchers; and
it contains examples drawn from L2 research.
Good researching!
References
Brown, J. D. (1990). The use of multiple t tests in language research. TESOL Quarterly,
24(4), 770–773.
Brown, J. D. (1992). Statistics as a foreign language—Part 2: More things to look for in read-
ing statistical language studies. TESOL Quarterly, 26(4), 629–664.
Brown, J. D. (2007). Statistics Corner. Questions and answers about language testing sta-
tistics: Sample size and power. Shiken: JALT Testing & Evaluation SIG Newsletter, 11(1),
31–35. Also retrieved from http://www.jalt.org/test/bro_25.htm
Brown, J. D. (2008a). Statistics Corner. Questions and answers about language testing statis-
tics: Effect size and eta squared. Shiken: JALT Testing & Evaluation SIG Newsletter, 12(2),
36–41. Also retrieved from http://www.jalt.org/test/bro_28.htm
Brown, J. D. (2008b). Statistics Corner. Questions and answers about language testing statistics: The Bonferroni adjustment. Shiken: JALT Testing & Evaluation SIG Newsletter, 12(1),
23–28. Also retrieved from http://www.jalt.org/test/bro_27.htm
Brown, J. D. (2011a). Statistics Corner. Questions and answers about language testing sta-
tistics: Likert items and scales of measurement. Shiken: JALT Testing & Evaluation SIG
Newsletter, 15(1), 10–14. Also retrieved from http://www.jalt.org/test/bro_34.htm
Brown, J. D. (2011b). Statistics Corner. Questions and answers about language testing sta-
tistics: Confidence intervals, limits, and levels? Shiken: JALT Testing & Evaluation SIG
Newsletter, 15(2), 23–27. Also retrieved from http://www.jalt.org/test/bro_35.htm
Mueller, R. O., & Hancock, G. R. (2008). Best practices in structural equation modeling.
In J. Osborne (Ed.). Best practices in quantitative methods (pp. 488–508). Thousand Oaks,
CA: Sage.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Boston: Pearson.
Ullman, J. B. (2006). Structural Equation Modeling: Reviewing the basics and moving
forward. Journal of Personality Assessment, 87(1), 35–50.
PART II
Enhancing Existing
Quantitative Methods
3
STATISTICAL POWER, P VALUES,
DESCRIPTIVE STATISTICS, AND
EFFECT SIZES
A “BACK-TO-BASICS” APPROACH
TO ADVANCING QUANTITATIVE
METHODS IN L2 RESEARCH
Luke Plonsky
Introduction
Methodologically speaking, a great deal of quantitative L2 research has been mis-
guided. All too often we have been asking the wrong questions of our data. Con-
sequently, many of the answers we have derived have been, at best, weak in their
ability to inform theory and practice and, at worst, wrong or misleading. This
chapter seeks to reorient the field toward more appropriate kinds of questions
and analytical approaches. More specifically, I argue here against the field’s flawed
use and interpretation of statistical significance and, instead, in favor of more
thorough consideration of descriptive statistics including effect sizes and confi-
dence intervals (CIs). The approach I advocate in this chapter is not only more
basic, statistically speaking, and more computationally straightforward, but it is also
inherently more informative and more accurate when compared to the most fun-
damental and commonly used analyses such as t tests, ANOVAs, and correlations.
I begin the chapter with a model that describes quantitative L2 research as cur-
rently practiced, pointing out major flaws in our approach. I then review major weak-
nesses of relying on statistical significance ( p values), particularly in the case of tests
comparing means (t tests, ANOVAs) and correlations. I follow this discussion with a
brief introduction to the notion of statistical power, followed by guides to calculating
and using effect sizes and other descriptive statistics including CIs. I conclude with a
revised/proposed model of what quantitative L2 research might look like if we were
to embrace this approach. Points made throughout the discussion are illustrated with
data-based examples, many of which can be replicated using the practice data set that
accompanies this chapter (http://oak.ucc.nau.edu/ldp3/AQMSLR.html). Unlike
much of the remainder of this book, the statistical issues in this chapter are very
simple. Nevertheless, these ideas largely go against what is often taught in introduc-
tory research methods courses and certainly what is found in most L2 journals.
Before beginning the main discussion, I also want to emphasize that the con-
cepts and procedures in this chapter, though far from mainstream L2 research
practice, are central to a set of methodological reforms currently gaining traction
in the field. Among other issues, this movement has sought to (a) encourage rep-
lication research (Porte, 2012), (b) promote a synthetic ethic in primary as well as
secondary research (e.g., Norris & Ortega, 2000, 2006; Oswald & Plonsky, 2010;
Plonsky & Oswald, Chapter 6 in this volume), (c) critically reflect on and exam-
ine methodological practices and self-efficacy (e.g., Larson-Hall & Plonsky, 2015;
Loewen et al., 2014; Plonsky, 2013, 2014), and (d) introduce novel analytical tech-
niques (e.g., Cunnings, 2012; Larson-Hall & Herrington, 2010; LaFlair, Egbert, &
Plonsky, Chapter 4 in this volume; Plonsky, Egbert, & LaFlair, in press). Taking
yet another step back, it is also worth noting that, although many of the concepts
and techniques embodied by this movement and discussed in this chapter may be
unfamiliar to L2 researchers, they have been recognized for decades as the pre-
ferred means to conducting basic quantitative research among methodologists in
other social sciences such as psychology and education.
FIGURE 3.1 [flowchart; first step: “Conduct a study (e.g., the effects of A on B)”]
Once the data are collected and analyzed using, for example, a t-test or Pearson
correlation, most researchers will take special note of the p value associated with
the results of those tests. As depicted in Figure 3.1, if on one hand the p value is
larger than .05, the difference between groups or the correlation is often consid-
ered uninteresting and is discarded, and another study might be run to attempt to
achieve a statistically significant result. On the other hand, if the t-test or correlation yields a statistically significant result (i.e., p < .05), it is considered important
and is much more likely to get published and to consequently have an impact on
L2 theory, future research, and practice.
In this model, which is, again, the dominant approach in quantitative L2
research, researcher perception and dissemination of study results both hinge
critically on our adherence to null hypothesis significance testing (NHST). As
I describe in the remainder of this section, this approach is deeply flawed on many
accounts, both conceptually and statistically. I focus here, though, on three main
arguments: (a) NHST is unreliable, (b) NHST is crude and uninformative, and
(c) NHST is arbitrary. Among the many other, more comprehensive accounts
of the inherent flaws in NHST, I recommend Kline (2013, Chapter 3), Norris
(in press), and Cumming (2012, Chapter 2).
NHST Is Unreliable
The first major flaw of NHST is that it is unreliable. More specifically, because
p values vary as a function of sample size, any correlation or difference in mean
scores can reach statistical significance, given a large enough sample. Consider
the (fabricated) data from three studies in Tables 3.1–3.3, each of which, let’s say,
is interested in comparing the effects of traditional (Group 1) with experimen-
tal (Group 2) approaches to teaching vocabulary. A t-test comparing the means
in Study 1 found no difference between the two groups, which each have five
participants.
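The sample-size dependence can be seen in the t statistic itself: with the group means and standard deviations held constant, t grows with the square root of the group size, so the same descriptive result drifts toward “significance” as n increases. A brief sketch with fabricated descriptives (these are not the values in Tables 3.1–3.3):

```python
import math

def t_stat(m1, m2, sd, n):
    """Independent-samples t for two groups of equal size n and equal SDs."""
    return (m1 - m2) / math.sqrt(2 * sd ** 2 / n)

# Identical means (12 vs. 14) and SDs (3); only the group size changes.
for n in (5, 15, 45):
    print(n, round(t_stat(12, 14, 3, n), 2))
```

From n = 5 to n = 45 the absolute value of t exactly triples (√9 = 3), while Cohen’s d, which does not depend on n, stays at 0.67 throughout.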
Study 2 collected data from 15 participants in each condition. Although their
means and standard deviations were identical to those in Study 1, the p value in
results to a yes/no dichotomy, often overlooking or even ignoring the rich infor-
mation provided by our descriptive statistics. By doing so, we waste our data and
we fail to accurately or informatively advance L2 theory, research, and practice.
To be sure, p values tell us nothing about (a) replicability, (b) theoretical or
practical importance, or, perhaps most importantly, (c) magnitude of effects. A p
value of greater than .05 does not necessarily indicate that there is no difference
between two group means or even that there is a small difference between two
group means. Nevertheless, many researchers interpret it that way, falling prey to
what Cumming (2012) calls the “slippery slope of nonsignificance” (p. 31). Like-
wise, very small p values can certainly correspond to small effects.
To illustrate the lack of informational value provided by p values, consider
the following examples from published L2 studies. In one study published
recently the authors present the results of a t-test comparing the “ideal L2 self ”
ratings for high- (M = 4.65, SD = 1) and low-motivation (M = 4.56, SD = 1.1)
learners. The t-test yielded a nonstatistically significant p value, indicating no
difference between the two groups. This result, to be expected given the very
similar descriptives, was confirmed by a very small eta-squared effect size of
.002, which we can understand to mean that group membership (i.e., high vs.
low motivation) explains less than 1% of the variance in ideal self ratings. In the
same table, the authors present the results of another t-test comparing the same
two groups on ought-to self ratings. The mean score was 3.74 (SD = 1.1) for
the high-motivation group and 3.96 (SD = 1) for the low-motivation group. In
this case, however, the t-test revealed a statistically significant difference between
the groups. Are we then to interpret the difference between groups here to be
large or important? The eta-squared value for this contrast was just .01, indicat-
ing that group membership could explain 1% of the variance in group means.
From a dichotomous NHST perspective, one of these tests reveals an important
difference in group means and the other does not. From the perspective of
practical significance based on the effect size and other descriptive statistics, it
is clear that the two groups are nearly identical. (See results related to Table 4
in Mackey & Sachs, 2012, for a counterexample wherein the authors correctly
interpret substantial correlations despite the nonstatistical p values associated
with them.)
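The relationship between these descriptives and the reported eta-squared values can be checked directly. As a rough sketch (assuming roughly equal group sizes, for which eta-squared ≈ d²/(d² + 4)), the following recomputes both effects from the means and standard deviations reported above:

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Standardized mean difference using the pooled SD (equal-n case)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

def eta_squared_from_d(d):
    """Approximate eta-squared for two equal-sized groups: d^2 / (d^2 + 4)."""
    return d ** 2 / (d ** 2 + 4)

# Ideal L2 self: high- (M = 4.65, SD = 1) vs. low-motivation (M = 4.56, SD = 1.1)
d_ideal = cohens_d(4.65, 1.0, 4.56, 1.1)
print(round(eta_squared_from_d(d_ideal), 3))  # 0.002, matching the reported value

# Ought-to self: low- (M = 3.96, SD = 1) vs. high-motivation (M = 3.74, SD = 1.1)
d_ought = cohens_d(3.96, 1.0, 3.74, 1.1)
print(round(eta_squared_from_d(d_ought), 2))  # 0.01, matching the reported value
```

Either way, group membership accounts for about 1% or less of the variance, regardless of which p value happened to cross the .05 threshold.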
Consider as well the results in Table 3.4 which were extracted from nine
primary studies in Taylor, Stevens, and Asher’s (2006) meta-analysis of the effects
of reading strategy instruction. Three distinct patterns of results can be observed
in this sample, each of which reveals the crudeness of p. First, although the
means being compared in studies A–E were not found to be statistically signifi-
cant, their effect sizes (Hedges’ g, which expresses mean difference in standard
deviation units, similar to Cohen’s d ) were substantial—certainly more than
the null effect we might interpret based on a nonstatistical p value. These effect sizes were, in fact, almost identical to, if slightly larger than, those in studies F and G, whose p values did fall below .05.
28 Luke Plonsky
TABLE 3.4 Sample sizes, effect sizes, and p values from nine studies of reading strategy instruction*

Study    n1    n2    Hedges' g     p
A        12    15     −.555      .152
B         8     8      .556      .259
C        30    29      .492      .060
D        24    21      .553      .066
E        21    22      .472      .123
F        78    80      .481      .003
G       183    61      .530      .000
H        29    14     −.251      .436
I        12    14     −.292      .450

*Results from Taylor et al. (2006)
NHST Is Arbitrary
Students in introductory research methods courses often ask what is so special
about the .05 level of statistical significance. The answer, of course, is nothing—a
sentiment Rosnow and Rosenthal (1989) had in mind when they quipped, “surely, God loves the .06 nearly as much as the .05” (p. 1277). Nevertheless, much of the field lives (or at least publishes) according to an arbitrary standard for importance.
To summarize the discussion thus far, quantitative L2 research relies very
strongly on an analytical approach that is unreliable and arbitrary. Even if
NHST-based findings were stable and principled, results based on this approach
would still fail to provide us any indication of the kinds of information we are
most interested in or that can guide L2 theory and practice. Consequently, unless
we are content to attempt to advance our field in this fashion (i.e., based on arbi-
trary, unreliable, yes/no-only results), we must change our approach (see Norris,
in press).
A “Back-to-Basics” Approach 29
Statistical Power
A closely related notion, statistical power, is the probability of observing a statistically significant relationship given that the null hypothesis is false (e.g., d ≠ 0; r ≠ 0). The more powerful the study, the lower the likelihood of false negatives. An
understanding of power can also be used to answer the very practical and frequent
question of “How many participants do I need (to detect statistical significance)?”
(That is, assuming we are still interested in statistical significance.)
The conventionally desired level of statistical power in the social sciences is .80
which, when achieved, provides the researcher with an 80% chance of detecting
a statistical relationship if present (Cohen, 1992). (Note that the .80 convention
for avoiding false negatives is much more liberal than the typical safeguard for
avoiding false positives of .05. In the former, we implicitly accept an error rate of
20%; in the latter the accepted error rate is theoretically only 5%.) But how can
we determine if .80 power is possible? As with statistical significance, power varies
as a function of the effect size and sample size such that, given a larger anticipated
effect (e.g., d ≈ 1), a smaller sample will be able to detect a statistical relationship
80% of the time (N ≈ 35). Likewise, when a small effect (e.g., d ≈ .2) is expected
based on theoretical predictions and/or previous research, a larger sample
(N ≈ 400) is needed in order to have an 80% chance of finding the effect at the
.05 level.1
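These required sample sizes can be sketched with a normal-approximation shortcut (an assumption on my part; dedicated tools such as G*Power use exact noncentral distributions and will return slightly larger values). For a two-tailed, two-sample comparison of means:

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-tailed, two-sample comparison of means,
    via the normal approximation: n = 2 * ((z_{1-alpha/2} + z_power) / d)^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-tailed
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

print(n_per_group(1.0))  # large effect: 16 per group under this approximation
print(n_per_group(0.2))  # small effect: 393 per group under this approximation
```

Note that these are per-group figures; exact power routines typically add a participant or two per group.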
A related exercise and consideration might be to estimate the statistical power
in previous L2 research, much of which relies necessarily on small samples. Plon-
sky and Gass (2011) examined this issue by means of a post hoc power analysis for
174 studies in the interactionist tradition of L2 research. Their results show that
this subdomain has had, on average, just a 56% chance of obtaining statistically
significant results. Likewise, looking at 606 primary studies across many different
subdomains of L2 research, Plonsky (2013) found average post hoc power at just
.57. These results can be interpreted as indicating that the likelihood of observing
expected relationships is, on average, comparable to tossing a coin and hoping for
heads.
Evidence of what I refer to as the “power problem” (Plonsky, 2013, p. 678) in
L2 research does not stop there. Additional indications include (a) extremely rare
use of power analyses in order to inform sampling decisions, (b) generally small
samples / high sampling error, (c) heavy reliance on NHST, (d) presence of non-
normal distributions and a lack of checking for statistical assumptions, and (e) rel-
atively infrequent use of multivariate statistics that can preserve experiment-wise
power (Plonsky, 2013).
One step toward addressing this problem is to determine sample sizes based on
a priori power analyses, rather than simply based on convenience or convention.
Using free software such as G*Power (Faul, Erdfelder, Lang, & Buchner, 2007) or
any number of freely available online calculators designed for this purpose, you
can calculate the sample size needed for a given level of statistical power such as
.80. The only information you need to bring to the equation is the anticipated
effect size. One source for obtaining this value would be a meta-analysis on a
topic closely related to that of the study. In the absence of a relevant meta-analytic
effect size, you could also plug into the equation the effect size from one or more
studies on a closely related topic.
At this point I should recognize that in some instances it is not possible to col-
lect data from a sample large enough to obtain an ideal level of statistical power.
For example, researchers who study learners of less commonly taught languages
may find it difficult to obtain large samples. Similarly, funding may not be available to pay as many participants as are needed for adequate power. These problems
are further compounded in cases where the anticipated effect size is small, thus
necessitating a larger sample. In such cases, I recommend taking one or more of
the three following courses of action. First, when you know that a study lacks
statistical power, you should avoid the use of statistical testing. Focus instead on
the descriptives, including effect sizes and CIs (see discussion below). Second, in
addition to avoiding tests of statistical significance, underpowered studies should
also address fewer contrasts between or among groups. For example, if you expect to be able to recruit only 35 participants, divide them into two groups/conditions rather than four. The remaining two conditions can then be compared to each other, and to the first two, in a subsequent study. Third, you
could bootstrap the analyses or statistics of interest based on the available data/
sample (see Larson-Hall & Herrington, 2010; Plonsky et al., in press; LaFlair et al.,
Chapter 4 in this volume).
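Bootstrapping is taken up in detail in Chapter 4; as a minimal sketch (the scores below are made up purely for illustration), a percentile bootstrap CI for a mean difference can be obtained by resampling each group with replacement:

```python
import random

def bootstrap_mean_diff_ci(group1, group2, reps=5000, conf=0.95, seed=42):
    """Percentile bootstrap CI for the difference between two group means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        resample1 = rng.choices(group1, k=len(group1))  # sample with replacement
        resample2 = rng.choices(group2, k=len(group2))
        diffs.append(sum(resample1) / len(resample1)
                     - sum(resample2) / len(resample2))
    diffs.sort()
    lo = diffs[int(reps * (1 - conf) / 2)]
    hi = diffs[int(reps * (1 + conf) / 2) - 1]
    return lo, hi

# Hypothetical scores for two small groups (observed mean difference = 3.75)
treatment = [14, 18, 16, 22, 19, 15, 21, 17]
control = [12, 15, 13, 16, 14, 12, 17, 13]
low, high = bootstrap_mean_diff_ci(treatment, control)
print(round(low, 2), round(high, 2))
```

The width of the resulting interval, rather than a yes/no verdict, then communicates how precisely the small sample has pinned down the difference.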
However, even if we were able to adequately address the multifaceted “power
problem” in L2 research, we would still be relying on the flawed notion of statis-
tical significance. More specifically, a proper understanding and use of statistical
power can help the field overcome, at least in part, the unreliability of NHST.
The other problems, however, remain. Consider Cumming’s (2012) comments
on this issue:
I’m ambivalent about statistical power. On the one hand, if we’re using
NHST, power is a vital part of research planning . . . On the other hand,
power is defined in terms of NHST, so if we don’t use NHST we can
ignore power and instead use precision for research planning . . . However,
I feel it’s still necessary to understand power . . . partly to understand NHST
and its weaknesses. . . . although I hope that, sometime in the future, power
will need only a small historical mention.
(p. 321)
To be clear, I am not suggesting that sample size does not matter. Larger sam-
ples will yield less sampling error and, thus, greater precision in our results. The
point here, though, is that the notion of statistical power as a means to reliably
detect small p values is only relevant within the (flawed) NHST framework. As an
alternative, I argue in the next section that thorough use of descriptive statistics,
including effect sizes and CIs, can and should replace much of the statistical test-
ing in L2 research.
Effect Sizes
The focus up to this point in the chapter has been somewhat negative. I have
essentially been describing problematic trends and practices in the field. In this
section I describe a way forward that helps us to address and improve on these
practices by relying on effect sizes in place of NHST. In doing so, I want to
address three fundamental questions: (a) What are effect sizes, and how do we
calculate them? (b) Why should we use effect sizes? (That is, how is this approach
an improvement on current quantitative data practice?) (c) How can we interpret
effect sizes?
d = (M1 − M2) / SD
The difference between means (the numerator) is divided by the pooled stan-
dard deviation or that of a control or baseline group, depending on whether the
groups have equal variance (see Cumming, 2012). This calculation can be done by
hand, but there are also numerous online calculators and Microsoft Excel macros
developed for this purpose. (Unfortunately and inexplicably, SPSS does not cur-
rently provide Cohen’s d in the output from tests comparing mean scores.) I often
use the calculator developed by David B. Wilson that can be downloaded freely
here: http://mason.gmu.edu/~dwilsonb/downloads/ES_Calculator.xls. Figure 3.2
shows how user-friendly macros such as this one are. The user simply enters the
groups’ means, standard deviations, and sample sizes. The effect size here is d = .85,
which is based on the sample data I used earlier to show the unreliability of
p values. A similar calculator freely available through the Centre for Evaluation and
Monitoring is also available here: http://www.cem.org/effect-size-calculator. This
calculator has the added advantage of providing CIs around the d value. We can
see in Figure 3.3, for example, that the standardized mean difference, which we
observed at .85, is likely between .41 and 1.27 in the population. Finally, Hedges’ g,
a variant of Cohen’s d, also expresses mean differences and is useful in that it applies
a correction for biased effects due to small samples, which are often found in L2
research (Plonsky, 2013).
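The same arithmetic these calculators carry out is easy to reproduce directly. The sketch below uses the standard size-weighted pooled-SD formula, the usual small-sample correction for Hedges' g, and the common large-sample approximation of the standard error of d for the CI (the descriptives are hypothetical):

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d with a pooled standard deviation weighted by group sizes."""
    pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                       / (n1 + n2 - 2))
    return (m1 - m2) / pooled

def hedges_g(d, n1, n2):
    """Hedges' g: Cohen's d corrected for small-sample bias."""
    return d * (1 - 3 / (4 * (n1 + n2) - 9))

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for d via its large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical descriptives for two groups of 25 learners each
d = cohens_d(m1=80.0, sd1=10.0, n1=25, m2=72.0, sd2=10.0, n2=25)
print(round(d, 2))                    # 0.8
print(round(hedges_g(d, 25, 25), 2))  # 0.79 -- slightly smaller than d
print([round(x, 2) for x in d_confidence_interval(d, 25, 25)])  # [0.22, 1.38]
```

With groups this small, the CI around d is wide, which is precisely the information a bare p value hides.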
Though not often viewed this way, correlations such as Pearson’s r are another
type of effect size. This index, which ranges from −1 to +1, is likely very familiar
FIGURE 3.3 Screenshot of effect size calculator for Cohen’s d with CIs
FIGURE 3.6 Output for linear regression with CIs for correlation
you need to run the ANOVA through the General Linear Model drop-down
menu: Analyze > General Linear Model > Univariate. This procedure will
produce an ANOVA. To request an eta-squared value as part of the output, click
the Options button and check the box for Estimates of effect size. An eta-squared
value will then be provided in the column labeled as such. Note also that this
value for the overall result (“Corrected model”) will be identical to the R2 value
provided as a footnote underneath the output (another remnant of the fact that
ANOVA is actually a type of regression, falling under the larger family of general
linear models; see Cohen, 1968).
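That equivalence reflects how eta-squared is defined: the between-groups sum of squares divided by the total sum of squares. A minimal sketch with made-up data:

```python
def eta_squared(groups):
    """Eta-squared for a one-way design: SS_between / SS_total."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    # Between-groups SS: each group's squared deviation from the grand mean,
    # weighted by group size
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand_mean) ** 2
                     for g in groups)
    return ss_between / ss_total

# Three hypothetical groups of scores
groups = [[3, 4, 5, 4], [5, 6, 7, 6], [4, 5, 6, 5]]
print(round(eta_squared(groups), 3))  # 0.571
```

Because SS_between / SS_total is exactly R² for the corresponding regression, this hand computation will match the value SPSS reports for the "Corrected model" row.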
There are several additional types of effect size indices for different types of data
and analyses. For categorical or frequency data, researchers may turn to phi and Cramér's V. Another option for categorical data is a simple percentage. Though
not traditionally regarded as an effect size, percentages certainly comply with our
earlier definition and, more importantly, they are very easy both to calculate and to
interpret. A final effect size commonly used with categorical data is the odds ratio.
This index, which compares the odds of a (binary) outcome across two conditions, is particularly useful in conjunction with logistic regression.
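As a quick sketch with a hypothetical 2 × 2 table (say, pass/fail counts under two instructional conditions), the odds ratio is simply the ratio of the two conditions' odds:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table:
                 outcome+  outcome-
    condition 1     a         b
    condition 2     c         d
    """
    return (a / b) / (c / d)

# Hypothetical counts: 30 of 40 pass under condition 1; 15 of 40 under condition 2
print(round(odds_ratio(30, 10, 15, 25), 2))  # 5.0
```

Here the odds of passing are five times greater under the first condition, an interpretation that maps directly onto the exponentiated coefficients of a logistic regression.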
TABLE 3.5 General benchmarks for interpreting d and r effect sizes in L2 research
Indeed, there are a number of additional factors that merit consideration when
interpreting effect sizes. Most critically, researchers must provide an explanation
of what the particular numerical effects they observe mean in the context of their
domain. Others factors, discussed at length in Plonsky and Oswald (2014), include
(a) effects found in previous studies in the same subdomain; (b) mathematical
readings of effect sizes (see Plonsky & Oswald, 2014, pp. 893–894); (c) theoreti-
cal and methodological maturity of the domain in question; (d) research setting
(e.g., lab vs. classroom); (e) practical significance; (f ) publication bias in previous
research; (g) psychometric properties and artifacts; and (h) other methodological
features.
1. Calculate the mean score by typing in the following in the first empty cell at
the bottom of the column of data you are interested in: =AVERAGE(X:Y),
where X and Y refer to the top and bottom cells of data (be sure to exclude
any header rows).
2. In the cell immediately below the mean score, calculate the standard deviation for the set of scores: =STDEV(X:Y), where X and Y are the same as in step 1.
3. In the cell immediately below the standard deviation, calculate the interval
that will be added and subtracted from the mean score to construct the
CI: =CONFIDENCE.NORM(alpha,SD,N). The alpha field here is usually
.05, corresponding to a 95% CI, but could easily be adjusted; for a 90% CI,
for example, this value would be .1. In the SD field of this formula, simply
type in the name of the cell where that value was calculated in step 2 (e.g.,
U55). And the N field refers to the number of data points/cases/observations
in the sample.
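The half-width Excel returns in step 3 is z × SD / √N, where z is the critical value for the chosen alpha. A sketch of the equivalent computation in Python (the SD used here is assumed for illustration; it is not reported in the conference-abstract example below):

```python
import math
from statistics import NormalDist

def ci_half_width(alpha, sd, n):
    """Equivalent of Excel's =CONFIDENCE.NORM(alpha, SD, N):
    the margin added to and subtracted from the mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g., 1.96 for a 95% CI
    return z * sd / math.sqrt(n)

# Hypothetical values: mean of 3.64 on a 1-5 scale, assumed SD of .65, N = 287
m, sd, n = 3.64, 0.65, 287
half = ci_half_width(0.05, sd, n)
print(round(m - half, 2), round(m + half, 2))  # 3.56 3.72
```

As with the Excel route, the interval narrows as N grows, since the half-width shrinks with √N.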
There are many ways to interpret CIs (see Cumming, 2012), but their primary
purpose is to help us situate mean scores in the context of the many other possible
values that might represent the true population score (as opposed to that of the
sample). As Carl Sagan (1996) put it, CIs are “a quiet but insistent reminder that
no knowledge is complete or perfect” (pp. 27–28). As with standard deviations,
considering the CIs around our mean scores, numerically and/or visually, helps
us avoid the temptation to view our samples and their mean scores as absolute.
In the case of abstract ratings for this particular L2 research conference, we can
see in Figure 3.7 that the mean score is 3.64 (on a scale of 1–5) with 95% CIs of
[3.56, 3.71]. (CIs are typically reported in brackets.) The width of the interval is
quite narrow, which is likely due to the relatively large sample (N = 287). Con-
sequently, assuming these data are based on a valid and reliable instrument, we
can be fairly confident that our point estimate of 3.64 is very close to the true
population mean for scores at this conference.
CIs can also be used to indicate whether the difference between a pair of mean
scores is statistically significant and whether that difference is stable. This infor-
mation is also quite easy to access: We simply check to see whether the mean of
one group falls within or outside the CI for the other group’s mean. We can try
this out using the abstract data set. Let overall score here be the dependent vari-
able and let the presence of one or more errors be a dichotomous independent
FIGURE 3.7 Output for descriptive statistics produced through Explore in SPSS
variable. The menu sequence using SPSS is, again, Analyze > Descriptive Sta-
tistics > Explore. This time, however, we will move the “Errors” variable into
the “Factor list” box. As we can see in Figure 3.8, the mean score for the “no
errors” group (3.68) does not fall within the CI for the “error(s) present” group
[3.23, 3.60] and vice versa, thus indicating that the difference between these two
means is statistically different. We can also calculate the effect size for the differ-
ence between these groups using one of the tools described earlier: d = .40.
Though it is not strictly necessary, we could confirm this result by running
an independent samples t test, which would produce a t value of 2.62 with an
associated p value of .009. An advantage to following up our analysis based on
CIs with a t test is that the SPSS output will also provide a CI around the mean
difference, which can help us better understand how stable it is. In this particular
case, the mean difference between the two groups is .26, and the CI associated
FIGURE 3.8 Descriptive statistics and CIs for abstracts with vs. without errors
with that difference is [.07, .46]. Yet another confirmation of the statistical difference between these mean scores is that the CI around the mean difference does not cross 0. What is perhaps more interesting is to note that the CI
is somewhat narrow, indicating that our point estimate for the difference (.26) is
rather stable and reliable. If the CI had been much larger relative to the five-point
scale, say [.20, 3.9], we would have less certainty—that is, confidence—in our
observed mean difference. For a number of worked examples and practice inter-
preting CIs, see Cumming (2012) and, in the context of L2 research, Larson-Hall
and Plonsky (2015, p. 135).
Finally, it is not sufficient to simply calculate and examine a full set of descrip-
tive statistics when analyzing quantitative data. Such results also need to be made
available in published reports and/or appendices to justify interpretations and to
enable consumers of L2 research to draw their own conclusions as well. More
complete reporting of data also assists in meta-analyses and other synthetic efforts.
For these reasons and in line with the APA (2010), all mean-based analyses should
be reported, at a minimum, with their associated means, standard deviations, CIs,
and effect sizes (again, see Larson-Hall & Plonsky, 2015).
Looking Forward
The impetus behind this chapter—the entire volume, really—is to improve and
advance L2 research practices. Toward that end, I’d like to propose a revised model
of L2 research (Figure 3.9) both as a point of contrast with the descriptive model
in Figure 3.1 and as a suggestion for how our individual and collective research
efforts ought to proceed.
FIGURE 3.9 Revised model of L2 research (first step: “Conduct a study (e.g., the effects of A on B)”)
As with the model currently in place, the process begins when a researcher
conducts a study. Unlike the current model, however, assuming the study is well
designed, the importance of the study’s findings and its likelihood of getting pub-
lished do not hinge on the flawed notion of statistical significance. Rather, both
statistical and practical significance are considered and interpreted, and the results
of the study and others in the domain are brought together via research synthesis
and meta-analysis. By embracing a synthetic research ethic both at the primary
and secondary levels, the domain in question is able to arrive at a view of the rela-
tionships or effects in question that is more reliable, thereby enabling L2 theory
and practice to be more accurately informed by empirical efforts.
Further Reading
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Erlbaum.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and
meta-analysis. New York: Routledge.
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd
ed.). Washington, DC: American Psychological Association.
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS.
Chapter 4. New York: Routledge.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psy-
chology journals: Guidelines and explanations. American Psychologist, 54, 594–604.
Discussion Questions
1. Summarize, in your own words, the main arguments against the use of p
values and, conversely, in favor of “estimation thinking” and effect sizes.
Can you think of any counterarguments or situations in which the NHST
approach might be preferable or even justifiable?
2. Considering the current place of NHST and effect sizes in quantitative L2
research, what changes would you suggest to the field?
3. The subtitle of this chapter (“A back-to-basics approach to advancing quantitative methods in L2 research”) implies that power and statistical vs. practical significance have been around for a while. If this is the case, why have we as a field been so slow to embrace these notions in our research practices?
Notes
1. These values also assume a normal distribution; variance must also be considered in
calculating power and effect sizes.
2. However, the width of CIs for effect sizes is influenced by sample size.
References
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70, 426–443.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J. (1994). The earth is round (p &lt; .05). American Psychologist, 49, 997–1003.
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and
meta-analysis. New York: Routledge.
Cunnings, I. (2012). An overview of mixed-effects statistical models for second language
researchers. Second Language Research, 28, 369–382.
Egbert, J., & Plonsky, L. (in press). Success in the abstract: Exploring linguistic and stylistic
predictors of conference abstract ratings. Corpora.
Ellis, N. C. (2000). Editorial statement. Language Learning, 50, xi–xiii.
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical
power analysis program for the social, behavioral, and biomedical sciences. Behavior
Research Methods, 39, 175–191.
Granena, G. (2013). Individual differences in sequence learning ability and second lan-
guage acquisition in early childhood and adulthood. Language Learning, 63, 665–705.
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd
ed.). Washington, DC: American Psychological Association.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning,
65, Supp. 1, 125–157.
Loewen, S., Lavolette, B., Spino, L. A., Papi, M., Schmidtke, J., Sterling, S., et al. (2014).
Statistical literacy among applied linguists and second language acquisition researchers. TESOL Quarterly, 48, 360–388.
Mackey, A., & Sachs, R. (2012). Older learners in SLA research: A first look at working
memory, feedback, and L2 development. Language Learning, 62, 704–740.
Norris, J. M. (in press). Statistical significance testing in second language research: Basic
problems and suggestions for reform. Language Learning, 65, Supp. 1.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50(3), 417–528.
Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language
learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on lan-
guage learning and teaching (pp. 3–50). Amsterdam: John Benjamins.
Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and
challenges. Annual Review of Applied Linguistics, 30, 85–110.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodologi-
cal synthesis and call for reform. Modern Language Journal, 98, 450–470.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., Egbert, J., & LaFlair, G. T. (in press). Bootstrapping in applied linguistics: Assess-
ing its potential using shared data. Applied Linguistics.
Plonsky, L., & Oswald, F. L. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Porte, G. (2010). Appraising research in second language learning: A practical approach to critical
analysis of quantitative research (2nd ed.). Philadelphia/Amsterdam: John Benjamins.
Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of
knowledge in psychological science. American Psychologist, 44, 1276–1284.
Sagan, C. (1996). The demon-haunted world. New York: Random House.
Stukas, A. A., & Cumming, G. (in press). Interpreting effect sizes: Towards a quantitative
cumulative social psychology. European Journal of Social Psychology.
Taylor, A.M., Stevens, J. R., & Asher, J. W. (2006). The effects of explicit reading strategy
training on L2 reading comprehension: A meta-analysis. In J. M. Norris & L. Ortega
(Eds.), Synthesizing research on language learning and teaching (pp. 213–244). Amsterdam:
John Benjamins.
Thompson, B. (1992). Two and one-half decades of leadership in measurement and evalu-
ation. Journal of Counseling and Development, 70, 434–438.
4
A PRACTICAL GUIDE TO
BOOTSTRAPPING DESCRIPTIVE
STATISTICS, CORRELATIONS,
T TESTS, AND ANOVAS
Geoffrey T. LaFlair, Jesse Egbert, and Luke Plonsky
audience possible, we explain this process for both SPSS and R. The chapter concludes with suggestions for further reading and a set of discussion questions, both
meant to build and extend on the chapter. For readers who are interested in a
more thorough overview of conducting statistical analyses in R, we would direct
them to Larson-Hall (2015).
Conceptual Motivation
A number of reviews of quantitative L2 research have found that means-based
analyses such as t tests and ANOVAs dominate the analytical landscape (Gass,
2009; Lazaraton, 2005; Plonsky, 2013; Plonsky & Gass, 2011). This practice is not
necessarily problematic. However, such analyses are useful and meaningful only
given (a) the data conform to a set of statistical assumptions and (b) sufficient
statistical power (i.e., the ability to detect a statistically significant effect, when
present), both of which are often lacking (Phakiti, 2010; Plonsky, 2013; Plonsky &
Gass, 2011). The bootstrapped equivalents of these tests provide nonparametric
alternatives that do not make such strong assumptions about the distributions of
the data (Davison & Hinkley, 1997).
Before going on, we recognize of course that other procedures have been
designed to provide nonparametric equivalents to t tests and ANOVAs, such as
the Kruskal-Wallis and Mann-Whitney U tests. However, simulation research
carried out in the field of applied statistics has revealed that bootstrapped analyses
nearly equal their parametric equivalents in power and accuracy, when statistical
assumptions such as normality are met; when the data are not normally distributed, bootstrapped analyses provide greater statistical power (Lansing, 1999; Lee & Rogers, 1998; Tukey, 1960; Wilcox, 2001), meaning that bootstrapping can provide researchers with a method for accurately estimating their parameters of interest (e.g., differences in means and accompanying test statistics).
Whether or not the data conform to the requirements of parametric tests, the
sample sizes typical of L2 research provide perhaps the most compelling reason to
employ bootstrapping in place of or in addition to traditional tests. More specifi-
cally, quantitative analyses in L2 research are severely limited by the small samples
typically employed. Methodological reviews of quantitative research in the inter-
actionist tradition (Plonsky & Gass, 2011; K = 174) and the L2 domain more
generally (Plonsky, 2013; K = 606), for example, found average group/sample n
sizes of just 22 and 19, respectively. Furthermore, post hoc power calculated based
on these data and their corresponding effect sizes was only .56 and .57—that is,
slightly better than a coin toss. By resampling from the observed data, bootstrap-
ping enables researchers to obtain a data set that simulates a sample much larger
than what is typically found, simulating Ns in the thousands. Put another way,
bootstrapping provides researchers with the opportunity to overcome the lack of
statistical power and Type II error (failing to reject the null hypothesis when the
alternative is true) resulting from analyses based on small samples.
48 Geoffrey T. LaFlair et al.
reported statistically significant results in the original reports were not replicated
according to the bootstrapped analyses (i.e., a Type I error misfit five times higher
than an alpha of .05). Interestingly, all four misfits achieved a post hoc power of
.99, suggesting that traditional hypothesis testing coupled with very large samples
may overestimate the importance of an effect. Put another way, if the sample is
large enough, p values of less than .05 can always be obtained, regardless of the
actual difference between group means. Based on the results, we argue in favor
of the use of bootstrapping, not as a replacement for but in conjunction with
parametric statistics, particularly when (a) samples are especially small (in order
to increase power), (b) samples are especially large (in order to offset statistically
significant results that are due to large samples rather than strong effects), (c) the
data violate one or more assumptions such as normality, and (d) when any one or
more of these situations occurs in analysis of pilot data that will be used as a basis
for collecting more data. Echoing our colleagues (e.g., Norris & Ortega, 2000,
2006; Larson-Hall, 2015; Nassaji, 2012; Plonsky, 2011, 2013), we also argued for
a diminished role of the flawed and unreliable practice of statistical significance
testing and instead for a greater emphasis on descriptive statistics—namely means,
standard deviations, CIs, and effect sizes.
By now we hope to have made clear the potential of bootstrapping as a tool
for overcoming some of the challenges facing quantitative data and data analysis
in L2 research. However, it is important to note that this does not replace the
need for good design, large samples, or replicating our experiments. In the section
that follows, we describe the steps involved in running bootstrapped equivalents
of some of the most common analyses found in the field: descriptive statistics, t
tests, ANOVAs, and correlations.
Bootstrapping in Practice
This section of the chapter presents the step-by-step processes for conducting
simple bootstrapping with descriptive statistics, correlations, t-tests, and ANOVAs
in both SPSS and R. It is organized first by software program and then by statistic.
The reason that this part of the chapter is separated by software program is the difference in flexibility between the two programs. The bootstrapping options that
are available in the SPSS interface are somewhat limited. As you will see in the
one-way ANOVA example, SPSS bootstraps the CIs for all pairwise comparisons
(much like a Tukey post hoc analysis of an ANOVA). However, R offers the ability
to bootstrap any statistic of interest. In the one-way ANOVA section in R, you
will learn how to bootstrap the pairwise comparisons (as in SPSS) in addition to
the omnibus F-statistic and its corresponding effect size (eta-squared).
R can require some effort to learn because to utilize it to its full capabilities
it is necessary to learn the R programming language. Many researchers may
not need its full capabilities or may not be able to commit to learning how to
program in R. However, the effort put into learning how to use it will be rewarded.
50 Geoffrey T. LaFlair et al.
Considerations in Bootstrapping
Before we begin the step-by-step procedures we need to discuss four decision
points when conducting a bootstrap analysis:
assigned to one of two groups: a treatment group or a control group. Because the
data are homogeneous and have been randomly sampled, simple resampling would
be most similar to how the data were collected (Davison & Hinkley, 1997). If you
are working with a set of data that is drawn from two considerably different
subpopulations, you should use a stratified resampling procedure. An example of this
would be in the comparison of treatment effects on two different subpopulations
such as native speakers of a language and nonnative speakers of the same language.
In this method, simple case resampling is applied within each stratum. A third
resampling procedure is resampling the residuals (or errors) of a fitted model. This
is considered a semiparametric approach to resampling because the data are fit to
a parametric model (e.g., regression or ANOVA); however, the resampling is still
conducted using nonparametric procedures (Carpenter & Bithell, 2000). Resampling
residuals adjusts the value of each observation with a randomly sampled
residual—or the distance between an observation and the estimated parameter
value such as the sample mean. This method assumes homogeneity of variance.
Other resampling methods exist for other situations (e.g., non-homogeneous
variance; see Davison & Hinkley, 1997).
In the SPSS examples, all bootstrapped analyses have been performed using
the simple resampling method. In R, bootstrapped analyses of descriptive statis-
tics, correlations, and t-tests were performed using the simple resampling method.
To illustrate how residual resampling is conducted, this method was used for the
bootstrapped analyses for the ANOVA parameters (so we are assuming that the
residuals are homoscedastic).
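As a rough illustration of how these first choices surface in the boot package (the toy data and variable names here are our own inventions, not from the analyses reported below), simple case resampling passes resampled row indices to the statistic function on every replicate, while stratified resampling is requested through boot's strata argument:

```r
library(boot)
set.seed(42)

# Toy data standing in for two subpopulations (e.g., NS vs. NNS)
scores <- data.frame(attitude = rnorm(30, mean = 3.5, sd = 1.5),
                     group = factor(rep(c("NS", "NNS"), each = 15)))

# Simple case resampling: boot hands the statistic function the data
# and a vector of resampled row indices (i) on each replicate
mean_stat <- function(d, i) mean(d$attitude[i])
b_simple <- boot(data = scores, statistic = mean_stat, R = 2000)

# Stratified resampling: cases are resampled within each stratum
b_strat <- boot(data = scores, statistic = mean_stat, R = 2000,
                strata = scores$group)
boot.ci(b_strat, conf = 0.95, type = "bca")
```

The only difference between the two calls is the strata argument; the statistic function itself is unchanged.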
The third of these introductory decision points encountered when conducting
bootstrapped analyses involves the calculation of CIs. One of the goals
of bootstrapping is to estimate accurate CIs for the statistic of interest that would
closely match an exact CI for the population. A number of methods are available
and a discussion of their strengths and weaknesses is beyond the scope of this
chapter (see Davison & Hinkley, 1997, and DiCiccio & Efron, 1996, for further
discussion). Generally, the BCa method is more accurate in a wide variety of situ-
ations (Carpenter & Bithell, 2000; Chernick, 1999; Crawley, 2007; DiCiccio &
Efron, 1996). BCa stands for “bias corrected and accelerated,” and this method
adjusts CIs for skewness (bias-corrected) and nonconstant variance (accelerated)
in the bootstrapped data sets. In this chapter we will be reporting BCa intervals
from both SPSS (which offers percentile and BCa intervals) and the boot package
in R (which offers five types of intervals).
The fourth consideration is bootstrap diagnostics. Canty, Davison, Hinkley, and
Ventura (2006) provide a detailed overview of four diagnostic methods to assess
the reliability of the bootstrap calculations. The procedure covered in this chapter
is jackknife-after-boot, which is useful for investigating the effect of outliers on
the bootstrapped calculations. This examines the effects of individual cases on
bootstrap samples by plotting the quantiles of the bootstrap distribution with each
case removed. The jackknife-after-boot plot shows how much an individual case
affects the bootstrap statistic (Chernick, 1999; Davison & Hinkley, 1997).
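In the boot package this diagnostic is provided by the jack.after.boot function; a minimal sketch (with invented toy data, not the belief data analyzed below) might look like this:

```r
library(boot)
set.seed(7)

# A skewed toy sample in which outliers could distort the bootstrap
y <- rexp(40, rate = 0.5)
b <- boot(data = y, statistic = function(d, i) mean(d[i]), R = 2000)

# Jackknife-after-boot: plots quantiles of t* recomputed with each
# case deleted in turn, flagging cases that pull the distribution
jack.after.boot(b)
```

Cases whose points deviate sharply from the quantile lines in the resulting plot are candidates for closer inspection as influential observations.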
To obtain bootstrapped CIs for descriptive statistics in SPSS, select Analyze > Descriptive Statistics > Explore.
• As shown in Figure 4.1, move “Students should aspire to speak like native
speakers” to the Dependent List box.
• Move “Participant L1” to the Factor List box.
• Click Statistics in the upper right corner.
• Click Bootstrap.
As shown in Figure 4.2, select Perform bootstrapping.
Enter “10000” into the Number of Samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the Explore dialogue box.
Bootstrap Specifications
Figure 4.3 shows the SPSS settings used for the bootstrap that was performed.
In this case, the table indicates that (a) we set Sampling Method to Simple rather
than Stratified because we resampled (with replacement) from the entire data set
rather than from within each group separately, (b) we resampled 10,000 times,1
and (c) we used a bias-corrected and accelerated 95% CI.
The Descriptives table (Figure 4.4) shows the results of the bootstrapped
CIs for the mean and standard deviation of responses to “Students should aspire to
speak like native speakers” grouped according to participant L1 background. The
first two columns contain the mean values and their standard errors for a variety
of descriptive statistics. The four columns on the right contain the BCa bootstrap
results, including 95% CIs for each of the statistics as well as their respective biases
and standard errors.
These results show some variation in teacher beliefs across the three L1 groups.
A comparison between the results in “95% CI for Mean” and “BCa 95% CI” for
the mean reveals some small differences between the width and endpoints of the
bootstrap CI and the original CI. These results also show the bias, which is the
difference between the average of the bootstrap statistics and the original statistic
(e.g., the difference between the original estimate of the mean and the mean of
the bootstrapped samples).
Descriptive Statistics in R
Before any analysis, we need to get the data into R. The first step is to make sure
that you have set your working directory for R to the location of your data, or
type in a file path as in the screenshot that follows. (Note that here and through-
out the chapter bolded text in the Courier New font denotes a command, as does
bolded text in the regular body font; nonbolded Courier New font is the output
produced by R.) Setting the working directory can be done from the drop-down
menus in the R interface or through the command line (using the setwd command).
The next step is to read in the data (using a read command such as read.csv), and then
to take a quick look at the data frame. By using the head() command we can
see that we will be using the same data set and data structure for the examples
in R as we are in SPSS. The str command allows us to see the structure of our
variables. In the code sample we can see that the second line of code changed the
Participant L1 variable into a factor with three levels: English, Vietnamese, and
Spanish. We will be using this data frame for each of the analyses and will call on
subsets of the variables depending on the analysis.
FIGURE 4.4 Descriptive statistics table with bootstrapped 95% CIs for various
descriptive statistics (columns: Participant L1, Statistic, Std. Error; Bootstrap: Bias,
Std. Error, BCa 95% CI Lower and Upper). Unless otherwise noted, bootstrap results
are based on 10,000 bootstrap samples.
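A hedged reconstruction of the commands just described (the working directory and file name are assumptions to be adapted to your own setup) would be:

```r
# Point R at the folder holding the data, or use the GUI menu instead
setwd("~/bootstrapping")                 # assumed path
belief <- read.csv("belief.csv")         # assumed file name

# Second line of code: turn Participant L1 into a three-level factor
belief$L1 <- factor(belief$L1,
                    levels = c("English", "Vietnamese", "Spanish"))

head(belief)   # first six rows of the data frame
str(belief)    # structure of the variables
```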
> head(belief)
ID L1 L2_months Attitude
1 1 English 250 1
2 2 English 6 1
3 3 English NA 6
4 4 English 24 3
5 5 English 60 3
6 6 English 3 5
> str(belief)
'data.frame': 90 obs. of 4 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 . . .
$ L1 : Factor w/ 3 levels "English","Vietnamese",..:
1 1 1 1 1 1 1 . . .
$ L2_months: int 250 6 NA 24 60 3 120 4 120 12 . . .
$ Attitude : int 1 1 6 3 3 5 3 5 4 3 . . .
To retrieve the CIs for each of the moments we can call them one at a time as
illustrated in MEng.ci, or we can write a short “for loop” that will put them all
in a data frame for us (DESci.s and Dci).
> print(MEng.ci)
CALL:
boot.ci(boot.out = DESboot, conf = 0.95, type = "bca",
t0 = DESboot$t0[1],
t = DESboot$t[, 1])
Intervals :
Level BCa
95% (2.500, 3.625)
Calculations and Intervals on Original Scale
The next code sample illustrates a “for loop” in R that will collect the lower
and upper ends of the CIs from the boot.ci object and then put them in a data
frame with the original statistic.
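One way such a loop might be written (assuming DESboot is the boot object and that its t matrix holds one column per statistic, in the group-by-moment order shown below) is:

```r
# Names for the twelve bootstrapped statistics, in column order
stats <- c(paste("English",    c("mean", "sd", "skew", "kurt"), sep = "."),
           paste("Vietnamese", c("mean", "sd", "skew", "kurt"), sep = "."),
           paste("Spanish",    c("mean", "sd", "skew", "kurt"), sep = "."))

DESboot.CI <- data.frame(t0 = DESboot$t0, lwr = NA, upr = NA,
                         row.names = stats)
for (j in seq_along(stats)) {
  ci <- boot.ci(DESboot, conf = 0.95, type = "bca",
                t0 = DESboot$t0[j], t = DESboot$t[, j])
  # The BCa endpoints sit in the 4th and 5th positions of ci$bca
  DESboot.CI[j, c("lwr", "upr")] <- ci$bca[4:5]
}
print(DESboot.CI)
```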
> print(DESboot.CI)
t0 lwr upr
English.mean 3.06666667 2.5000000 3.6250000
English.sd 1.59597194 1.3615271 1.8825143
English.skew 0.04501668 -0.4550728 0.5788735
English.kurt 1.77394466 1.3456360 2.3605649
Vietnamese.mean 4.13333333 3.4855947 4.6923077
Vietnamese.sd 1.65536397 1.4104068 1.9984294
Vietnamese.skew -0.35456522 -0.9802685 0.1727889
Vietnamese.kurt 1.88111459 1.4385776 2.7561735
Spanish.mean 3.63333333 2.9677419 4.2121212
Spanish.sd 1.73171897 1.4879193 2.0225216
Spanish.skew -0.26918104 -0.8334546 0.3219356
Spanish.kurt 1.70501835 1.3560934 2.5062921
We can see from the output that the results of the bootstrapped analysis in R are
slightly different from the results of the SPSS analysis. However, the general results
are the same. This will be the case for any repeated bootstrapped procedure because
different random samples are drawn each time a bootstrap analysis is conducted.
We can see that the
Vietnamese group differs the most from the other two L1 groups in their mean
beliefs about whether or not students should aspire to speak like native speakers.
To obtain bootstrapped CIs for a Pearson correlation, select Analyze > Correlate > Bivariate.
• Move “L2 months of study” and “Students should aspire to speak like native
speakers” to the Variables box.
• Click Bootstrap.
Select Perform bootstrapping.
Enter “10000” into the Number of samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the Bivariate Correlations dialogue box.
The Correlations table from SPSS contains the Pearson Correlation coefficient,
significance level, and sample size information for the original, non-bootstrapped
data set. Like the Descriptives table, it also contains the bias, standard error, and
BCa 95% CI for the bootstrap correlation coefficients. The original results show
a small, nonsignificant positive correlation between “L2 months of study” and
teachers’ beliefs. The results of the bootstrap CI for the Pearson Correlation
[Figure 4.5: Correlations output table for “L2 months of study” and “Students
should aspire to speak like native speakers,” with bootstrapped 95% CIs]
coefficient show a notably wide interval, ranging from –.05 to .41. The results also
show a slight negative bias. This would indicate a lack of confidence in the accuracy
or stability of the original estimate of the correlation coefficient.
Again, when we use the boot function, we first enter our data (belief ). The
boot function will read the data set and identify the variables that we indicated
in the function earlier (“L2_months” and “Attitude”). Again, the results for the
estimated bias, standard errors, and CIs are similar to the SPSS results. Now that
we are familiar with the numerical output of a bootstrap analysis, we will plot the
10,000 bootstrap correlation coefficients and their normal Q-Q plot.
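The Corstat function referred to in the call below is not printed in full; a sketch consistent with the data frame above (note the NA values in L2_months, which need to be dropped) could be:

```r
# Correlation between months of L2 study and attitude, recomputed
# on each bootstrap resample; incomplete pairs are dropped
Corstat <- function(d, i) {
  cor(d$L2_months[i], d$Attitude[i], use = "complete.obs")
}
CORboot <- boot(data = belief, statistic = Corstat, R = 10000)
boot.ci(CORboot, conf = 0.95, type = "bca")
```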
Call:
boot(data = belief, statistic = Corstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 0.1984088 -0.004074199 0.1120397
CALL :
boot.ci(boot.out = CORboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (-0.0449, 0.3955)
Calculations and Intervals on Original Scale
We can see that the results of bootstrapping the correlation coefficient in SPSS
and in R are very similar. In R, we can also plot the bootstrapped samples and
the Q-Q plot to assess whether or not the sampling distribution follows a normal
distribution. In the plot on the left in Figure 4.6, the value of the original
correlation coefficient is marked with a vertical dashed line. This plot, taken together
with the information from the bootstrapped CI, shows that the sampling correlation
coefficient is very likely going to be small, and possibly 0. The Q-Q plot and
the accompanying histogram show that the samples of our statistic are normally
distributed. Because we are simulating the sampling distribution, this provides an
indication of the shape of the population distribution.
> plot(CORboot)
[Figure 4.6: histogram of the bootstrap correlation coefficients (t*, with density
on the y-axis) and their normal Q-Q plot]
To obtain bootstrapped CIs for the mean difference between two groups, select Analyze > Compare Means > Independent-Samples T Test.
• Move “Students should aspire to speak like native speakers” to the Test
Variable(s) box.
• Move “Participant L1” to the Grouping Variable box.
• Click Bootstrap.
Select Perform bootstrapping.
Enter “10000” into the Number of Samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the Independent-Samples T Test dialogue box.
The Independent Samples Test output table in Figure 4.7 contains the mean
differences for the original data set. It also contains the same bootstrap statistics as
the Descriptives and Correlation tables (figures 4.4 and 4.5): bias, standard error,
and 95% BCa CI around the bootstrapped mean difference values. In addition,
this table also includes significance values for the bootstrapped results. These
significance values can be interpreted as the proportion of the bootstrapped mean
difference values that are more extreme than the original mean difference value.
In this case, we see that about 1.5% of the bootstrapped mean difference values
were more extreme than the mean difference of –1.071 found in the original
analysis.
FIGURE 4.7 Independent-Samples Test output table with bootstrapped 95% CIs
(columns: Mean Difference; Bootstrap: Bias, Std. Error, Sig. (2-tailed), BCa 95% CI
Lower and Upper)
Since we are interested only in the mean difference between two groups in
this scenario, we will pass a subset of the data frame to the boot command that
contains only the two groups.
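A sketch of the Mdiffstat function assumed in the call below (taking rows 1–60 to hold the English and Vietnamese groups, as in the call shown) might be:

```r
# Difference between the two groups' mean attitudes on each resample
Mdiffstat <- function(d, i) {
  d2 <- d[i, ]
  mean(d2$Attitude[d2$L1 == "Vietnamese"]) -
    mean(d2$Attitude[d2$L1 == "English"])
}
Mdboot <- boot(data = belief[1:60, ], statistic = Mdiffstat, R = 10000)
```

With this ordering, a positive statistic means the Vietnamese group's mean is the larger of the two.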
Call:
boot(data = belief[1:60,], statistic = Mdiffstat,
R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 1.066667 0.0008049593 0.4103226
Again, the bootstrapped results for the mean difference, the bias, and the standard
error are similar to those in SPSS. Likewise, R produces a CI suggesting a
significant difference between the two groups’ beliefs about whether or not students
should aspire to speak like native speakers. The signs of the CI produced
by R differ from those in SPSS because the ordering of the groups was
reversed; this makes no difference to the conclusions that can be drawn
about the difference in means.
CALL :
boot.ci(boot.out = Mdboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (0.233, 1.850)
Calculations and Intervals on Original Scale
At this point, we will go one step further and check a diagnostic plot of the
jackknife-after-boot to investigate whether or not there are any individual cases
that have affected the bootstrap sample distribution.
The plots in Figure 4.8 show the original mean difference (dashed line) plot-
ted with the bootstrap mean differences in belief between English and Vietnamese
speakers. This and the CI obtained show that 0 is neither near the original estimate
nor within its 95% CI. The Q-Q plot shows that the bootstrap
values follow a normal distribution. The jackknife-after-boot shows that
there are three possible influential cases (3, 39, 59). When these are removed from
the bootstrap analysis the distance between the quantiles narrows, which creates
a slightly more peaked distribution. Without these values, it is possible that the
bootstrap CI around the original mean difference may be smaller.
FIGURE 4.8 Bootstrap mean differences, Q-Q plot, and jackknife-after-boot plot of
the mean difference between English and Vietnamese
The results of the bootstrap t-test shown next give the original test statistic
(t1* = –2.54), the bias (indicating that the average resampled test statistic was
smaller), and the standard error. The 95% BCa CI shows that 0 is not in the CI, a
standard criterion for evaluating the mean difference between two groups.
A Practical Guide to Bootstrapping 67
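The TTstat function is likewise not printed in full; one consistent sketch would recompute Welch's t on each resample (the function name and details are assumptions based on the output below):

```r
# Bootstrap the two-sample t statistic itself rather than the mean
# difference; the sign follows the factor ordering (English first)
TTstat <- function(d, i) {
  d2 <- droplevels(d[i, ])   # drop the unused Spanish level
  t.test(Attitude ~ L1, data = d2)$statistic
}
TTboot <- boot(data = belief[1:60, ], statistic = TTstat, R = 10000)
```

The droplevels call matters here: t.test's formula interface requires a grouping factor with exactly two levels, and the subset still carries the empty Spanish level.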
Call:
boot(data = belief[1:60,], statistic = TTstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* -2.540798 -0.04585189 1.072339
CALL :
boot.ci(boot.out = TTboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (-4.664, -0.464)
Calculations and Intervals on Original Scale
Figure 4.9 shows the bootstrap t-statistic values (the original estimate is marked
by a dashed line) in a histogram, a normal Q-Q plot (showing a normal
distribution), and a jackknife-after-boot plot. This plot shows how the quantiles of the
distribution change when the case marked on the bottom is removed from the
bootstrap. The purpose of this plot is to identify influential cases in the original
data set that could affect the bootstrap estimation of the sampling distribution if
the influential cases are drawn too often in the bootstrap analysis. Influential cases
would be marked by points showing large deviations from the lines that represent
the quantiles. The plot in Figure 4.9 does not show much variation in the distri-
bution when any of the cases are removed, which indicates a lack of influential
data points.
FIGURE 4.9 Plot of the bootstrap t-statistics, their Q-Q plot, and the jackknife-after-
boot plot
Pairwise Comparisons
SPSS does not currently bootstrap t-statistics or F-statistics. Bootstrapping
for ANOVAs in SPSS is limited to post hoc pairwise comparisons. To obtain
bootstrapped CIs for the post hoc pairwise comparisons of a one-way ANOVA,
select Analyze > Compare Means > One-Way ANOVA.
• Move “Students should aspire to speak like native speakers” to the Dependent
List box.
• Move “Participant L1” to the Factor box.
• Click Bootstrap.
Select Perform bootstrapping.
Enter “10000” into the Number of Samples box.
Select Bias corrected accelerated (BCa).
Click Continue.
• Click OK in the One-Way ANOVA dialogue box.
FIGURE 4.10 One-way ANOVA output table with bootstrapped 95% CIs
ANOVA in R
We can run a function in R that will bootstrap the pairwise comparisons, return
CIs for the mean difference, and return a nonparametric significance value (as in
SPSS).
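A hedged sketch of such a function (called Pairstat here, matching the output below) that returns all three pairwise mean differences at once:

```r
# One bootstrap replicate returns three statistics:
# t1* = Vietnamese - English, t2* = Spanish - English,
# t3* = Spanish - Vietnamese
Pairstat <- function(d, i) {
  m <- tapply(d$Attitude[i], d$L1[i], mean)
  c(m["Vietnamese"] - m["English"],
    m["Spanish"] - m["English"],
    m["Spanish"] - m["Vietnamese"])
}
PAIRboot <- boot(data = belief, statistic = Pairstat, R = 10000)
```

Because the statistic function returns a vector, boot tracks all three comparisons across the same 10,000 resamples.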
Again, the results of this analysis are similar to those from SPSS. The largest
mean difference in the beliefs is found between teachers with English as their first
language and teachers with Vietnamese as their first language (t1*). The mean
differences between English and Spanish (t2*) and between Spanish and Vietnamese
(t3*) are similar.
Call:
boot(data = belief, statistic = Pairstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 1.0666667 0.003789025 0.4197999
t2* 0.5666667 0.008612038 0.4276186
t3* -0.5000000 0.004823013 0.4359305
The next code sample creates a data frame of all CIs for the pairwise
comparisons of differences between group means. The results of the bootstrap CI
show one meaningful difference among the groups’ mean beliefs about students’
aspirations to speak like native speakers. The two groups that differed in this belief
were teachers who speak Vietnamese as a native language and teachers who speak
English as a native language (0 is not in the CI).
> print(PAIRboot.ci)
t0 lwr upr
Vietnamese-English 1.0666667 0.2333333 1.8909219
Spanish-English 0.5666667 -0.2940099 1.3590520
Spanish-Vietnamese -0.5000000 -1.3545038 0.3525119
F-Statistic
In the next function, we have fit a linear ANOVA model to the data. Then we
have created a vector of data for the residuals from the ANOVA model and a
vector of data for the predicted values from the model. The function was written
so that it randomly resamples the residuals and attaches them to the randomly
resampled cases.
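A sketch of that function (Fstat), under the residual-resampling scheme just described (here the resampled residuals are added back to the fitted values before refitting, a standard variant of this approach):

```r
# Fit the ANOVA model once on the original data
fit  <- lm(Attitude ~ L1, data = belief)
res  <- resid(fit)    # vector of residuals
pred <- fitted(fit)   # vector of predicted values

# Each replicate adds randomly resampled residuals to the predicted
# values, refits the model, and returns the omnibus F
Fstat <- function(d, i) {
  d$y.star <- pred + res[i]
  anova(lm(y.star ~ L1, data = d))$`F value`[1]
}
Fboot <- boot(data = belief, statistic = Fstat, R = 10000)
```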
For this type of test, we would reject the null hypothesis if the F-value were much
larger than 1. Therefore, because the number 1 falls within the CI shown at the
bottom of the following lines of code and their corresponding output, we would
fail to reject the null hypothesis that there is no difference between the three groups.
Call:
boot(data = belief, statistic = Fstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 3.093494 1.102791 2.823289
CALL :
boot.ci(boot.out = Fboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (0.167, 8.656)
Calculations and Intervals on Original Scale
Our results show that the original R2 value is .0664 and that the bootstrap
estimate has a bias of .0192. This means that the average R2 value of the resampled
distributions is slightly larger than the original estimate and shows that there is
some variability in the bootstrapped estimates of the effect size. This indicates that
our original estimate may not be a very accurate approximation of the effect size
of the differences. This is reflected in the bootstrap CI below. The CI for R2 is
quite wide (.0035, .167), indicating that anywhere between 0.35% and 16.7%
of the variance may be explained by the three groups in this particular ANOVA
model.
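The Rsqstat function behind this output might, under the same residual-resampling scheme, be sketched as follows (the fit is repeated here so the sketch is self-contained):

```r
# Fit the ANOVA model once on the original data
fit  <- lm(Attitude ~ L1, data = belief)
res  <- resid(fit)
pred <- fitted(fit)

# Each replicate refits the model on resampled residuals and
# returns R-squared as the effect size
Rsqstat <- function(d, i) {
  d$y.star <- pred + res[i]
  summary(lm(y.star ~ L1, data = d))$r.squared
}
Rsqboot <- boot(data = belief, statistic = Rsqstat, R = 10000)
```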
Call:
boot(data = belief, statistic = Rsqstat, R = 10000)
Bootstrap Statistics :
original bias std. error
t1* 0.06639327 0.01920224 0.05149469
> print(Rsqboot.ci)
CALL :
boot.ci(boot.out = Rsqboot, conf = 0.95, type = "bca")
Intervals :
Level BCa
95% (0.0035, 0.1673)
Calculations and Intervals on Original Scale
Example Study 1
Calzada, M. E., & Gardner, H. (2011). Confidence intervals for the mean: To bootstrap or
not to bootstrap. Mathematics and Computer Education, 45(1), 28–38.
Background
It is well-known that parametric statistical tests are inappropriate when sample
sizes are small or data is skewed. However, additional research is needed to
document the utility of nonparametric bootstrapping methods for such data with
varying sample sizes and degrees of skewness.
Research Aims
The goal of this study was to investigate whether bootstrap t CIs are superior to
Student’s t CIs in data from a range of sample sizes and with skewed distributions.
Method
Student’s t CIs (95%) and 100,000 bootstrap t CIs (95%) were generated for
simulated samples from a range of distributions (normal, Student’s t, continuous
uniform, Poisson, and gamma) and sample sizes (n = 5, 10, 15, 20, 25, 30, 35, 40,
and 45). The authors recorded the (a) percent of “correct” CIs that contained µ
(i.e., the “true” population mean) and (b) the precision or width of the CI.
Results
The results suggest that Student’s t CIs are appropriate for symmetric (i.e.,
non-skewed) data. However, bootstrap t CIs are better for skewed data with sample
sizes n ≥ 10. The authors also emphasize the effectiveness of bootstrapping for
estimating unknown means for data sets that are skewed or small.
Example Study 2
Guan, N. C., Yusoff, M. S. B., Zainal, N. Z., & Yun, L. W. (2012). Analysis of two
independent samples with non-normality using non-parametric method, data
transformation and bootstrapping method. International Medical Journal, 19(3), 218–220.
Background
Researchers commonly encounter data that is nonnormally distributed. Three
possible avenues for addressing such issues are: (a) nonparametric statistical
techniques (e.g., Mann-Whitney Test), (b) data transformations, and (c) bootstrapping.
Research Aims
The authors aimed to compare the use of nonparametric tests, data
transformation, and bootstrapping in order to measure differences between two
independent samples.
Method
The psychopathology of 202 patients was assessed using a standardized instrument
upon their discharge from a psychiatric hospital. They were then divided
into two groups based on whether or not they were readmitted to the hospital
less than six months later. The original data was found to be nonnormal using
a Kolmogorov-Smirnov Test. The authors then compared the usefulness of a
Mann-Whitney Test, log transformations, and bootstrapping (500 times) in
measuring group differences.
Results
The authors found a significant difference between the two groups using all three
methods. They suggest that all three are useful approaches to samples that fail to
meet assumptions of normality. However, they mention that one advantage of
bootstrapping over the other two methods is that it allows researchers to estimate
CIs for a range of statistics, including effect sizes.
Discussion Questions
1. What are the primary goals of bootstrapping?
2. What is the difference between simple, stratified, and residual resampling?
3. What is one advantage of conducting bootstrap analyses instead of
nonparametric techniques or transformations when data are non-normal?
4. Using the belief data set available on the companion website (http://oak.ucc.
nau.edu/ldp3/AQMSLR.html), answer the research question below. First,
conduct a traditional parametric ANOVA and any necessary post hoc analyses.
Second, bootstrap the ANOVA and check the model fit. Third, compare the
results of the bootstrap ANOVA with the traditional ANOVA. If you are using
R, perform a jackknife-after-boot diagnostic analysis. What conclusions can
you make about your data based on your bootstrapped parameters and CIs?
RQ: Is there a mean difference between beliefs of the effects of motivation on language
learning for instructors from different first-language (L1) backgrounds?
Further Reading
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge:
Cambridge University Press.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence limits. Statistical Science, 11(3),
189–228.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics,
7, 1–26.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York:
Chapman & Hall.
LePage, R., & Billard, L. (Eds.). (1992). Exploring the limits of the bootstrap. New York: John
Wiley & Sons.
Larson-Hall, J. (2012). Our statistical intuitions may be misleading us: Why we need
robust statistics. Language Teaching, 45, 460–474.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Lee, W.-C., & Rogers, J. L. (1998). Bootstrapping correlation coefficients using univariate
and bivariate sampling. Psychological Methods, 3, 91–103.
Plonsky, L., Egbert, J., & LaFlair, G. (in press). Bootstrapping in applied linguistics:
Assessing its potential using shared data. Applied Linguistics.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In I. Olkin,
S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to
probability and statistics: Essays in honour of Harold Hotelling (pp. 448–485). Stanford, CA:
Stanford University Press.
Yung, Y.-F., & Chan, W. (1999). Statistical analyses using bootstrapping: Concepts and
implementation. In R. H. Hoyle (Ed.), Statistical strategies for small sample research
(pp. 82–105). Thousand Oaks, CA: Sage.
Note
1. Chernick (1999) recommends 5,000–10,000 for most cases and reviews other methods
for estimating the number of replications needed in a bootstrap analysis (pp. 112–122).
References
Beasley, W. H., & Rogers, J. L. (2009). Resampling methods. In R. E. Millsap & A.
Maydeu-Olivares (Eds.), The Sage Handbook of quantitative methods in psychology
(pp. 362–386). London: Sage.
Canty, A. J., Davison, A. C., Hinkley, D. V., & Ventura, V. (2006). Bootstrap diagnostics
and remedies. The Canadian Journal of Statistics, 34, 5–27.
Canty, A. J., & Ripley, B. (2013). boot: Bootstrap R (S-Plus) Functions [Computer Soft-
ware]. R package version 1.3–9.
Carpenter, J., & Bithell, J. (2000). Bootstrap confidence intervals: When, which, what?
A practical guide for medical statisticians. Statistics in Medicine, 19, 1141–1164.
Chernick, M. R. (1999). Bootstrap methods: A practitioner's guide. New York: John
Wiley & Sons.
Crawley, M. J. (2007). The R book. West Sussex, England: John Wiley & Sons.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge:
Cambridge University Press.
DiCiccio, T. J., & Efron, B. (1996). Bootstrap confidence limits. Statistical Science, 11,
189–228.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York:
Chapman & Hall.
Gass, S. (2009). A survey of SLA research. In W. C. Ritchie & T. K. Bhatia (Eds.), Handbook
of second language acquisition (pp. 3–28). Bingley, UK: Emerald.
Keselman, H. J., Algina, J., Lix, L. M., Wilcox, R. R., & Deering, K. N. (2008). A generally
robust approach for testing hypotheses and setting confidence intervals for effect sizes.
Psychological Methods, 13, 110–129.
Komsta, L., & Novomestky, F. (2012). moments: Moments, cumulants, skewness, kurtosis
and related tests [Computer Software]. R package version 0.13. http://CRAN.R-project.
org/package=moments
Lansing, L. (1999). Bootstrapping versus the Student’s t: The problems of Type I error and power.
Unpublished master’s thesis, Lehigh University, Bethlehem, PA.
Larson-Hall, J. (2015). A guide to doing statistics in second language research using SPSS and R.
New York: Routledge.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Lazaraton, A. (2005). Quantitative research methods. In E. Hinkel (Ed.), Handbook of research
in second language teaching and learning (pp. 109–224). Mahwah, NJ: Erlbaum.
Lee, W.-C., & Rogers, J. L. (1998). Bootstrapping correlation coefficients using univariate
and bivariate sampling. Psychological Methods, 3, 91–103.
Nassaji, H. (2012). Significance tests and generalizability of research results: A case for
replication. In G. Porte (Ed.), Replication research in applied linguistics (pp. 92–115). Cam-
bridge: Cambridge University Press.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50, 417–528.
Norris, J. M., & Ortega, L. (2006). The value and practice of research synthesis for language
learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on
language learning and teaching (pp. 3–50). Philadelphia, PA: Benjamins.
Phakiti, A. (2010). Analysing quantitative data. In B. Paltridge & A. Phakiti (Eds.), Contin-
uum companion to research methods in applied linguistics (pp. 39–49). London: Continuum.
Plonsky, L. (2011). The effectiveness of second language strategy instruction: A meta-analysis.
Language Learning, 61, 993–1038.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., Egbert, J., & LaFlair, G. (in press). Bootstrapping in applied linguistics: Assessing
its potential using shared data. Applied Linguistics.
Tukey, J. W. (1960). A survey of sampling from contaminated distributions. In I. Olkin,
S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability
and statistics: Essays in honour of Harold Hotelling (pp. 448–485). Stanford, CA: Stanford
University Press.
Wickham, H. (2011). The split-apply-combine strategy for data analysis. Journal of Statistical
Software, 40(1), 1–29.
Wilcox, R. (2001). Fundamentals of modern statistical methods: Substantially improving power and
accuracy. New York: Springer.
Wolfe, E. W., & McGill, M. T. (2011). Comparison of asymptotic and bootstrap item fit indices
in identifying misfit to the Rasch model. Paper presented at the National Conference on
Measurement in Education, New Orleans, LA.
Yung, Y.-F., & Chan, W. (1999). Statistical analyses using bootstrapping: Concepts and imple-
mentation. In R. H. Hoyle (Ed.), Statistical strategies for small sample research (pp. 82–105).
Thousand Oaks, CA: Sage.
5
PRESENTING QUANTITATIVE
DATA VISUALLY
Thom Hudson
As soon as you have collected your data, before you compute any statistics, look at
your data.
(Wilkinson and the Task Force on Statistical
Inference, 1999, p. 597)
Background
Graphical charts and tables afford important avenues for exploring data and pre-
senting vivid, transparent representations of statistical findings. Their form and
use, however, are often given little conscious attention in discussions of reporting
statistical results. This is an unfortunate state of affairs. Just as statistical reporting
requires reporting of both data centrality and dispersion for an in-depth under-
standing of data, it benefits from careful graphic representation that provides clear
visuals of how data behave. As Cleveland (1985) explains, graphic representations
can display a large amount of quantitative information in ways that can be
absorbed thoroughly, perhaps more thoroughly and immediately than through
the presentation of means, standard deviations, effect sizes, and p-values. There are
numerous graphic forms: tables, histograms, line graphs, box plots, scatter plots,
and more. The choice of which graphic type to use depends upon the type of data
(univariate, bivariate, categorical, continuous), the function (comparison,
description, exploration), and the audience.
Too often graphics are seen as an afterthought in the data analysis process.
This is despite the work by Tufte (1983, 1997, 2006), Cleveland (1985, 1993),
Kosslyn (2006), Klass (2008), Few (2009, 2012), Robbins (2013), Larson-Hall and
Herrington (2010), and others who have demonstrated the power of compelling
graphic display in communicating information. Many of these writers note
that part of the problem is an overreliance on computers and the software that
goes with them. Microsoft Excel can produce eye-catching graphics, and SPSS
will produce bar charts with a few mouse clicks. However, accompanying the
ease with which these may be produced is often a lack of close examination
of the data. “Computers can’t make sense of data; only people can” (Few, 2009,
p. 2). Thoughtful use of graphics can help us make sense of and effectively com-
municate the implications of our data. The thoughtful use of graphics involves
establishing the hierarchy of purposes for the graphic before designing any display.
Graphics can be used for exploration of data, communication of discovery, and
archival of collected data for future use. Each of these uses requires different deci-
sions in the design process. The focus in the present chapter will be on the use of
graphic display for communication of information to those in, or who wish to
join, the community of second language (L2) researchers.
Graphics are frequently used in L2 research reporting. During the year prior
to the writing of this paper, 136 data-based research articles in five major jour-
nals in the field presented 514 tables and 207 figures.1 It is noteworthy that there are
over twice as many tables as figures. Although many of the tables provide textual
information such as research study procedures, variable definitions, and scoring
rubrics, a large number contain descriptive and inferential statistics results, cor-
relation matrices and results that could easily be more informatively presented in
a graphical chart. A breakdown of the types of graphical charts among the figures
in the journals is presented in Table 5.1 (see also Larson-Hall, in preparation).
TABLE 5.1 Types of graphical charts and frequency of use found in the last four regular
issues of five L2 journals
Line Graph 14 14 6 8 5 47
Grouped Bar Chart 10 11 14 4 1 40
Diagram (Text) 11 5 5 8 2 31
SEM or Path Diagram 5 7 3 4 3 22
Scatter Plot 9 5 0 1 3 18
Pictures 9 1 1 1 1 13
Bar Chart 2 4 2 1 2 11
Spectrogram/Acoustic Display 4 0 2 0 0 6
IRT Rasch map 0 0 0 5 0 5
Box and Whisker 0 0 0 1 2 3
Dot Chart (CIs) 0 3 0 0 0 3
Stacked Bar Chart 0 0 0 0 3 3
Pie Chart 0 0 0 0 2 2
Stacked Bar Chart 3-D 0 0 0 0 2 2
Forest Plot 0 1 0 0 0 1
Total 64 51 33 33 26 207
From Table 5.1, it can be seen that there is a very strong reliance on line graphs
and bar charts (either regular or grouped). This practice is unfortunate in that
these types of visual displays fail to provide the rich information available in other
chart types. It should also be noted that the Diagram (Text) graphics, the third
most common type of graph, are predominantly either visual presentations of
theoretical models or descriptions of study procedures, and are thus not providing
quantitative information.
FIGURE 5.1 Cleveland’s 1993 graphic display of barley harvest data from Immer,
Hayes, & Powers (1934)
2009, 2012; Klass, 2008; Kosslyn, 2006; Nicol & Pexman, 2010; Larson-Hall &
Plonsky, 2015; Robbins, 2013; Tufte, 1983, 1997, 2006). Although these authors
are not unanimous in their recommendations, overlap in their recommendations
indicates that quantitative graphical displays should reflect these general rules for
displaying quantitative information:
Tables
Tables allow complex units of data to be presented in an organized way. They are
effective when the reader needs to examine exact values of different variables.
They allow the consumer to make additional calculations and make exact com-
parisons. Additionally, they are useful when there is a need to provide information
together that is in different units of measurement (e.g., proficiency level, years
studying an L2, mean length of utterance). Graphic display generally begins with
a data table of some form. Tables systematically display numbers and effectively
structure and present concentrated data and small data sets.
Tabular data should be presented unambiguously. This requires that descrip-
tive text be presented in the table titles, headings, footnotes, labels, and source
information. Clark (1987) points out that numbers are only one element of the
overall data. She notes that a data set includes the words that link numbers to the
phenomena that are under consideration and ties the content elements as needed
to make clear the who, what, how, where, and when associated with the numerical
information. Titles should be clear, sample sizes should be apparent, quantities
should be specified, and time frames should be obvious.
Wainer (1997) provides four rules for table construction: (1) round heavily,
(2) order rows and columns in a sensible way, (3) include summary rows and
columns when important, and (4) add spacing to aid perception. He argues that
people cannot process more than two digits easily, that we cannot generally justify
more than two digits statistically because of standard errors, and that most people
almost never care about more than two digits. However, the two-digit recom-
mendation is simply a standard against which data reporting can be evaluated. For
example, when data have historically been reported on College Entrance Examination
Board (CEEB) scales (used for the SAT, GRE, and the TOEFL paper-and-pencil
test), three digits may be a convention that is most conveniently followed. The
information in Table 5.2 is an initial, though slightly edited, presentation of infor-
mation on the 2009 National Assessment of Educational Progress Reading scale
results. The results are for grade 12 public schools for the 11 U.S. states reported. It
was produced by the NAEP State Comparisons Tool from the National Center for
Education Statistics (http://nces.ed.gov/nationsreportcard/statecomparisons).
A number of aspects of this table can be addressed from the perspective of the
general guidelines presented in the previous section as well as the specific guide-
lines for tables proposed by Wainer. First, the shading and grid lines do not add
information to the graphic or aid in perception. They simply serve to increase
the amount of ink without increasing the amount of information. Second, the
“Order” variable is an empty category and the repetition of “2009” and “Scale
Score” in the second and third rows are redundant information provided in the
table title. Generally, if the value in any row is the same across columns (or any
column is the same across rows), the row or column should be removed and perhaps
included elsewhere, such as a header or footnote. The label "National public"
TABLE 5.2 2009 average reading scale score sorted by gender, grade 12 public schools
Male-Female
All students Male Female Difference
2009 2009 2009 2009
Order Jurisdiction Scale Score Scale Score Scale Score Scale Score
N/A National public 287.0595571 280.956378 292.9596502 –12.00327228
N/A Arkansas 279.8846598 271.1364272 288.6513578 –17.51493065
N/A Connecticut 292.3508196 284.9950077 299.7995149 –14.80450727
N/A Florida 282.6334833 275.5068654 289.3160879 –13.8092225
N/A Idaho 290.1409912 284.654741 296.0461276 –11.3913866
N/A Illinois 291.5195945 285.5453884 297.310035 –11.76464663
N/A Iowa 290.6223739 283.895896 297.7043157 –13.80841973
N/A Massachusetts 295.4572734 289.9215732 301.1007774 –11.17920417
N/A New Hampshire 292.9695062 283.7600856 302.3824946 –18.62240893
N/A New Jersey 288.0905513 281.6284422 294.3658039 –12.73736175
N/A South Dakota 291.9890962 285.7366043 298.5041494 –12.76754502
N/A West Virginia 279.3981132 270.7917682 287.8034791 –17.01171092
Note: The NAEP Reading scale ranges from 0 to 500.
Source: U.S. Department of Education, Institute of Education Sciences, National Center for
Education Statistics, National Assessment of Educational Progress (NAEP), 2009 Reading Assessment.
is not clear since all schools in the sample are public schools rather than private.
Further, the scale scores are carried out to seven decimal places. As noted at the
bottom of the table, the reading scale ranges from 0 to 500. It is unlikely that
anyone reading the table would be interested in decimal places. Finally, the font is
too small. Thus, we can tidy the display as shown in Table 5.3.
The revised table is less cluttered and the NAEP scale scores are easier to pro-
cess without the excessive digits after the decimal. Since the scores are reported
on a scale from 0 to 500, the decimals are unnecessary. Unnecessary information
below the table has been eliminated.
Now, we need to address two additional issues. First, while the table presents
the scale scores for each of the states with available information, no overall com-
parison statistics are provided. Summary statistics below the table would be infor-
mative for comparative purposes. Second, the ordering of the states is alphabetical,
an arbitrary order that is almost never satisfactory or informative. However, in
the present case, it is a judgment call as to whether the table is presented just for
someone to look up his or her own state’s score, in which case alphabetical order
might make sense, or whether there is the need for comparative information
across states, in which case alphabetical order is not informative. Ordering the
state scores in descending order of overall scale score will facilitate score
comparisons. Likewise, the table would benefit from having the labels more
TABLE 5.3 2009 average NAEP reading scale scores by gender for grade 12 public schools
in 11 states (first revision)
All Male-Female
Jurisdiction Students Male Female Difference
TABLE 5.4 2009 average NAEP reading scale scores by gender for grade 12 public schools
in 11 states sorted on state mean scores (second revision)
Male-Female
Jurisdiction All Students Male Female Difference
centrally aligned with the numbers. In addition, the negative sign indication for
the differences between male and female scores is an artifact of how the gender
categories were ordered in calculations and is unnecessary in displaying actual
magnitude. Table 5.4 reflects these changes.
FIGURE 5.2 Types of graphics used over last four regular issues of five applied linguis-
tics journals
The chart in Figure 5.2 presents the distribution of chart types across the
different journals described in Table 5.1. It looks quite impressive and fetch-
ing in my opinion, and was especially so in the original color version on
screen. However, this chart violates virtually all of the suggestions for good
graphic design. It does not present the data unambiguously, efficiently, clearly,
or meaningfully. First, it does not have a clear purpose. It is unclear why any-
one would need to see graphically the different numbers of every graphic
type by each journal. The graph is not useful for description, exploration, or
tabulation. Any comparative function would be much better displayed in the
table format rather than a three-dimensional bar chart. Further, if one wanted
to compare the different graph types graphically across journals, it would be
more advisable to take only the five or six most common graph types. Also,
the elements of the graph are not labeled clearly. It is not clear that the x-axis
represents the graph type and the y-axis displays the number of tokens of each
graph type for each of the different journals. The rotation of the graph makes
it difficult to make any comparisons of the data. For example, it is not possible
to compare MLJ and LL for graph type 1 because the bars are hidden behind
other bars.
In short, the graph does a poor job of showing the data. The data do not
stand out because the graph is cluttered. The y-axis scale appears to be distorted
in that it is very extended vertically. This accentuates the apparent differences
between the numbers of occurrences. For example, the difference between TQ
and SSLA for graph type 1 is only 2, but the difference looks more striking in
Figure 5.2. Further, the three-dimensional representation makes it difficult to
interpret actual values. How many occurrences correspond to LL graphic type
15? The three-dimensional cylindrical columns serve no function. The gridlines
are deceptive in that each gridline does not correspond to an actual numerical
difference between the y-axis score numbers. The excessive number of gridlines
creates clutter and adds ink without adding information.
We can see that there are a number of ways to go wrong with graphs. We will
now look at several different graph types and examine their uses. We will also
discuss ways to keep these graphs within the guidelines provided earlier. The data
for the chart types that I describe are based on four data sources: a hypothetical
set of data representing language test performance and questionnaire information
for 45 examinees in a language program; the journal use of graphic information
data in Table 5.1; the NAEP data from Table 5.4; and data from an introductory
L2 studies course I taught online.
Bar charts and histograms. These are basic charts that are fairly simple to
produce and read. Bar charts are used for discrete/categorical variables along the
x-axis (e.g., yes/no, country of origin), while histograms are used for continuous
variables, sometimes broken into categories (e.g., 10–19, 20–29, 30–39, etc.).
Examples are shown in Figure 5.3 and Figure 5.4. Note that the bar chart has
space between the bars while the histogram typically does not. The bar chart
describes the subjects' mean scores on a listening test across the three self-rated
confidence rating categories.
FIGURE 5.3 Bar chart showing means of listening scores for each category of self-rated
confidence ratings with 95% CI (N = 45)
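The binning step that separates a histogram from a bar chart can be sketched as follows (an illustration with invented scores, not data from the chapter):

```python
from collections import Counter

def bin_scores(scores, width=10):
    # Histogram preparation: group continuous scores into fixed-width
    # intervals (10-19, 20-29, ...) before counting and plotting.
    counts = Counter((score // width) * width for score in scores)
    return {f"{lo}-{lo + width - 1}": counts[lo] for lo in sorted(counts)}

print(bin_scores([12, 15, 23, 27, 28, 31, 35, 38, 39]))
# -> {'10-19': 2, '20-29': 3, '30-39': 4}
```

A bar chart, by contrast, would count truly discrete categories (yes/no, country of origin) with no binning step at all.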
There are variations on the simple bar graph and histogram. First, there are
grouped bar charts as shown in Figure 5.5. Additionally, there are stacked bar
charts such as that in Figure 5.6. However, the stacked bar charts in Figure 5.6
point out the difficulty of comparing lengths that do not have a common base-
line. For example, it is difficult to accurately compare the gender composition
of proficiency group 2. This is a common problem with using information in
stacked bar charts.
FIGURE 5.4 Histogram of speaking scores
FIGURE 5.5 Grouped bar chart for speaking scores by course level and gender with 95% CI
FIGURE 5.6 Stacked bar charts showing percentage composition of proficiency groups by gender
The discussion of Table 5.1 indicated that bar charts represent a very large pro-
portion of the graph types that appear in the journals surveyed. However, Tufte
(1983, 1997) criticizes bar charts for several reasons. First, they frequently have too
much ink and too little information. On one hand, in Figure 5.5, some form of
texture or shading is provided for both gender categories. Less ink would be used
if one category were simply left without shading. On the other, the Figure 5.3
and Figure 5.5 bar charts provide CIs along with the mean scores. This at least
provides minimal additional information about the precision of measurement.
However, bar charts do not show much information about the distribution of the
scores. They do not provide information about the range of actual scores or about
the standard deviations. For the most part, they only focus on means.
In order to improve on some of the shortcomings of bar charts, some writ-
ers encourage the use of box-and-whisker plots such as those in Figure 5.8
(see Larson-Hall & Plonsky, 2015). The graph displays the median value for all
scores in addition to the median for all scores above the median (the 75th per-
centile) and for all scores below it (the 25th percentile). The area between these
last two medians represents the middle 50% of all scores (i.e., the 25th to 75th
percentile). This area is represented as the box in the plot. The whiskers extend
out to the upper and lower extreme scores. Some authors refer to the top end
of the box as the upper hinge and the lower end of the box as the lower hinge. The
lower hinge is also frequently referred to as the lower quartile (the first through
25th percentiles) and the upper hinge as the top quartile (the 75th to 100th per-
centiles). Figure 5.8 represents two box-and-whisker plots for the speaking test
results by gender.
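The quantities behind a box-and-whisker plot can be computed directly. The sketch below uses invented scores and one common convention for the hinges (statistical packages differ slightly in how they compute quartiles):

```python
import statistics

def five_number_summary(scores):
    # The five values a box plot draws: whisker ends at the extremes,
    # hinges at the medians of the lower and upper halves, and the
    # overall median as the line inside the box.
    s = sorted(scores)
    n = len(s)
    lower_half = s[: n // 2]          # scores below the median
    upper_half = s[(n + 1) // 2 :]    # scores above the median
    return {
        "min": s[0],
        "lower_hinge": statistics.median(lower_half),
        "median": statistics.median(s),
        "upper_hinge": statistics.median(upper_half),
        "max": s[-1],
    }

print(five_number_summary([3, 5, 7, 8, 9, 11, 12, 13, 14, 15, 16]))
# -> {'min': 3, 'lower_hinge': 7, 'median': 11, 'upper_hinge': 14, 'max': 16}
```

The box spans the two hinges (the middle 50% of scores) and the whiskers run out to the minimum and maximum.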
One way to look at a box plot is to see it as a histogram turned on its side.
This comparison, however, highlights one criticism of box plots, namely, that a
single box plot does not provide as much information as a histogram in cases in
which the data do not cluster around the median score (Klass, 2008). For example,
the histogram in Figure 5.4 provides more information than a single box plot.
Yet, box plots are very useful when several box plots are used to compare some
distribution across groups. For example, Figure 5.9 provides a better sense of
how the data are distributed around proficiency levels than the histogram shown
previously in Figure 5.4. The box plots show decreasing internal dispersion
in scores as the proficiency level increases. This information is not available in a
single histogram across speaking scores, nor would it be apparent in a bar graph.
FIGURE 5.8 Box-and-whisker plots for the speaking test scores by gender
Because the marked points in box plots represent percentiles, box plots can also be
criticized in comparison with bar charts that contain 95% CIs because the latter
would enable a visual determination of statistically significant differences between
groups. Given the multiple and severe weaknesses of statistical significance (e.g.,
Plonsky, Chapter 3 in this volume), however, the informational richness provided
by box plots is a worthwhile trade-off.
Line graphs. This type of graph frequently displays data in a time series. Fig-
ure 5.10 shows that across five tests, the student scores increased each administra-
tion except at time 3. The line indicates a continuum along which the students
develop. A consideration for Figure 5.10 is whether the graph should in fact begin
with a zero point along the y-axis. It is generally good practice to include a zero
point in order to provide the reader with an accurate baseline. An argument could
be made in the present case that the zero point is needed to provide perspective,
particularly if the test happened to be a commonly known test and readers would
have reference to what a score of 55 or 70 means. However, it could also be
argued that the real comparison here is over time and that beginning the y-axis
scale with zero is a waste of space.
It is not uncommon to find line graphs with categorical data types. However,
this practice should be employed cautiously because the line between categories
can imply a continuum that is not warranted. The decision is not always clear,
however. Figure 5.11 presents scores for students across three levels on three sub-
tests. The graph has several problems. In some instances, it may be acceptable to
interpret the different levels as acting as proxies for ordered scale intervals. How-
ever, at other times, it may not be warranted. For example, the three different
FIGURE 5.9 Box-and-whisker plots for the five proficiency levels across the speaking
test scores
FIGURE 5.10 Student scores (means and CIs) on five tests administered three weeks
apart over a semester (N = 45)
FIGURE 5.11 Mean scores and 95% CIs on reading, listening, and grammar for three
proficiency levels
FIGURE 5.12 Graphic representation of score data across levels with box chart display
of distributions
FIGURE 5.13 Scatter plot for the relationship between reading scores and grammar
scores (N = 45)
FIGURE 5.14 Mean state scores for NAEP data in Table 5.4
FIGURE 5.15 Mean state scores for NAEP data in Table 5.4 ordered by state score
FIGURE 5.17 Number of weekly online posts with sparklines showing the online post-
ing activity for each student
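A sparkline of the kind shown in Figure 5.17 compresses a series of counts into a word-sized graphic. A minimal sketch, using invented weekly counts and Unicode block characters:

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    # Map each value onto one of eight block heights, scaled between
    # the series minimum and maximum, yielding a one-line graphic.
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a flat series
    return "".join(BARS[round((v - lo) * (len(BARS) - 1) / span)]
                   for v in values)

print(sparkline([1, 3, 2, 5, 8, 4, 0]))
```

One such line per student places an entire semester of posting activity next to each name without a separate chart.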
make. The same information would be easily envisioned from a simple table or
numbers in the text. However, the pie chart on the right is very difficult to inter-
pret. Its elongation makes it difficult to compare the darkest area with the lightest
area. Further, the depth dimension adds no information at all. It leads the eye away
from the data.
FIGURE 5.19 Initial SPSS bar chart for speaking mean scores by level
FIGURE 5.20 Edited SPSS bar chart for speaking mean scores by level
Whether you appreciate my particular graphic decisions or not, it is clear that it is not
necessary to accept the default graphics provided by SPSS and most other programs.
Finally, the R programming language has incredibly powerful graphing capa-
bilities. This is not to downplay the steep learning curve involved with R. How-
ever, two R statistical graphics packages in particular stand out: lattice (Sarkar,
2008) and ggplot2 (Wickham, 2009). An example ggplot2 line graph of a statisti-
cal interaction between gender and level is presented in Figure 5.21.
This line graph shows that there is a change in relative position of scores with
the hi level in contrast to the int and low levels. Additionally, the graph provides
the score points and CIs for each level with gender classification. The ggplot2 R
package allows for extensive customization of graphic components.
Closing Remarks
When presenting quantitative data visually, it is important for authors not to sim-
ply rely on the default graphics that are available through a computer program.
Additionally, just as it is not wise to uncritically adopt a research design from a
FIGURE 5.21 ggplot2 line graph of listening scores by proficiency level and gender with 95% CIs
Background
This study addresses the questions, “What makes a graph better or worse at
communicating relevant quantitative information?” and “How can students
learn to interpret graphs more effectively?” It reviews the cognitive literature
on how viewers comprehend graphs and the factors that influence viewers’
interpretations.
The Study
Shah and Hoeffner (2002) note that analyses of graph comprehension have
looked at three major component processes: Viewers must encode the visual
array and identify important features, they must relate the visual features to
the conceptual relations represented by the features, and they must deter-
mine the referent of the concepts being quantified. In processing graphics,
viewers are more likely to describe x-y trends and retrieve the information
accurately when viewing line graphs than when viewing bar graphs. The
literature tends to indicate that line graphs are good for depicting x-y trends,
bar graphs for discrete comparisons, and pie charts for relative proportions.
Three-dimensional displays proved better than two-dimensional displays
when integration of information across three dimensions was needed. How-
ever, despite the potential benefits of three-dimensional displays, the use of
three-dimensional linear perspective drawings can degrade information. In
addition to global decisions about the general format, a graph can involve
additional visual features: color, size, and aspect ratio.
Knowledge about graphs affects how viewers encode and remember the
graphics. Viewers expect dependent variables to be plotted as a function
of the y-axis and independent variables on the x-axis. Additionally, viewers
rely on prior knowledge in interpreting graphs. Graph viewers are better at
understanding some types of content, such as those representing change, than
others.
Implications
Shah and Hoeffner infer nine principles from the research review. Six of these
are relevant for the current discussion:
1. Choose the format depending upon the communication goal.
2. Use multiple formats to communicate the same data.
3. Use the “best” visual dimensions to convey metric information when
possible.
4. Reduce working memory demands.
5. Choose aspect ratio and data density carefully.
6. Make graphs and text consistent.
• The Work of Edward Tufte and Graphics Press provides excellent resources
about graphic display: http://www.edwardtufte.com/tufte/.
• The Gallery of Data Visualization: The Best and Worst of Statistical Graphics
presents examples with the view that the contrast may be useful, inform cur-
rent practice, and provide some pointers to both historical and current work:
http://www.datavis.ca/gallery/.
• Visual Statistics: Seeing Data with Dynamic Interactive Graphics: http://
www.uv.es/visualstats/Book/.
• Statistical Graphs, Charts and Plots: Statistical Consulting Program: http://
pages.csam.montclair.edu/~mcdougal/SCP/statistical_graphs1.htm.
• Hadley Wickham is interested in gaining a better understanding of statistical
models through data visualization. His website (http://had.co.nz/) is an
excellent resource, particularly for the R programming language.
• The Top Ten Worst Graphs: http://www.biostat.wisc.edu/~kbroman/
topten_worstgraphs/.
Further Readings
Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Oak-
land, CA: Analytics Press.
Few, S. (2012). Show me the numbers. Burlingame, CA: Analytics Press.
Kistler, S. J., Evergreen, S., & Azzam, T. (2013). Toolography. In T. Azzam & S. Evergreen
(Eds.), Data visualization, part 1. New Directions for Evaluation, 139, 73–84.
Tufte, E. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tufte, E. (1997). Visual explanations. Cheshire, CT: Graphics Press.
Wainer, H. (1997). Improving tabular displays, with NAEP tables as examples and inspira-
tions. Journal of Educational and Behavioral Statistics, 22(1), 1–30.
Wainer, H. (2005). Graphic discovery. Princeton, NJ: Princeton University Press.
Discussion Questions
1. Of the suggestions on page 82, which do you identify as most important for
the graphic display of quantitative data? Which are the least important? Why?
2. When should a table be used instead of a graph and vice versa? When might
you want to include both a graph and a table for a given data set?
3. What problems can you identify in the following graph?
[A line graph plotting reading and listening scores (0–100) for each individual examinee, with examinee numbers along the x-axis]
Note
1. The selected journals were Language Learning issues 62(4)–63(3), The Modern Language
Journal issues 96(4)–97(3), TESOL Quarterly issues 46(2)(4), 47(1–2), Studies in Second
Language Acquisition issues (34[3–4]), 35(1)(3), Applied Linguistics issues 34(1–4). Tables
and figures were counted from regular articles, excluding special issues of the journals,
and excluding graphics from article appendices or additional article information pro-
vided on internet sites.
References
Anscombe, F. J. (1981). Computing in statistical science through APL. New York: Springer.
Clark, N. (1987). Tables and graphs as a form of exposition. Scholarly Publishing, 19 (1),
24–42.
Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth Advanced
Books and Software.
Cleveland, W. S. (1993). Visualizing data. Murray Hill, NJ: AT&T Bell Laboratories.
Cleveland, W. S. (1994). The elements of graphing data (revised ed.), Murray Hill, NJ: AT&T
Bell Laboratories.
Daniel, C. (1976). Applications of statistics to industrial experimentation. New York: Wiley.
Few, S. (2004). Show me the numbers: Designing tables and graphs to enlighten. Oakland, CA:
Analytics Press.
Few, S. (2009). Now you see it: Simple visualization techniques for quantitative analysis. Oakland,
CA: Analytics Press.
Few, S. (2012). Show me the numbers. Burlingame, CA: Analytics Press.
Fisher, R. A. (1966). The design of experiments (8th ed.). Edinburgh: Oliver and Boyd, Ltd.
Immer, R. F., Hayes, H. K., & Powers, L. (1934). Statistical determination of barley varietal
adaptation. Journal of the American Society of Agronomy, 26, 403–419.
Klass, G. M. (2008). Just plain data analysis: Finding, presenting, and interpreting social science data.
Lanham, MD: Rowman & Littlefield Publishers, Inc.
Kosslyn, S. M. (2006). Graph design for the eye and mind. Oxford: Oxford University Press.
Lane, D. M., & Sandor, A. (2009). Designing better graphs by including distributional
information and integrating words, numbers and images. Psychological Methods, 14,
239–257.
Larson-Hall, J. (in preparation). Graphics and data accountability in L2 acquisition research.
Larson-Hall, J., & Herrington, R. (2010). Improving data analysis in second language
acquisition by utilizing modern developments in applied statistics. Applied Linguistics,
31, 368–390.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning, 65,
Supp. 1, 125–157.
National Center for Education Statistics. NAEP State Comparisons Tool. Retrieved 26 November 2013, from http://nces.ed.gov/nationsreportcard/statecomparisons/.
Nicol, A.A.M., & Pexman, P. M. (2010). Displaying your findings: A practical guide for creating
figures, posters, and presentations. Washington, DC: American Psychological Association.
Robbins, N. B. (2013). Creating more effective graphs. Wayne, NJ: Chart House.
Sarkar, D. (2008). lattice: Multivariate data visualization with R. New York: Springer.
Tufte, E. (1983). The visual display of quantitative information. Cheshire, CT: Graphics Press.
Tufte, E. (1997). Visual explanations. Cheshire, CT: Graphics Press.
Tufte, E. (2006). Beautiful evidence. Cheshire, CT: Graphics Press.
Wainer, H. (1997). Improving tabular displays, with NAEP tables as examples and inspirations. Journal of Educational and Behavioral Statistics, 22(1), 1–30.
Wainer, H. (2005). Graphic discovery. Princeton, NJ: Princeton University Press.
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer.
Wilkinson, L. and the Task Force on Statistical Inference. APA Board of Scientific Affairs.
(1999). Statistical methods in psychology journals. American Psychologist, 54(8), 594–604.
6
META-ANALYZING SECOND
LANGUAGE RESEARCH*
Luke Plonsky and Frederick L. Oswald
Before we outline the major steps and key considerations when conducting a
meta-analysis, we will define the term meta-analysis in both a narrow and broad
sense. The narrower definition of meta-analysis refers to a statistical method for
calculating the mean and the variance of a collection of effect sizes across studies,
usually correlations (r) or standardized mean differences (d). The broader defini-
tion of meta-analysis includes not only these narrower statistical computations,
but also the conceptual integration of the literature and the findings that gives the
meta-analysis its substantive meaning. This integration involves the meta-analyst’s
expert understanding, translation, and communication of the research studies and
samples involved, along with the best theory that researchers offer across (and
beyond) the set of studies. The current chapter focuses primarily on the practi-
cal aspects of meta-analysis under this broad definition, where we describe how
meta-analysis addresses (if not solves) three major problems inherent to narrative
or qualitative reviews in second language (L2) research.
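The computational core of this narrow definition is small enough to sketch directly. The Python sketch below computes the two effect size indices just mentioned from the descriptive statistics that primary studies typically report; the group means and standard deviations are invented for illustration, and the d-to-r conversion follows Borenstein, Hedges, Higgins, and Rothstein (2009):

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference (d), using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def d_to_r(d, n1, n2):
    """Convert d to a correlation r; a corrects for unequal group sizes."""
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d**2 + a)

# Hypothetical treatment vs. comparison groups (30 learners each):
d = cohens_d(52.0, 10.0, 30, 47.0, 10.0, 30)
print(round(d, 2))                  # 0.5
print(round(d_to_r(d, 30, 30), 2))  # 0.24
```

Once every study's result is expressed on a common metric such as d or r, the meta-analytic mean and variance described below can be computed across studies.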
The first problem is that narrative reviews are qualitative in nature: They may
survey and describe study effects verbally, but in doing so, they do not consider
how sampling error variance confounds the interpretation of variation in research
findings. Specifically, small samples alone can contribute to variation (statistical
imprecision) in study effects, independent of the particular theories, samples, mea-
sures, or settings that also contribute to variation (substantive variance) in study
effects. Rather than treating effect sizes and the accompanying study narrative in
a qualitative manner, a meta-analysis is a more objective method in which study
effects with larger sample sizes are more statistically precise and therefore con-
tribute more heavily to meta-analytic results. The second problem with narrative
reviews is their general overreliance on the ritual of null hypothesis significance
testing (NHST; see, for example, Plonsky, Chapter 3 in this volume). If a narra-
tive review focuses narrowly on p values from NHST instead of effect sizes, two
dangers are likely to arise: Some statistically significant results will be given too
much attention (i.e., when the actual effect is negligible, but with a small p value
because it is based on a large sample) and nonsignificant results may be ignored,
yet many nonsignificant results across studies may be suggestive of a practically
and statistically significant effect if they are aptly combined in a meta-analysis. The
third problem with narrative reviews is that although experts in L2 research have
a vast storehouse of discipline-specific knowledge, as humans, they are fallible and
subject to the foibles of human memory and emotion, making imperfect or incon-
sistent decisions and interpretations regarding a body of research. To be sure, the
expertise and judgment of L2 researchers remain essential to any literature review
process, yet meta-analysis serves as one critical quantitative tool that supplements
expertise and judgment and is more objective and systematic in nature. Without
such tools, narrative reviews may pay greater attention to those empirical findings
that are accompanied with more compelling verbal rationale or are published in
prestigious journals, even when other empirical findings are equally legitimate.
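The second problem can be made concrete with a small numerical sketch: five identical small studies, each nonsignificant on its own, combine into a precise and statistically significant meta-analytic estimate. The variance formula for d and the inverse-variance weighting follow Borenstein et al. (2009); the study values are hypothetical:

```python
import math

def d_variance(d, n1, n2):
    """Large-sample sampling variance of d (Borenstein et al., 2009)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def fixed_effect_summary(effects):
    """Inverse-variance-weighted mean d, its standard error, and z."""
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in effects]
    mean = sum(w * e[0] for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return mean, se, mean / se

# Five small studies, each d = .40 with 20 learners per group:
studies = [(0.40, 20, 20)] * 5
z_single = 0.40 / math.sqrt(d_variance(0.40, 20, 20))
mean, se, z_combined = fixed_effect_summary(studies)
print(round(z_single, 2))    # 1.25 -> each study nonsignificant on its own
print(round(z_combined, 2))  # 2.8  -> the combined effect is significant
```

A reviewer counting p values would tally five "failures"; the weighted combination shows a stable d of .40 estimated with far greater precision than any single study provides.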
How to Do a Meta-analysis
Meta-analysis has many parallels with the primary studies it attempts to sum-
marize. In both cases, the researcher must define the domain of interest, develop
measures, collect and analyze data, and interpret the theoretical and practical sig-
nificance of those findings.
Although no one coding sheet will work for everyone, certain information
will be common to almost all meta-analyses (see Table 6.1; see Lipsey & Wilson,
2001, for an example that is not domain specific). Other information particular
to the domain being meta-analyzed will also need to be coded. For example, a
meta-analysis of reading comprehension intervention studies might code for vari-
ables such as a study’s text length and genre, learners’ L2 vocabulary knowledge,
and first-language (L1) reading ability. A coding manual that defines each variable
and its associated values is also needed in order to train coders, resolve inter-
coder ambiguities, and generally ensure that the coding stage leads to a reliable
and justifiable data set. (See Wilson, 2009, for a thorough discussion of decision
points and procedures related to developing a valid and reliable coding scheme
for meta-analysis.)
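In practice, a coding sheet is simply a structured data set. The sketch below shows one way to serialize coded study records for later analysis; the field names and study entries are purely illustrative, not a recommended scheme, and a real sheet would mirror the coding manual:

```python
import csv
import io

# Illustrative coding-sheet fields; a real scheme follows the coding manual.
FIELDS = ["study_id", "year", "journal", "n", "design",
          "context", "outcome_measure", "d_value"]

rows = [
    {"study_id": "S01", "year": 2009, "journal": "Language Learning",
     "n": 48, "design": "between-groups", "context": "classroom",
     "outcome_measure": "reading comprehension", "d_value": 0.52},
    {"study_id": "S02", "year": 2012, "journal": "TESOL Quarterly",
     "n": 31, "design": "pretest-posttest", "context": "laboratory",
     "outcome_measure": "vocabulary recall", "d_value": 0.88},
]

# Write the coded records to CSV, the format most meta-analytic software reads.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Storing the codes in a plain, machine-readable format like this also makes it trivial to share the sheet as supplementary material, a practice we advocate below.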
Finally, “the first draft of a coding guide should never be the last” (Cooper,
2010, p. 86). The meta-analyst should be prepared to pilot, revise, and repilot the
coding sheet before and even during the coding process (e.g., Aytug et al., 2012;
Kepes, McDaniel, Brannick, & Banks, 2013).
By keeping a log, the meta-analyst can then report the extent to which data for
certain variables were inferred, imputed, or left out.
At least one additional rater should be trained and then asked to code as many of
the studies being meta-analyzed as possible. Lipsey and Wilson
(2001) recommend double coding of at least 20 but ideally 50 or more studies.
However, with a median sample of only 17 studies in the 91 L2 meta-analyses
reviewed by Plonsky and Oswald (2014), it may often be possible to double
code all of the studies in the meta-analysis (see Lee et al., in press). It is then very
important to report some measure of interrater agreement to determine coding
accuracy (e.g., intraclass correlation, Cohen’s kappa, percent agreement), along
with some description of the number and nature of rating discrepancies and how
their resolution was achieved. Additionally, we urge L2 meta-analysts to make
their coding procedures and all coding sheets directly accessible to their readership
as supplementary material (e.g., Microsoft Excel sheets). These documents
can be made available through journals' or individual researchers' websites
by providing a link in the written report or a footnote similar to that in Plonsky
(2011), which states “In order to facilitate replication and/or re-analysis, the data
set used in this study will be made available upon request.” Template versions of
coding schemes can and should also be made available in the aforementioned
venues and/or through the IRIS database for L2 instruments.
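Percent agreement and Cohen's kappa, two of the interrater indices mentioned above, are straightforward to compute once double coding is complete. In this sketch the two raters' codes for a hypothetical 'context' variable are invented for illustration:

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Proportion of double-coded studies on which the two raters agree."""
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Chance-corrected agreement for two raters on a nominal variable."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    counts_a, counts_b = Counter(codes_a), Counter(codes_b)
    # Expected agreement if both raters coded at random with their base rates:
    p_expected = sum(counts_a[k] * counts_b[k]
                     for k in set(codes_a) | set(codes_b)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Two raters' codes for ten double-coded studies (hypothetical):
r1 = ["lab", "class", "class", "lab", "class", "lab", "lab", "class", "class", "lab"]
r2 = ["lab", "class", "lab",   "lab", "class", "lab", "lab", "class", "class", "class"]
print(percent_agreement(r1, r2))        # 0.8
print(round(cohens_kappa(r1, r2), 2))   # 0.6
```

The gap between the two indices (.80 vs. .60) illustrates why kappa is preferred when categories are few: some raw agreement is expected by chance alone.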
Analysis
As we stated at the outset of this chapter, meta-analysis essentially involves cal-
culating a mean effect size and its corresponding variance from a particular body
of research. Whereas the literature searching and coding stages help ensure the
body of research and corresponding effect sizes are appropriate, the analysis stage
is where the meta-analyst estimates this overall mean and variance. Despite the
seeming simplicity of calculating a mean and variance, there can be some impor-
tant challenges and decisions to make. A single study, for example, may report
multiple effect sizes on the same relationship, based on multiple settings, mul-
tiple groups, multiple measures, and/or multiple time points. It may be justifiable
merely to average them prior to the meta-analysis. But the multiple effects in stud-
ies like these are often complex, and the underlying heterogeneity is important to
understand. For instance, caution must be exercised when handling a set of studies
where some effects are pretest-posttest designs and others are between-groups
designs. Although most meta-analyses of L2 research have treated effects from
both types of studies as comparable, they should generally be treated sepa-
rately, because pretest–posttest designs tend to produce larger effects (see Mor-
ris, 2008). A related issue is how L2 meta-analyses have mistakenly applied the
between-groups formula for the d value to pretest–posttest designs. This is a mis-
take because in the latter case, calculation of an appropriate d value requires the
correlation between pre- and posttests. This correlation is almost never reported
in primary studies, but without its value (or some reasonable estimate), the effect
size will be biased (Cheung & Chan, 2004; Gleser & Olkin, 2009). In Plonsky
and Oswald’s (2014) synthesis of effects across 91 meta-analyses of L2 research, the
researchers provide empirical evidence for this bias. The median meta-analytic d
values resulting from between-groups (independent samples) and within-groups
(pretest-posttest) contrasts were .62 versus 1.06, respectively.
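The role of the pre-post correlation can be illustrated numerically. One common formulation standardizes the pre-post change by the standard deviation of the change scores, which cannot be computed without r; the means and SDs below are invented, and the comparison with the misapplied between-groups formula shows how the resulting d shifts with r:

```python
import math

def within_group_d(m_pre, m_post, sd_pre, sd_post, r):
    """Pre-post change standardized by the SD of the change scores.
    Requires the pre-post correlation r, rarely reported in primary studies."""
    sd_diff = math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)
    return (m_post - m_pre) / sd_diff

# Between-groups formula misapplied to a pretest-posttest design:
naive = (60 - 55) / 10  # 0.5, regardless of r
for r in (0.3, 0.5, 0.8):
    print(r, round(within_group_d(55, 60, 10, 10, r), 2))
```

With highly correlated pre- and posttests (r = .8 here), the change-score d is substantially larger than the naive value, which is one reason pretest-posttest contrasts tend to yield bigger effects than between-groups contrasts.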
Another common issue in the analysis phase involves dealing with missing
data. Studies often lack information critical to meta-analysis. Sometimes the only
option is to exclude such studies, the choice made by most L2 meta-analyses to
date. However, if the number of available studies for meta-analysis itself is pre-
ciously small, then a second option might be to estimate unreported values (cf.
Higgins, White, & Wood, 2008). A meta-analyst must weigh the benefits of retaining
studies that at least provide partial information (e.g., means) by estimating the
data that they lack (e.g., standard deviations), with the potential drawbacks of esti-
mating or assuming too much out of the missing data. A third option is to request
missing data directly from the study’s researchers. Although this last decision may
be the ideal solution, it may be a challenge to contact researchers successfully and
have them comply with data requests (see Orwin, 1994; McManus et al., 1998).
A small number of L2 meta-analyses have reported using this strategy, leading
generally to a positive response of approximately 30% (e.g., Lee et al., in press;
Plonsky, 2011; but cf. Plonsky, Egbert, & LaFlair, in press).
incorporating other factors such as rated study quality (see Hunter & Schmidt,
2014, and Schmidt, Le, & Oh, 2009, for detailed information on this approach).
This method may be worth pursuing once meta-analysis in L2 research has
matured and studies routinely report information on measurement reliability. In
general, it is the meta-analyst’s responsibility to strike a balance between choosing
a meta-analysis method that is too simple versus one that is too complex in order
to summarize the data in a reliable and maximally informative manner.
Oswald & Johnson, 1998; although see Sutton & Higgins, 2008). It is much better
to take an a priori approach to understanding variance in effect sizes by dividing
study effects into a priori subgroups determined by theory and/or coded vari-
ables, meta-analyzing the subgroups, and then comparing the meta-analytically
weighted average effects. This approach is far superior to the post hoc approaches
of estimating effect sizes in the RE model or testing for effect size heterogeneity
with the Q test.
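The a priori subgroup approach amounts to running a separate weighted meta-analysis within each coded subgroup and then comparing the resulting means. A minimal sketch, with hypothetical effects coded in advance by research setting:

```python
import math

def d_variance(d, n1, n2):
    """Large-sample sampling variance of d (Borenstein et al., 2009)."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

def weighted_mean_d(effects):
    """Inverse-variance-weighted mean d for one subgroup."""
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in effects]
    return sum(w * e[0] for w, e in zip(weights, effects)) / sum(weights)

# (setting, d, n1, n2) -- hypothetical effects coded a priori by setting:
coded = [("lab", 0.9, 15, 15), ("lab", 1.1, 20, 20),
         ("classroom", 0.4, 30, 30), ("classroom", 0.6, 25, 25)]

for setting in ("lab", "classroom"):
    subgroup = [(d, n1, n2) for s, d, n1, n2 in coded if s == setting]
    print(setting, round(weighted_mean_d(subgroup), 2))
```

Because the subgroups are defined before looking at the effect sizes, the lab-classroom contrast here is a planned comparison rather than a post hoc fishing expedition.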
Finally, in line with the saying that “a picture is worth a thousand words,”
graphs and plots of data serve as critical tools in any data analysis (Wilkinson &
Task Force on Statistical Inference, 1999), and the forest plot and funnel plot are
the primary visualization tools used in meta-analysis (Borenstein, Hedges, Hig-
gins, & Rothstein, 2009). A forest plot presents the size of the effect on the x-axis
with the names of the studies being ordered (alphabetically or by the magnitude
of the effect) on the y-axis (see Figure 6.1). The plotted points usually bisect a
symmetric horizontal bar that shows the 95% CI, and in the bottom row is the
meta-analytic mean and its 95% CI. A funnel plot provides similar information to
a forest plot: It is a scatter plot of the effect size on the x-axis, with some func-
tion of measurement precision associated with the effect on the y-axis (e.g., the
sample size, the inverse of the sampling error variance). If the level of imprecision
in some studies is much larger than the variance in the effects of the underlying
study populations (as is usually the case), then this plot will tend to show a funnel
shape, hence the name (see Figure 6.2). Asymmetries in the funnel plot can serve
as an indicator of publication bias, such as when authors, editors, and reviewers
suppress small or statistically nonsignificant effects (see Figure 6.3). Asymmetries
can also indicate the need to examine moderator effects (subgroup analyses) or
other anomalies, such as the question of whether effect sizes from one research
team tend to be much larger than the rest. In short, the forest plot and funnel plot
for publication bias are indispensable visualization tools that can indicate mean-
ingful patterns in the meta-analytic database.
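The logic of a forest plot can be sketched even without graphics software: each study occupies a row, a marker sits at its effect size, and a bar spans its 95% CI. In the toy rendering below, the d values mirror three of the studies in Figure 6.1, but the standard errors are invented for illustration:

```python
def forest_plot(studies, lo=-2.0, hi=3.0, width=50):
    """Text-mode forest plot: '|' marks zero, '-' spans the 95% CI, '*' marks d."""
    def col(x):
        # Map an effect size onto a character column, clamped to the axis.
        return min(width - 1, max(0, round((x - lo) / (hi - lo) * (width - 1))))
    lines = []
    for name, d, se in studies:
        row = [" "] * width
        for c in range(col(d - 1.96 * se), col(d + 1.96 * se) + 1):
            row[c] = "-"
        row[col(0.0)] = "|"   # the zero (no-effect) line
        row[col(d)] = "*"     # the study's point estimate
        lines.append(f"{name:<10}{''.join(row)}")
    return "\n".join(lines)

# d values taken from Figure 6.1; the standard errors are hypothetical.
studies = [("Study 1", 0.2, 0.30), ("Study 2", 0.4, 0.25),
           ("Study 5", -1.3, 0.40)]
print(forest_plot(studies))
```

Even this crude rendering makes the key visual judgments possible at a glance: which CIs cross zero, and which studies (like Study 5 here) look like outliers.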
In closing this section on the analysis stage in meta-analysis, we want to point
out our bottom-line intent. Our goal is for L2 researchers to understand how
studies are weighted in meta-analysis and how FE and RE meta-analytic models
assume or estimate the variance across study effect sizes. However, we ultimately
recommend that meta-analytic estimates be considered in combination with
graphs of the effects (forest or funnel plots), and of course, a solid knowledge of
the research associated with the effects under study. Only then can meta-analysts
attempt to give partial insight into three fundamental questions: (a) Are all stud-
ies similar enough to be considered replicates of one another? (b) Do subgroups
of effect sizes differ in meaningful ways (e.g., lab vs. classroom studies)? (c) Are
there effect sizes that are outliers or that otherwise show unique characteristics
(e.g., a single large-sample military language-learning study within a sample of
college-classroom studies)?
FIGURE 6.1 Example of a forest plot: Study 1 (d = .2), Study 2 (d = .4), Study 3 (d = .2), Study 4 (d = .8), Study 5 (d = −1.3), Study 6 (d = .06), Study 7 (d = −.37), Study 8 (d = −.2), Study 9 (d = −1.5), and Study 10 (d = .25), each plotted with its 95% CI on an x-axis running from −2 to 3
FIGURE 6.2 Example of a funnel plot without the presence of publication bias (effect size d on the x-axis, from −2 to 3; sample size on the y-axis, from 0 to 160)
FIGURE 6.3 Example of a funnel plot with the presence of publication bias (effect size d on the x-axis, from −2 to 3; sample size on the y-axis, from 0 to 160)
correspond roughly to the 25th, 50th (median), and 75th percentiles of effects.
(For within-groups contrasts, we suggest the same three general descriptors for
d = .60, 1.00, and 1.40, respectively.) Observed correlation coefficients (r) were
.25 (25th percentile), .38 (50th), and .60 (75th). We are not suggesting that these
values should be applied universally to the breadth of L2 research but rather as a
single step away from Cohen and toward more field-specific interpretation of the
practical significance of effect sizes from L2 meta-analyses (and primary studies).
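Translating such field-specific benchmarks into practice is a simple thresholding exercise. The sketch below uses the within-groups d values suggested above (.60, 1.00, 1.40) and the observed r percentiles (.25, .38, .60); the descriptor labels are our shorthand, not a fixed convention:

```python
def interpret_within_d(d, thresholds=(0.60, 1.00, 1.40)):
    """Map a within-groups d onto a rough descriptor via the benchmarks above."""
    labels = ("below small", "small", "medium", "large")
    return labels[sum(abs(d) >= t for t in thresholds)]

def interpret_r(r, thresholds=(0.25, 0.38, 0.60)):
    """Same idea for correlation coefficients."""
    labels = ("below small", "small", "medium", "large")
    return labels[sum(abs(r) >= t for t in thresholds)]

for d in (0.3, 0.8, 1.2, 1.6):
    print(d, interpret_within_d(d))
# 0.3 below small / 0.8 small / 1.2 medium / 1.6 large
```

As the surrounding discussion emphasizes, such labels are a starting point for interpretation, not a substitute for domain knowledge about the effects being compared.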
Plonsky and Oswald (2014) also discuss in depth additional factors worthy of
consideration when interpreting effect sizes. The findings of previous syntheses,
for example, can help researchers gauge the relative magnitude of observed effect
sizes. Plonsky (2011), for example, discussed the findings of his meta-analysis of
L2 strategy instruction in relation to meta-analyses of strategy instruction in L1
educational contexts (e.g., Hattie, Biggs, & Purdie, 1996). Another way to refine
these benchmarks is to examine the historical trajectory of effects in the area
being investigated. Effect sizes that decrease over time may indicate that the vari-
ables of interest are being examined at an increasingly more nuanced level (Kline,
2013), in light of theoretical refinements and as research shifts from the laboratory
setting (where independent variables are strongly manipulated) to the classroom
setting (where independent variables are more naturalistic). Plonsky and Gass
(2011), for example, reviewed 174 studies of L2 interaction, finding that average
d values tended to decrease steadily over time (1980–1989, d = 1.62; 1990–1999,
d = .82; 2000–2009, d = .52). This finding was attributed in part to increasingly
subtle models of interaction that have been introduced, developed, and tested over
the last 30 years. In a similar vein, Mackey and Goo (2007) and Plonsky (2011)
calculated effects for subgroups based on research context, and substantially larger
d values were found in both meta-analyses for lab over classroom studies (.96 vs.
.57 and .79 vs. .43, respectively; see Sample Study 2).
In an alternative scenario of how effect sizes may change over time, improve-
ments to design and measurement in a particular research area might overcome
the flaws of past research and lead to larger effect sizes (Fern & Monroe, 1996).
Meta-analyses by Spada and Tomita (2010) and Mackey and Goo (2007) found
that more recent studies have used open-ended test formats more often, which
were found to produce larger effect sizes than more constrained test formats in
Li (2010; see Sample Study 2) and Lyster and Saito (2010) (but larger effects
were not found for open-ended formats in Mackey & Goo, 2007, or Norris &
Ortega, 2000). It should be noted that the two trends described here may occur
simultaneously and cancel each other out or lead to increased variation in effects
over time. To be sure, locating and explaining patterns related to the maturity of a
domain is complex, and the data will not speak for themselves, necessitating once
again the substantive knowledge and perspective of the expert reviewer.
One final consideration with respect to interpreting meta-analytic effect sizes
is the degree to which independent variables in primary research are manipulated.
From a practical standpoint, a particular intervention may not be feasible (despite
producing a large effect) if it is excessively involved, financially prohibitive, or
SAMPLE STUDY 1
Plonsky, L. (2011). The effectiveness of second language strategy instruction:
A meta-analysis. Language Learning, 61, 993–1038.
Background
Research on L2 strategy instruction has been extensive, but methods and
results in this area have been inconsistent. The goals of this study were to
summarize current findings and examine theoretical moderators of the
effects of strategy instruction.
Research questions
• How effective is L2 strategy instruction?
• How is strategy instruction affected by different learning contexts,
treatments, outcome variables, and research methods?
Method
Conventional database searches, Web of Science, and Google Scholar were
used to locate a total of 95 unique samples from 61 studies (N = 6,791) that
met all the inclusion criteria. Each study was then coded on 37 variables. Five
of 15 authors who were contacted provided missing data for studies report-
ing insufficient information to calculate an effect size.
Statistical tools
Effect sizes (Cohen’s d) were weighted by sample size and combined to cal-
culate the meta-analytic average, standard error, and CIs. Publication bias
was examined using a funnel plot. Summary effects were also calculated for
subgroups based on study characteristics (i.e., moderators).
Results
The (weighted) meta-analytic d value for the effects of L2 strategy instruc-
tion was .49, smaller than most effects in the L2 domain but comparable to
SAMPLE STUDY 2
Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis.
Language Learning, 60, 309–365.
Background
The theoretical and practical centrality of corrective feedback has led to
extensive research testing its effects, yet disagreement remains over how
empirical findings can inform L2 theory and practice. It is also unclear how
different types of feedback, learning contexts, and targeted L2 features
might relate to its effectiveness.
Research questions
• What is the overall effect of corrective feedback on L2 learning?
• Do different feedback types impact L2 learning differently?
• Does the effectiveness of corrective feedback persist over time?
• What are the moderator variables for the effectiveness of corrective
feedback?
Method
Li searched two academic databases, manually searched the archives of over
a dozen journals of L2 research, and scanned the references of review arti-
cles. This study also included 11 dissertations for a total of 33 unique study
reports.
Statistical tools
The Comprehensive Meta-Analysis software program enabled a relatively
sophisticated meta-analysis, statistically speaking. All results were calcu-
lated and presented using both RE and FE models, and availability and
publication bias were addressed using a funnel plot and a trim-and-fill anal-
ysis. (Trim-and-fill is a nonparametric statistical technique that adjusts the
meta-analytic mean. It does so by estimating effects that appear to be miss-
ing if a FE model and no systematic bias are assumed.) Additionally, Li tested
for several subgroup differences between studies.
Results
The overall d value for CF according to the FE model was .61 (RE = .64).
Moderator effects were also found for feedback types, delayed effects, and different
contexts (e.g., classroom vs. lab). There was some evidence of publication
bias, yet the effect sizes from the 11 nonpublished dissertations in this
study were larger on average than in published studies.
Conclusion
Meta-analysis has immense potential to summarize L2 research in a systematic
manner, adding clarity to the current status of theoretical claims while provid-
ing critical insights and directions for future research. Along with the benefits,
however, taking on a meta-analytic approach introduces a set of challenges that
include both those inherent to the method as well as particular to the field. In
light of these challenges, we close with suggestions that summarize the approach
and perspective that we have presented throughout this chapter.
First, despite the inherently greater objectivity embodied in the meta-analytic
approach, there is no single or best way to do a meta-analysis. Each step involves
multiple decisions that must be made in accordance with the researcher’s goals, the
substantive domain being synthesized, and the practical constraints of the available
data. As a principle, we believe that better decisions are usually the simpler ones,
such as analyses that are clear and understandable as opposed to more sophis-
ticated analyses that are technically correct but confusing and without practical
benefit. Second, as each of these important decisions is made, it is essential that the
meta-analyst maintain precise records so the results are understood appropriately in
the context of the entire process that led to them. Third and last, we have attempted
to identify and translate some of the general insights that other disciplines have
gained through decades of experience with meta-analysis, and we hope that other
L2 researchers will do the same in these critical formative years for meta-analysis in
the field. With some confidence, we can predict for L2 research what has happened
in all other major disciplines that have been exposed to meta-analysis: The coming
years will continue to show an exponential gain in the publication of meta-analytic
results. Meta-analysis will begin to be the microscope through which past L2
research is interpreted as well as the telescope through which theoretical develop-
ments and future L2 research efforts will be directed. Exciting times lie ahead as
meta-analysis becomes an essential tool in the L2 researcher’s toolbox.
Further Reading
History
Current Methods
• Database searches for meta-analysis: In'nami and Koizumi (2010), Plonsky and Brown (2015).
• Introduction to research synthesis: Ortega (in press).
• Timeline of research synthesis and meta-analysis: Norris and Ortega (2010).
• Review of meta-analysis in L2 research: Oswald and Plonsky (2010).
• Meta-analysis and replication: Plonsky (2012).
• Guide to interpreting effect sizes in meta-analysis: Plonsky and Oswald (2014).
Discussion Questions
Specific
on the x-axis and the d values on the y-axis. In SPSS, select Graphs >
Legacy Dialogs > Scatter/Dot > Simple Scatter, then move the vari-
ables into their respective boxes. How would you describe the pattern of
change, if any, in relation to the scenarios described in the earlier section on
Interpreting the Results?
7. Analyses-3: Examine and compare the funnel plots in Norris and Ortega
(2000, p. 452), Li (2010, p. 331), and Plonsky (2011, p. 1007). Do you see any
evidence for publication bias in those plots? If so, which one(s)? Do you see
any other irregularities? How might they be explained?
8. Interpreting the results: The overall findings in Plonsky's (2011) meta-analysis
of strategy instruction are interpreted in a variety of ways (e.g., compared to
a meta-analysis of L1 strategy instruction, Cohen's benchmarks, the benchmarks
described in Oswald & Plonsky, 2010, standard deviation units). Which
one(s) do you find most informative or relevant to the discussion? Why?
General
9. Which steps in carrying out a meta-analysis are the most/least objective and
subjective? How might each step of a meta-analysis affect the results that are
obtained?
10. Which areas of L2 research do you think might be good candidates currently
for meta-analysis? Why?
11. Describe the most important similarities between primary research and
meta-analysis.
12. Meta-analyses depend entirely on past research, but they can also be used to
direct future research. Select an L2 meta-analysis and consider its implica-
tions for future empirical efforts.
13. Imagine that you were carrying out a meta-analysis in a particular area of L2
research and wanted to investigate the quality of studies in your sample. How
would you operationalize and measure study quality?
14. What are some of the benefits and drawbacks of using benchmarks such as
Plonsky and Oswald’s (2014) to explain the magnitude of effects found in a
meta-analysis?
Note
∗ This chapter is an updated and adapted version of Plonsky, L., & Oswald, F. L. (2012).
How to do a meta-analysis. In A. Mackey & S. M. Gass (Eds.), Research methods in second
language acquisition: A practical guide (pp. 275–295). London: Basil Blackwell.
References
APA Publications and Communications Board Working Group on Journal Article Reporting Standards. (2008). Reporting standards for research in psychology: Why do we need them? What might they be? American Psychologist, 63, 839–851.
Aytug, Z. G., Rothstein, H. R., Zhou, W., & Kern, M. C. (2012). Revealed or concealed?
Transparency of procedures, decisions, and judgment calls in meta-analyses. Organiza-
tional Research Methods, 15, 103–133.
Borenstein, M., Hedges, L. V., Higgins, J.P.T., & Rothstein, H. R. (2009). Introduction to
meta-analysis. Chichester, UK: Wiley.
Cheung, S. F., & Chan, D. K-S. (2004). Dependent effect sizes in meta-analysis: Incorporat-
ing the degree of interdependence. Journal of Applied Psychology, 89(5), 780–791.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum.
Cohn, L. D., & Becker, B. J. (2003). How meta-analysis increases statistical power. Psychologi-
cal Methods, 8(3), 243–253.
Cooper, H. (2010). Research synthesis and meta-analysis: A step-by-step approach (4th ed).
Thousand Oaks, CA: Sage.
Cooper, H., Hedges, L. V., & Valentine, J. C. (Eds.). (2009). The handbook of research synthesis
and meta-analysis (2nd ed.). New York: Russell Sage Foundation.
Cooper, H. M., & Rosenthal, R. (1980). Statistical versus traditional procedures for sum-
marizing research findings. Psychological Bulletin, 87(3), 442–449.
Dalton, D. R., & Dalton, C. M. (2008). Meta-analyses: Some very good steps toward a bit
longer journey. Organizational Research Methods, 11(1), 127–147.
Fern, E. F., & Monroe, K. B. (1996). Effect-size estimates: Issues and problems in interpreta-
tion. Journal of Consumer Research, 23(2), 89–105.
Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher,
5, 3–8.
Gleser, L. J., & Olkin, I. (2009). Stochastically dependent effect sizes. In H. Cooper,
L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis
(2nd ed., pp. 357–376). New York: Russell Sage Foundation.
Harzing, A. W. (2007). Publish or Perish. Available from http://www.harzing.com/pop.htm.
Hattie, J. A., Biggs, J., & Purdie, N. (1996). Effects of learning skills interventions on student
learning: A meta-analysis. Review of Educational Research, 66(2), 99–136.
Hedges, L. V. (2008). What are effect sizes and why do we need them? Child Development
Perspectives, 2(3), 167–171.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. Orlando, FL: Academic
Press.
Hedges, L. V., & Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological Methods, 6(3), 203–217.
Higgins, J.P.T., White, I. R., & Wood, A. M. (2008). Imputation methods for missing out-
come data in meta-analysis of clinical trials. Clinical Trials, 5(3), 225–239.
Hunter, J. E., & Schmidt, F. L. (2014). Methods of meta-analysis: Correcting error and bias in research findings (3rd ed.). Thousand Oaks, CA: Sage.
In’nami, Y., & Koizumi, R. (2009). A meta-analysis of test format effects on reading and listening test performance: Focus on multiple-choice and open-ended formats. Language Testing, 26(2), 219–244.
In’nami, Y., & Koizumi, R. (2010). Database selection guidelines for meta-analysis in
applied linguistics. TESOL Quarterly, 44(1), 169–184.
In’nami, Y., & Koizumi, R. (Eds.) (2014). Research synthesis and meta-analysis in second
language learning and testing. Special issue of English Teaching and Learning.
Jeon, E. H., & Yamashita, J. (2014). L2 reading comprehension and its correlates:
A meta-analysis. Language Learning, 64, 160–212.
126 Luke Plonsky and Frederick L. Oswald
Kepes, S., McDaniel, M. A., Brannick, M. T., & Banks, G. C. (2013). Meta-analytic reviews in the organizational sciences: Two meta-analytic schools on the way to MARS (the meta-analytic reporting standards). Journal of Business and Psychology, 28, 123–143.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and
Psychological Measurement, 56(5), 746–759.
Kline, R. B. (2013). Beyond significance testing: Statistics reform in the behavioral sciences (2nd
ed.). Washington, DC: American Psychological Association.
Lee, S-K., & Huang, H-T. (2008). Visual input enhancement and grammar learning:
A meta-analytic review. Studies in Second Language Acquisition, 30(3), 307–331.
Lee, J., Jang, J., & Plonsky, L. (in press). The effectiveness of second language pronunciation
instruction: A meta-analysis. Applied Linguistics.
Li, S. (2010). The effectiveness of corrective feedback in SLA: A meta-analysis. Language
Learning, 60(2), 309–365.
Li, S., Shintani, N., & Ellis, R. (Eds.) (forthcoming). The complementary contribution of
meta-analysis and narrative review in second language acquisition research. Applied
Linguistics, special issue.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Lyster, R., & Saito, K. (2010). Oral feedback in classroom SLA: A meta-analysis. Studies in
Second Language Acquisition, 32(2), 265–302.
Mackey, A., & Goo, J. (2007). Interaction research in SLA: A meta-analysis and research
synthesis. In A. Mackey (Ed.), Conversational interaction in second language acquisition: A col-
lection of empirical studies (pp. 407–451). New York: Oxford University Press.
McManus, R. J., Wilson, S., Delaney, B. C., Fitzmaurice, D. A., Hyde, C. J., Tobias, R. S.,
Jowett, S., & Hobbs, F.D.R. (1998). Review of the usefulness of contacting other experts
when conducting a literature search for systematic reviews. British Medical Journal, 317,
1562–1563.
Morris, S. B. (2008). Estimating effect sizes from pretest–posttest-control group designs.
Organizational Research Methods, 11(2), 364–386.
Norris, J. M., & Ortega, L. (2000). Effectiveness of L2 instruction: A research synthesis and
quantitative meta-analysis. Language Learning, 50(3), 417–528.
Norris, J. M., & Ortega, L. (2006a). Synthesizing research on language learning and teaching.
Philadelphia, PA: John Benjamins.
Norris, J. M., & Ortega, L. (2006b). The value and practice of research synthesis for lan-
guage learning and teaching. In J. M. Norris & L. Ortega (Eds.), Synthesizing research on
language learning and teaching (pp. 3–50). Philadelphia, PA: John Benjamins.
Norris, J. M., & Ortega, L. (2007). The future of research synthesis in applied linguistics:
Beyond art or science. TESOL Quarterly, 41(4), 805–815.
Norris, J. M., & Ortega, L. (2010). Research Timeline: Research synthesis. Language Teach-
ing, 43, 61–79.
Ortega, L. (in press). Research synthesis. In B. Paltridge & A. Phakiti (Eds.), Companion to
research methods in applied linguistics. London: Continuum.
Orwin, R. G. (1994). Evaluating coding decisions. In H. Cooper & L. V. Hedges (Eds.),
Handbook of research synthesis (pp. 139–162). New York: Russell Sage Foundation.
Orwin, R. G., & Cordray, D. S. (1985). Effects of deficient reporting on meta-analysis:
A conceptual framework and reanalysis. Psychological Bulletin, 97(1), 134–147.
Oswald, F. L., & Johnson, J. W. (1998). On the robustness, bias, and stability of statistics from
meta-analysis of correlation coefficients: Some initial Monte Carlo findings. Journal of
Applied Psychology, 83(2), 164–178.
Oswald, F. L., & McCloy, R. A. (2003). Meta-analysis and the art of the average. In
K. R. Murphy (Ed.), Validity generalization: A critical review (pp. 311–338). Mahwah, NJ:
Lawrence Erlbaum.
Oswald, F. L., & Plonsky, L. (2010). Meta-analysis in second language research: Choices and
challenges. Annual Review of Applied Linguistics, 30, 85–110.
Pearson, K. (1904). Report on certain enteric fever inoculation statistics. British Medical
Journal, 3, 1243–1246.
Plonsky, L. (2011). The effectiveness of second language strategy instruction: A meta-
analysis. Language Learning, 61, 993–1038.
Plonsky, L. (2012). Replication, meta-analysis, and generalizability. In G. Porte (Ed.), Repli-
cation research in applied linguistics (pp. 116–132). New York: Cambridge University Press.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., & Brown, D. (2015). Domain definition and search techniques in meta-analyses
of L2 research (or why 18 meta-analyses of feedback have different results). Second Lan-
guage Research, 31, 267–276.
Plonsky, L., Egbert, J., & LaFlair, G. T. (in press). Bootstrapping in applied linguistics:
Assessing its potential using shared data. Applied Linguistics.
Plonsky, L., & Gass, S. M. (2011). Quantitative research methods, study quality, and out-
comes: The case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F. L. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Poltavtchenko, E., & Johnson, M. D. (2009, March). Feedback and second language writ-
ing: A meta-analysis. Poster session presented at the annual meeting of TESOL,
Denver, CO.
Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin,
85(1), 185–193.
Ross, S. (1998). Self-assessment in second language testing: A meta-analysis and analysis of
experiential factors. Language Testing, 15(1), 1–20.
Rothstein, H. R., Sutton, A. J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis: Prevention, assessment and adjustments. Chichester, England: Wiley.
Russell, J., & Spada, N. (2006). The effectiveness of corrective feedback for the acquisition
of L2 grammar: A meta-analysis of the research. In J. M. Norris & L. Ortega (Eds.),
Synthesizing research on language learning and teaching (pp. 133–164). Philadelphia: John
Benjamins.
Schmidt, F. L., & Hunter, J. E. (1977). Development of a general solution to the problem of
validity generalization. Journal of Applied Psychology, 62(5), 529–540.
Schmidt, F. L., Le, H., & Oh, I-S. (2009). Correcting for the distorting effects of study artifacts in meta-analysis. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 317–333). New York: Russell Sage Foundation.
Schmidt, F. L., Oh, I-S., & Hayes, T. (2009). Fixed versus random effects models in
meta-analysis: Model properties and an empirical comparison of differences in results.
British Journal of Mathematical and Statistical Psychology, 62(1), 97–128.
Spada, N., & Tomita, Y. (2010). Interactions between type of instruction and type of lan-
guage feature: A meta-analysis. Language Learning, 60(2), 263–308.
Stukas, A. A., & Cumming, G. (in press). Interpreting effect sizes: Towards a quantitative
cumulative social psychology. European Journal of Social Psychology.
Sutton, A. J., & Higgins, J. P. T. (2008). Recent development in meta-analysis. Statistics in
Medicine, 27(5), 625–650.
Taylor, A., Stevens, J. R., & Asher, J. W. (2006). The effects of explicit reading strategy train-
ing on L2 reading comprehension: A meta-analysis. In J. M. Norris & L. Ortega (Eds.),
Synthesizing research on language learning and teaching (pp. 213–244). Philadelphia, PA:
John Benjamins.
Truscott, J. (2007). The effect of error correction on learners’ ability to write accurately.
Journal of Second Language Writing, 16(4), 255–272.
Valentine, J. C., Pigott, T. D., & Rothstein, H. R. (2010). How many studies do you need?:
A primer on statistical power for meta-analysis. Journal of Educational and Behavioral
Statistics, 35(2), 215–247.
White, H. D. (2009). Scientific communication and literature retrieval. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 51–71). New York: Russell Sage Foundation.
Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychol-
ogy journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.
Wilson, D. B. (2009). Systematic coding. In H. Cooper, L. V. Hedges, & J. C. Valentine (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 159–176). New York: Russell Sage Foundation.
PART III
Advanced and
Multivariate Methods
7
MULTIPLE REGRESSION
Eun Hee Jeon
when Y′ is the predicted value of the CV, A is the intercept, Bs indicate the
parameters (regression coefficients) being estimated, Xs indicate the PVs, and k
represents the number of the PVs. This equation is also thought of as a “predic-
tion equation” (Tabachnick & Fidell, 2012, p. 123) as it yields the predicted (not
observed) value, Y′. Predicted Y′s are then correlated with observed values, Ys, to obtain the multiple correlation, R, the multivariate equivalent of the bivariate correlation, r. The squared value of the multiple correlation R, namely R2, denotes the amount of variance in the CV accounted for by the set of PVs in the equation. Less technically speaking, MRA
is a means to explain variance in the CV as a function of one or more PVs. Once
a well-fitting regression model is generated, it enables the researcher to closely
predict the value of the CV from the values of the PVs.
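To make the prediction equation and R2 concrete, here is a minimal sketch in Python (the dataset and variable values are invented for illustration): it solves the normal equations XᵀXb = Xᵀy for the intercept and regression coefficients, computes predicted Y′ values, and squares the correlation between Y′ and Y.

```python
from math import sqrt

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def regress(X, y):
    """Least-squares fit: returns [intercept, B1, ..., Bk]."""
    rows = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(XtX, Xty)

def r_squared(y, y_pred):
    """Square of the correlation between observed Y and predicted Y'."""
    my, mp = sum(y) / len(y), sum(y_pred) / len(y_pred)
    cov = sum((a - my) * (b - mp) for a, b in zip(y, y_pred))
    sy = sqrt(sum((a - my) ** 2 for a in y))
    sp = sqrt(sum((b - mp) ** 2 for b in y_pred))
    return (cov / (sy * sp)) ** 2

# Hypothetical scores: two PVs (say, vocabulary and grammar) predicting a CV.
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [2 + 3 * x1 + 0.5 * x2 for x1, x2 in X]  # CV built as an exact linear function

coefs = regress(X, y)
y_pred = [coefs[0] + coefs[1] * x1 + coefs[2] * x2 for x1, x2 in X]
print([round(c, 3) for c in coefs])        # [2.0, 3.0, 0.5]
print(round(r_squared(y, y_pred), 3))      # 1.0
```

Because the CV here is constructed as an exact linear function of the two PVs, the fitted coefficients recover the generating values and R2 is 1; with real (noisy) data, R2 would fall below 1.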
To use an example from second language (L2) research, MRA can be used
to examine how different components of reading comprehension such as L2
decoding, vocabulary knowledge, and grammar knowledge individually and col-
lectively predict the reading comprehension of an L2 reader. MRA can also be
compared to analysis of covariance (ANCOVA) in that it can be used to exam-
ine the predictive power of an individual PV after the variance in the CV due
to a certain PV or PVs has been partialled out. Going back to the example of
L2 reading research, if the researcher is interested in finding out whether L2
vocabulary knowledge still stands as an important predictor of L2 reading ability
after the variance due to L2 grammar knowledge is partialled out, he or she can
simply enter L2 grammar knowledge first and L2 vocabulary knowledge second
into the equation, then check whether L2 vocabulary still manages to explain a
statistically significant amount of reading variance, which is indicated by the R2
change between the first model (with grammar only) and the second model (with
grammar and vocabulary). As can be seen in the example of L2 reading research,
this feature of a (hierarchical) regression analysis makes it possible to compare CV
variances accounted for by different models comprised of different sets of PVs,
thereby determining the best-fitting model.
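The R2 change test described above can be computed directly. The values below (R2 of .30 for grammar alone, .42 after adding vocabulary, and n = 120) are hypothetical and serve only to illustrate the F test for an increment in R2:

```python
def f_change(r2_reduced, r2_full, n, k_full, m):
    """F test for the increment in R2 when m PVs are added to a model
    that then contains k_full PVs in total (n = sample size)."""
    numerator = (r2_full - r2_reduced) / m
    denominator = (1 - r2_full) / (n - k_full - 1)
    return numerator / denominator

# Hypothetical: grammar alone explains 30% of reading variance; adding
# vocabulary (m = 1 new PV) raises R2 to .42 with n = 120 learners.
F = f_change(r2_reduced=0.30, r2_full=0.42, n=120, k_full=2, m=1)
print(round(F, 2))  # 24.21
```

The resulting F is evaluated against the F distribution with (m, n − k_full − 1) degrees of freedom, here (1, 117), to decide whether the added PV explains a statistically significant increment of CV variance.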
MRA is not one analysis, but a family of analyses. This means that depending
on the purpose of the study and the type of research questions posed, there are
different analyses the researcher can choose. One of the primary factors determin-
ing the type of MRA to be used is the nature of the variables under investigation:
Is your CV categorical (e.g., admitted to a degree program vs. not admitted to a
degree program) or continuous (e.g., TOEFL score)? If the former, the appropri-
ate analysis would be logistic regression. In a similar vein, if your PVs include cat-
egorical variables with more than two levels, you would still use MRA. However,
because MRA can only handle categorical variables that are dichotomous, an
additional intermediate step, namely dummy variable coding, would be needed.
Let’s say, for example, that you hypothesize that first language (L1) background
(e.g., Spanish vs. Chinese vs. Russian) will affect reading comprehension of L2
English. In this case, rather than including a single three-level PV, you would cre-
ate two new dichotomous variables such as L1_Spanish and L1_Chinese, each
with possible values of 0 (not L1 Spanish/Chinese) or 1 (Spanish/Chinese).There
is no need to create a third variable for Russian because those participants would
be represented in the model by 0 in both of the other two newly created variables.
(In other words, when dummy coding, the number of new variables that need
to be created is equal to one fewer than the number of levels of the categorical
variable.)
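The dummy-coding rule (one fewer new variable than the number of levels of the categorical PV) can be sketched as follows; the L1 values and the choice of Russian as the reference level mirror the example above:

```python
def dummy_code(values, reference):
    """Create one dichotomous 0/1 variable per level, excluding the reference level."""
    levels = [lv for lv in dict.fromkeys(values) if lv != reference]
    return {f"L1_{lv}": [1 if v == lv else 0 for v in values] for lv in levels}

# Hypothetical participants' L1 backgrounds (three levels -> two dummy variables).
l1 = ["Spanish", "Chinese", "Russian", "Spanish", "Russian"]
dummies = dummy_code(l1, reference="Russian")
print(dummies)
# {'L1_Spanish': [1, 0, 0, 1, 0], 'L1_Chinese': [0, 1, 0, 0, 0]}
```

Russian speakers are identified by a 0 on both new variables, so no third variable is needed.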
Although few L2 researchers may be aware of this, if all the PVs are categorical, the mathematical equation yielded by MRA equals ANOVA or ANCOVA (Tabachnick & Fidell, 2012), and their effect sizes, despite the difference in labels (R2 in MRA and eta-squared, or η2, in ANOVA or ANCOVA), represent the same concept (see Cohen, 1968, and Chapter 11 in Howell, 2012, for a more detailed explanation of the link between R2 and η2 as parts of the general linear model, or GLM). Given the availability of
different types of analyses developed to address variables of a different nature, it is
crucial that the researcher not compromise the true nature of each variable and
select the most suitable analysis for the variables under investigation; for example,
converting a continuous variable (e.g., L2 proficiency level) to a categorical vari-
able (e.g., low, intermediate, high) and adopting an ANOVA instead of adopting
an MRA should be avoided whenever possible to preserve variance in continuous
PVs (for a relevant discussion, see Plonsky, 2013).
Because of the range of analyses that fall under the umbrella of MRA, it is not
uncommon for some graduate programs in quantitatively oriented disciplines (e.g.,
educational psychology, sociology) to offer semester-long seminars on this topic.
Therefore, before I progress further, I would like to note that this chapter should
be considered only as a guide to the most frequently used types of MRA in L2
research. Specifically, I will focus on MRAs that involve continuous PVs and a con-
tinuous CV. The additional steps necessitated by different members of the MRA
family will also be integrated into the discussion when appropriate. Last, I would
like to note that many of my explanations in this chapter are based on Cohen,
Cohen, West, and Aiken (2003) which, while being arguably the most definitive
volume on MRA in the market, may come across as too technical for novice to
intermediate users of MRA. In addition, Cohen et al. (2003) does not include a
section on how to use statistical packages to run relevant analyses.This chapter aims
to render the information in Cohen et al. (2003) more accessible to novice to inter-
mediate users of MRA and provide directions on how to run MRA using SPSS.
PVs have been removed) and the confidence interval (CI) set by the researcher
(e.g., 80%, 95%, 99%). Let’s imagine a situation where the researcher is inter-
ested in the amount of reading variance accounted for by the reader’s L1 lit-
eracy and L2 language knowledge (e.g., vocabulary, grammar). Based on previous
research or theory (e.g., Bernhardt & Kamil, 1995), the researcher knows that
three variables—namely, L1 literacy, L2 vocabulary knowledge, and L2 grammar
knowledge—explain about 50% of individual variance (R2) in L2 reading com-
prehension. Let’s suppose that the researcher currently has access to at least 120
participants available for data collection but wonders if this is a big enough sample. Based on these two conditions, the researcher can compute the squared standard error of R2 (i.e., SE²(R²)). The formula for SE²(R²) is as follows, where n and k respectively denote the number of currently available study participants (i.e., sample size) and the number of PVs (Cohen et al., 2003, p. 88):

SE²(R²) = 4R²(1 − R²)²(n − k − 1)² / [(n² − 1)(n + 3)]
Now, substituting R² with .50, n with 120, and k with 3 (L1 literacy, L2 vocabulary, L2 grammar), we get the following:

SE²(R²) = 4(.50)(1 − .50)²(120 − 3 − 1)² / [(120² − 1)(120 + 3)] = .0038
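The worked example can be reproduced in a few lines; the formula below is the one given above (Cohen et al., 2003, p. 88):

```python
def se2_r2(r2, n, k):
    """Squared standard error (sampling variance) of R2,
    per Cohen, Cohen, West, & Aiken (2003, p. 88)."""
    return (4 * r2 * (1 - r2) ** 2 * (n - k - 1) ** 2) / ((n ** 2 - 1) * (n + 3))

se2 = se2_r2(r2=0.50, n=120, k=3)
print(round(se2, 4))         # 0.0038
print(round(se2 ** 0.5, 3))  # 0.062 -- the standard error is the square root
```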
The standard error of R2 is the square root of this variance: SE(R²) = √.0038 ≈ .062. Given that the R2 is .50, if we choose the conventional 95% probability level, the CI is .50 ± 1.96 × .062, or approximately .38–.62. The interpretation is that, if samples like ours were drawn repeatedly from the population, intervals constructed this way would contain the population R2 95% of the time. If we chose the less conventional but more stringent level of 99% probability, the CI would be .50 ± 2.58 × .062, or approximately .34–.66; the greater confidence comes at the price of a wider interval. Since neither the 95% nor the 99% CI includes 0, which would discount the reliability of the observed value of R2, we can conclude that the current sample size and the number of PVs are appropriate to yield a reliable value.
Another point to consider when examining the CI is its range: a CI that is too wide fails to provide a precise estimate. To illustrate,
let’s suppose a situation where the 95% CI of R2 was .10–.90. Such a large CI
fails to offer useful information. In such a case, the researcher can adjust the CI
by increasing the sample size or by decreasing the number of predictor variables.
Once the data are collected and entered into a statistical software package such as
SPSS, the researcher can easily compute CIs of various probability levels. Step-by-
step instructions for computing a CI of the researcher’s choice are provided later
in this chapter.
Power. The technical definition of power is the probability of correctly rejecting a false null hypothesis or, more simply, the probability of detecting a statistically significant effect when the effect truly exists. To use an example from MRA, power is the probability of finding the R2 to be significantly different from 0 when it is in fact different from 0. The problem with low a priori power (i.e., an estimated power that is lower than .80 with the prospective sample
size prior to actual data collection) is evident. Even if the researcher somehow
managed to find a (seemingly) statistically significant finding, if the power was
very low to start with, the researcher risks claiming a statistical relationship where
it may not exist. Much like the procedures involved in the examination of preci-
sion discussed earlier, the computation of a priori power in the case of an MRA
also begins with locating the expected value of R2, based on previous research
and/or theory. For the sake of simplicity, let’s continue with the same example
we used earlier, namely, an R2 of .50. The researcher then selects a suitable probability level, or, to follow Cohen (1992), a significance criterion (e.g., α = .01 or α = .05). For now, let's go with the more conventional value of .05. The minimum sample
size (N = 120) and the number of predictors (k = 3) are the remaining determin-
ers we need to compute power.
With these determiners in hand, we first compute the population effect size (f²) using the following formula (Cohen et al., 2003, p. 92):

f² = R² / (1 − R²)

Replacing R² with .50, we get the following:

f² = .50 / (1 − .50) = 1
Now that we have the f² value, we use it to determine L, the value we need to identify power in the L table for the selected probability level (or significance criterion, i.e., α = .01 or α = .05) (Cohen et al., 2003). L is determined using the following formula (Cohen et al., 2003, p. 92):

L = f²(N − k − 1)

Continuing with our previous example of a reading MRA study, let's now replace f² with 1, N with 120, and k with 3. We then get:

L = 1(120 − 3 − 1) = 116
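A short sketch of the two formulas above:

```python
def cohen_f2(r2):
    """Cohen's effect size f2 for multiple regression."""
    return r2 / (1 - r2)

def l_value(f2, n, k):
    """Noncentrality parameter L used to look up power in Cohen's tables."""
    return f2 * (n - k - 1)

f2 = cohen_f2(0.50)
L = l_value(f2, n=120, k=3)
print(f2, L)  # 1.0 116.0
```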
With this L value and df (which is equal to k, and therefore equal to 3, in our case),
we now identify predicted power in the L table. For convenience in Figure 7.1
I provide the part of the L table relevant to this example. The values in the top
row are power and the values in the first column are df or k.
Since our df (or k) value is 3 and our L value is 116, we locate on the df = 3 row the entry closest to 116. The largest tabled entry in that row, 23.52, corresponds to a predicted power of .99; because our L of 116 exceeds even this value, our a priori power is at least .99, well above the .80 standard, and there is no need to increase the sample size.
Once the a priori checks are done and the data have been collected, we can
now submit the data to statistical analyses. However, prior to main analyses, the
researcher must first make sure that the data meet the assumptions of multivari-
ate analyses such as MRA (i.e., data screening), transform the data if they do not
meet the assumptions, and finally submit the data to the analysis proper, namely,
            Power
df (k)   0.1    0.3    0.5    0.6    0.7    0.75   0.8    0.85   0.9    0.95   0.99
1        0.43   2.06   3.84   4.90   6.17   6.94   7.85   8.98   10.51  13.00  18.37
2        0.62   2.78   4.96   6.21   7.70   8.59   9.64   10.92  12.65  15.44  21.40
3        0.78   3.30   5.76   7.15   8.79   9.77   10.90  12.30  14.17  17.17  23.52
4        0.91   3.74   6.42   7.92   9.68   10.72  11.94  13.42  15.41  18.57  25.24
5        1.03   4.12   6.99   8.59   10.45  11.55  12.83  14.39  16.47  19.78  26.73
FIGURE 7.1 Partial L value table (shortened from Cohen et al., 2003, p. 651)
an appropriate type of MRA. The same preparatory steps should be followed for
all types of MRA.
and all PVs). For example, if two PVs are highly correlated (r equal to or higher
than .90 or –.90) (Allison, 1999; Tabachnick & Fidell, 2012), you have a multi-
collinearity problem. In such a case, consider either collapsing the highly cor-
relating PVs into one variable or eliminating one of them from the analysis. This
old-fashioned approach to checking multicollinearity, however, is not a foolproof
solution because it is possible for all bivariate correlations to be in an acceptable
range even when multicollinearity is present (Allison, 1999). In order to avoid
such an oversight, Allison (1999) recommends that researchers refer to the Toler-
ance statistic or variance inflation factor (VIF), which is the multiplicative inverse
of Tolerance (VIF = 1/tolerance).
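As a sketch of the Tolerance/VIF logic (without SPSS): with three PVs, the auxiliary R2 needed for each tolerance can be obtained from the bivariate correlations via the standard two-predictor multiple-correlation formula. The correlations below are invented for illustration:

```python
def aux_r2(r_ja, r_jb, r_ab):
    """Squared multiple correlation of PV j regressed on two other PVs, a and b."""
    return (r_ja ** 2 + r_jb ** 2 - 2 * r_ja * r_jb * r_ab) / (1 - r_ab ** 2)

# Hypothetical bivariate correlations among three PVs (Voc, Grm, Metacog).
r_vg, r_vm, r_gm = 0.6, 0.5, 0.4

results = {}
for name, r2 in [("Voc", aux_r2(r_vg, r_vm, r_gm)),
                 ("Grm", aux_r2(r_vg, r_gm, r_vm)),
                 ("Metacog", aux_r2(r_vm, r_gm, r_vg))]:
    tolerance = 1 - r2          # proportion of the PV's variance NOT shared with the others
    vif = 1 / tolerance
    results[name] = (tolerance, vif)
    flag = "possible multicollinearity" if tolerance < 0.40 else "ok"  # Allison's (1999) rule of thumb
    print(f"{name}: tolerance = {tolerance:.3f}, VIF = {vif:.2f} ({flag})")
```

With these correlations, all three tolerances exceed .40 (all VIFs are below 2.50), so no multicollinearity would be flagged.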
Use these SPSS commands to obtain the Tolerance and VIF (see Figure 7.4):
Analyze > Regression > Linear. For Dependent, select one of the independent
variables (IVs). For Independent(s), select all other IVs under investigation. Click
Statistics in the Linear Regression dialogue box. Remove the check mark from
all items except Collinearity Diagnostics, then click Continue. Click the Plots tab on
the Linear Regression dialogue box, and make sure that nothing is selected, then
click Continue.
Now, in the Output view, you will get the following table.
Allison (1999) suggests that, as a rule of thumb, a tolerance value lower than .40 (a VIF higher than 2.50) indicates multicollinearity. As shown in Table 7.1, neither
the Tolerance values nor the VIFs are out of the acceptable range and there-
fore do not indicate a concern. Please note that this is only the first step of the
Tolerance/VIF statistic check. Now, we need to reiterate this process alternat-
ing the Dependent variable: We used “Voc” as the Dependent variable in the
first analysis, so this time we enter “Grm” as the Dependent variable and “Voc”
and “Metacog” as Independent variables. In the final step, “Metacog” will be
the Dependent variable and “Voc” and “Grm” will be Independent variables. If
TABLE 7.1 Coefficientsa [Tolerance and VIF values not reproduced]
multicollinearity is detected, you will have to decide how to handle this problem;
the simpler solutions include removing the most intercorrelated variable(s) from
the analysis or combining the two variables and using them as one variable. One
must take care, however, to avoid compromising the theoretical motivation of the
research by eliminating or combining variables.
Step 5, ensure a linear relationship. Check to see if the CV and PVs have
a linear relationship when observed pairwise and collectively. Linearity is one of
the assumptions of multivariate normality as Pearson’s r only captures linear rela-
tionships (Tabachnick & Fidell, 2012). You can check linearity by checking the
bivariate scatter plots of variables and residual plots. If some relationships are not linear despite the removal of univariate and multivariate outliers and transformation of problem variables (both of which have been completed in previous steps),
you might consider transforming the problem variable further to ensure linearity.
[FIGURE 7.5 Decision tree (partially recovered): Is the CV categorical? Yes → Logistic Regression; No → Multiple Regression Analysis. Do you want to determine the order of PV entry? Are you interested in the unique contribution of each and every PV? Yes → Standard Multiple Regression]
L2 vocabulary first, L2 grammar second, and metacognition scores last into the
equation, then examine the amount and statistical significance of incremental
reading variance at each step (i.e., change in total R2). This procedure is akin to
ANCOVA, where the effects of one independent variable (the covariate) are
removed or partialled out in order to isolate the effects of another.
MRA Type 3: stepwise regression analysis. Of the three types of MRA
introduced here, the most caution is advised when using stepwise regression
analysis. This is because unlike the first two types of MRA, the model specifica-
tion in stepwise regression analysis relies strictly on statistical criteria, namely,
the size of the correlation between the CV and PVs. To illustrate this point, let’s
take an example from forward selection (one of the three methods of stepwise
multiple regression, which include forward selection, backward deletion, and
stepwise regression). Let’s say that the PV with the highest correlation with the
CV, L2 reading comprehension, was L2 vocabulary knowledge. In the forward
selection method, the first PV to enter the equation is thus determined to be
L2 vocabulary. The contribution of L2 vocabulary includes both the unique
contribution made by L2 vocabulary and the potentially overlapping area with
another PV to be selected shortly. Next, in order to select the second PV, mod-
els including all possible pairs of PVs with L2 vocabulary as the default PV of
the two PVs (e.g., L2 vocabulary and L2 grammar, L2 vocabulary and meta-
cognition) are compared for their predictability, and the higher contributing
PV is selected as the second PV of the equation. Only the unique contribution
of the second PV is considered. As this example illustrates, model specification in stepwise regression is strictly statistical in nature (which is why stepwise regression analysis is also called statistical regression analysis). Should a researcher choose stepwise regression analysis over other types of MRA, the observed relative importance of a PV should therefore be considered with caution and in the context of previous research findings, theory, and sample size (see also Tabachnick & Fidell's, 2012, advice on this matter).
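The first step of forward selection described above can be sketched as follows; the scores are invented, and only the initial choice (the PV most strongly correlated with the CV) is shown:

```python
from math import sqrt

def pearson_r(x, y):
    """Bivariate (Pearson) correlation between two score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores: reading comprehension (CV) and three candidate PVs.
reading = [55, 60, 68, 74, 81]
pvs = {
    "vocabulary":    [40, 48, 55, 63, 70],
    "grammar":       [52, 47, 60, 58, 66],
    "metacognition": [30, 38, 33, 45, 41],
}

# Forward selection, step 1: enter the PV most strongly correlated with the CV.
first_pv = max(pvs, key=lambda name: abs(pearson_r(pvs[name], reading)))
print(first_pv)  # vocabulary
```

Subsequent steps would then compare models pairing the entered PV with each remaining candidate, crediting later entrants only with their unique contributions, which is exactly why the resulting ordering is statistical rather than theoretical.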
To help you choose the appropriate type of MRA, in Figure 7.5 I present a
decision tree designed for this purpose. As depicted in the diamond in the upper
left corner of the diagram, your first decision hinges on whether the CV is cat-
egorical or continuous. If the CV is categorical, the appropriate analysis is logistic
regression. If the CV is continuous, however, the researcher should determine the
type of MRA by navigating further along the tree. The two types of MRA that
will be further discussed in this chapter are marked with ovals.
enter the CV of your choice. For Independent(s), simultaneously select all the
PVs of your choice. Click the Statistics tab to make selections for statistics of
interest. Here I selected model fit (probably the most important information),
CIs (notice you can adjust the probability level of CIs), Durbin-Watson (to check
for the independence of observation/independence of residuals). Click Continue.
In the Linear Regression dialogue box, click the Plots tab and select *ZRESID
(short for z residual) for the y-axis and *ZPRED (short for z predictor) for the
x-axis as illustrated in Figure 7.7.
By making these selections, you can create a residual scatter plot using stan-
dardized scores (thus the labels “z residual” and “z predictor”) and can check the
normality of residual distribution; if you have normality, the residual scatter plot
should reveal a pile of residuals in the center of the plot, which should resemble
a rectangular shape with residuals trailing off symmetrically in all four directions
from the center of the rectangle. In the next two figures I present two plots,
one of which shows normality (Figure 7.8) and the other a lack of normality (Figure 7.9). If a lack of normality is detected, it is recommended that the
researcher transform the data appropriately to achieve normality.
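What SPSS labels *ZRESID can be approximated by z-scoring the raw residuals (SPSS standardizes using the model's standard error of estimate, but the idea is the same); the residuals below are hypothetical:

```python
import statistics

# Hypothetical raw residuals from a fitted regression model.
residuals = [1.2, -0.8, 0.3, -1.5, 0.9, -0.1]

mean = statistics.mean(residuals)
sd = statistics.pstdev(residuals)
zresid = [(r - mean) / sd for r in residuals]  # roughly what SPSS labels *ZRESID

# Standardized residuals have a mean of ~0 and a standard deviation of ~1,
# which is what makes the rectangular, centered scatter pattern interpretable.
print(statistics.mean(zresid), statistics.pstdev(zresid))
```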
FIGURE 7.6 SPSS standard multiple regression dialogue boxes: the first dialogue box
and selections in the Statistics tab
FIGURE 7.7 SPSS standard multiple regression dialogue boxes: selections in the Linear
Regression Plots dialogue box
[FIGURE 7.8 Residual scatterplot (dependent variable: gtelprc): regression standardized residuals plotted against regression standardized predicted values]
[FIGURE 7.9 Residual scatterplot (dependent variable: psedcomp): regression standardized residuals plotted against regression standardized predicted values]
1. Model: As noted earlier, for standard multiple regression, the number should
be 1, indicating one model was generated.
2. R: This is what we call the multiple correlation coefficient, a multivariate equivalent of r (the correlation coefficient between two variables). Unlike r, however, R ranges from 0 to 1; it is an index of how well the CV is predicted by the set of PVs.
3. R Square (R2): As the name indicates, this is computed by multiplying R by
itself (.691 × .691), and is the proportion of variance in the CV accounted
for by the PVs. In other words, an R2 of .478 indicates that 47.8% of the vari-
ance in the CV is accounted for by the PVs.
4. Adjusted R Square: R2 is based on the study sample, not on the population from which the sample was drawn. For this reason, the R2 value tends to be inflated (or positively biased). Adjusted R2 takes this bias into account (thus the term "adjusted") and provides a more conservative value.
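Adjusted R2 can be computed by hand from R2, n, and k. The R2 of .478 matches the output discussed above; the sample size of n = 85 is a hypothetical value used only for illustration:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2: shrinks R2 to correct its positive (sample-based) bias."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R2 = .478 as in the output above; n = 85 is hypothetical.
print(round(adjusted_r2(0.478, n=85, k=3), 3))  # 0.459
```

Note that the shrinkage grows as k increases relative to n, which is why adjusted R2 penalizes models that chase variance with many PVs in small samples.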
The third table you should pay attention to is the ANOVA table. You might
wonder why there is an ANOVA table in the MRA output. The reason for this is
that an R2 value cannot be tested for statistical significance as it simply indicates
the proportion of the variance in the CV accounted for by the PVs. How do we
test, then, the statistical significance of the regression model that we have just generated? In other words, how can we determine that knowing the value of a certain PV allows us to predict the value of the CV statistically significantly better than when we don't know the value of the PV (i.e., when the regression coefficient of this PV is 0 and creates a flat line with no slope, which is essentially the null hypothesis of the MRA)? In the case of group comparison (i.e., a categorical PV), we test
whether or not participants’ group membership (treatment group 1 vs. treatment
group 2 vs. treatment group 3) provides extra information about the mean (i.e.,
the null hypothesis of ANOVA). Do you now see that although we use MRA
and ANOVA to investigate different types of research questions, they both rely on
similar principles? In fact, we can think of ANOVA as a type of MRA in which
the PV(s) are all categorical. This is why we use the F-ratio to examine the statistical
significance of MRA as well (see Table 7.4).
[Table 7.4: ANOVA output]
1. Take a look at the “Mean Square” column. This is where the mean sum of
squares of the regression model and that of the residual are reported. The
former divided by the latter (381.573/15.265) is expressed as the F-ratio
(24.996) in the next column.
2. Check the “Sig.” column for the associated significance level of this F-ratio.
It is .000, which is smaller than the conventional .05 probability level, indicating
that the departure of the regression line from a flat line is beyond what would be
expected by chance. Since the model is statistically significant, we can now
continue to report other details of the model.
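The arithmetic behind this table can be reproduced from the reported values. In this R sketch, the two mean squares are taken from the output above; the degrees of freedom (3 for the regression, 82 for the residual) are assumptions for illustration, since they depend on the number of PVs and participants.

```r
# F-ratio = mean square of the regression model / mean square of the residual
ms_regression <- 381.573
ms_residual   <- 15.265
F_ratio <- ms_regression / ms_residual
round(F_ratio, 3)   # approximately the 24.996 reported by SPSS

# The "Sig." value is the upper-tail probability of this F-ratio
p <- pf(F_ratio, df1 = 3, df2 = 82, lower.tail = FALSE)
p < .05             # TRUE: the model is statistically significant
```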
The next table of interest (Table 7.5), “Coefficients,” reports the regression
coefficients (B) and their 95% CIs.
How to read this table:
[Table 7.5: Coefficients output]
expected to increase by an average of .171 units. Also note that although this
was not the case with the current example, it is possible to have a negative
coefficient (e.g., –.171). In such a case, the interpretation would be in the
reverse direction: e.g., for every additional unit in testing anxiety, reading com-
prehension test performance is expected to decrease by an average of .171 unit.
2. The “Sig.” column shows the significance level of each regression coefficient.
In our case, only the variable “Grm” (Grammar test) has a statistically signifi-
cant coefficient.
3. The “95.0% CI for B” columns show the 95% CI associated with each
regression coefficient. You can see that the CIs of the two nonsignificant
regression coefficients (“Vocabulary” and “Metacognition”) both include 0,
indicating lack of reliability associated with their coefficients.
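The correspondence between a coefficient’s “Sig.” value and whether its 95% CI includes 0 can be illustrated with a small R simulation (the variables x1 and x2 are hypothetical; only x1 truly predicts y):

```r
set.seed(2)
n  <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.8 * x1 + rnorm(n)   # x2 contributes nothing

fit <- lm(y ~ x1 + x2)
ci  <- confint(fit, level = 0.95)   # 95% CI for each coefficient
p   <- summary(fit)$coefficients[, "Pr(>|t|)"]

# A coefficient is significant at .05 exactly when its 95% CI excludes 0
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
cbind(ci, p = round(p, 3), excludes_zero)
```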
FIGURE 7.10 SPSS hierarchical regression analysis dialogue boxes: selections of PVs
for the first model
1. You will notice in the “Model” column that two models are presented. This
is because this hierarchical regression analysis examines whether Model 2,
which includes grammar, vocabulary, and metacognition, offers a signifi-
cantly better fit than Model 1, which only includes grammar and vocabulary.
Check the variable names and their corresponding models to make sure that
you entered the PV (or a set of PVs in case you entered multiple PVs) at the
correct step.
FIGURE 7.11 SPSS hierarchical regression analysis dialogue boxes: selections of PV for
the second and final model and selection of statistics
Now, let us review the next table, Model Summary (Table 7.7).
How to read this table:
1. Model column: Both models here are standard regression models with the
same CV but with different sets of PVs; Model 1 has two PVs (grammar and
vocabulary) while Model 2 has three (grammar, vocabulary, and metacogni-
tion). Interpretation of R, R2, Adjusted R2 for each model is, therefore, the
same as that of standard multiple regression (see above).
2. Change Statistics: This is what distinguishes hierarchical regression from standard
multiple regression. The R2 change of Model 2 indicates the increase in the pro-
portion of the variance in the CV explained when the full model (i.e., Model 2)
includes metacognition as the third PV. The statistical significance of the difference
between Model 1 and Model 2 can also be tested using the F-test, and the result
is reported in the “Sig. F Change” column. In our case, the addition of the third
PV did not result in a statistically significant change. Therefore, including
metacognition as the third PV, although it would be helpful in explaining a small
additional amount of variance in the CV, would not be helpful in pursuing a
parsimonious model. Further evidence of the lack of variance accounted for by
metacognition can also be observed in the nearly identical R2 values for the two models.
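The same sequential logic can be sketched in R with two nested ordinary regression models; anova() on the pair reproduces the R2-change F-test (“Sig. F Change”). The data are simulated, with metacognition deliberately contributing no real variance:

```r
set.seed(3)
n    <- 80
grm  <- rnorm(n)
voc  <- rnorm(n)
meta <- rnorm(n)
reading <- 0.7 * grm + 0.3 * voc + rnorm(n)

m1 <- lm(reading ~ grm + voc)          # Model 1: covariates only
m2 <- lm(reading ~ grm + voc + meta)   # Model 2: full model

r2_change <- summary(m2)$r.squared - summary(m1)$r.squared
r2_change          # increase in variance explained by adding meta
anova(m1, m2)      # F-test of whether that increase is significant
```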
TABLE 7.6 SPSS output for variables entered/removed in hierarchical regression model
[Table 7.7: Model Summary output]
The following ANOVA table (Table 7.8) reports on the statistical significance
of the two models generated in this analysis.
How to read this table: Since both models are essentially standard multiple
regression models, you can read this table just as you would read the ANOVA
table of standard regression analysis output described earlier. Here we can see
from the last column, “Sig.”, that both Model 1 and Model 2 are statistically sig-
nificant, although as I discussed previously, the latter model lacks parsimony and
therefore is not recommended.
The last table to note is the Coefficients table (Table 7.9). Again, the interpre-
tation of coefficients for each model is the same as that of the previously reviewed
standard multiple regression model. Since the focus of a study that employs a
hierarchical regression analysis is often on the full model with all PVs included, the
reporting of the coefficients of Model 2 is likely to be your primary task.
How to read this table:
1. The regression weights for each PV in Model 1 and Model 2 are reported
in the “B” column under “Unstandardized Coefficients.” The statistical
significance of each regression weight is reported in the “Sig.” column, indi-
cating that in both models, only grammar had a statistically significant regres-
sion weight.
TABLE 7.8 SPSS output for ANOVA resulting from hierarchical regression
[Table 7.9: Coefficients output]
2. The 95% CIs associated with each regression coefficient are reported in the
rightmost column. Here we can see that in Model 1, the 95% CI for vocabu-
lary included 0, and that in Model 2, the 95% CIs for both vocabulary and
metacognition included 0, indicating a lack of reliability associated with their
regression weights.
STUDY BOX 1
Jeon, E. H. (2012). Oral reading fluency in second language reading. Reading in a
Foreign Language, 24(2), 186–208.
Background
Despite increasing interest in fluency and its role in L2 reading, investigation
of fluency in the context of other key reading components is scarce. This
study aimed to (a) expand the current understanding of L2 oral reading flu-
ency by identifying its relationship with other key reading predictors (e.g.,
decoding, vocabulary knowledge, grammar knowledge, and metacognition),
and (b) examine the predictive power of oral reading fluency on L2
reading comprehension, thereby examining the potential of reading fluency
as a proxy for L2 reading comprehension.
Research Questions
1. How does oral reading fluency relate to other components of L2
reading?
2. Are word-level reading fluency and passage reading fluency substan-
tially different from each other? If so, why?
3. Can oral passage reading fluency be considered a proxy for L2 reading
comprehension among the present study participants?
Method
A total of 255 10th graders in South Korea who had been studying English for 7.5 years
were assessed on nine variables (three fluency variables, five other key read-
ing components, and reading comprehension): pseudoword reading, word
reading, passage reading, morphological awareness, word knowledge,
grammar knowledge, listening comprehension, metacognitive awareness,
reading comprehension.
Statistical Tools
Pseudoword reading, word reading, and passage reading scores were used
as predictor variables and reading comprehension was used as the criterion
variable in an MRA. Four hierarchical regression analyses were carried out,
alternating the entry order each time.
Results
The regression analysis results showed that the three reading fluency vari-
ables collectively explained a statistically significant 21.2% (p < .001) of vari-
ance in silent reading comprehension and that passage reading fluency was
a more potent explanatory variable than word-level fluency variables. As the
first variable to enter the regression, oral passage reading fluency explained
a significant 20.9% (p < .001) of reading variance. When entered follow-
ing the Pseudoword Reading Test and the Word Reading Test, the Passage
STUDY BOX 2
Jin, T., & Mak, B. (2013). Distinguishing features in scoring L2 Chinese speaking
performance: How do they work? Language Testing, 30 (1), 23–47.
Background
Research on the link between distinguishing features (fluency, vocabulary)
and overall oral proficiency is well-established in L2 English but not in L2
Chinese. This study aims to investigate the predictive power of seven dis-
tinguishing features representing four constructs (pronunciation, fluency,
vocabulary, grammar) on holistically graded speaking performance.
Research Questions
1. What is the relationship between each individual distinguishing feature
and the speaking test scores?
2. What is the contribution of distinguishing features to speaking test scores?
Method
A total of 66 advanced L2 Chinese learners and two raters participated in the study.
Pronunciation (number of target-like syllables per 10 syllables), fluency
(speech rate and pause time), vocabulary (word tokens and word types),
and grammar (grammatical accuracy and grammatical complexity) were
assessed. Speaking ability was measured through three test tasks, each of
which included integrated and independent tasks.
Statistical Tools
A bivariate correlation matrix showed that six of the seven distinguishing fea-
tures were significantly correlated with speaking test scores. As a result, two
standard multiple regressions were carried out with those six distinguishing
features as predictor variables and speaking test scores as the criterion vari-
able. In the first regression, one of the two vocabulary measures (i.e., word
tokens) was used and in the second regression, the other vocabulary mea-
sure (i.e., word types) was used.
Results
Total R2s yielded by the first and second regression models were very high at
.79 and .77, respectively. In both regressions, target-like syllables, grammatical
accuracy, and word tokens and types were found to be significant predictor vari-
ables. These results provided empirical support that distinguishing features and
holistic speaking test scores are linked among advanced L2 Chinese learners.
Further Reading
General Textbooks
1. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/cor-
relation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum
Associates.
This is probably the most detailed volume on MRA for psychologists and social
scientists. The volume is full of in-depth discussion on theoretical and mathematical
aspects of MRA, but has little connection with statistical packages.
2. Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Pine Forge Press.
This volume is entirely devoted to MRA. Compared to Cohen et al. (2003), it is
slightly less technical and geared toward novice users of MRA.
3a. Field, A. (2013). Discovering statistics using IBM SPSS statistics. Los Angeles: Sage.
3b. Howell, D.C. (2012). Statistical methods for psychology (8th ed.). Pacific Grove, CA:
Wadsworth Publishing.
3c. Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS.
New York: Routledge.
3d. Stevens, J. P. (2002). Applied multivariate statistics for the social sciences (4th ed.). Mahwah,
NJ: Lawrence Erlbaum Associates.
3e. Tabachnick, B. G., & Fidell, L. S. (2012). Using multivariate statistics (6th ed.). Boston,
MA: Pearson.
All five volumes listed are comprehensive statistics textbooks and include a chapter
on MRA. They include conceptual and mathematical explanations as well as
commands for statistical packages (e.g., SPSS, SAS). These textbooks are used widely for
graduate level statistical courses for students in psychology, social sciences, and in the
case of Larson-Hall (2010), for applied linguistics.
Discussion Questions
1. Review the past five years’ issues of one or more L2 journals to locate studies
that used MRA. For the studies that used MRA, tally the types of MRA based
on their frequency of use. Is there a particularly frequently used type of MRA
within a certain subdiscipline of applied linguistics (e.g., language testing, soci-
olinguistics, language proficiency research)? If so, why do you think this is?
2. When using an MRA (and, indeed, any type of modeling analysis),
researchers care about identifying a model that fits the data well but
that is, at the same time, parsimonious. Why is model parsimony important?
Review the MRA studies collected for Discussion Question 1. Do you think
all of them struck a happy medium between model fit and parsimony? Did
any of the studies sacrifice one for the other?
3. Jeon (2012) and Jiang, Sawaki, and Sabatini (2012) both used hierarchical
regression analysis to investigate a similar issue. Read both articles as a set and
see how the two articles converse with each other both theoretically and
methodologically. To what extent are their respective uses of MRA informed by and
justified according to the predictions of theory and of previous research?
4. Jin and Mak (2013) showcases the use of standard multiple regression in a
testing setting. What were the study’s PVs and CV, and why was standard
multiple regression chosen for this study? Can you imagine other instances when
MRA might be appropriate in the context of research in L2 assessment?
Note
1. For example, I know from previous research (Jeon & Yamashita, 2014) that L2 grammar
and L2 vocabulary are the two strongest correlates of L2 reading comprehension.
However, other researchers have noted that metacognition is also an important reading
predictor. For this reason, I am entering vocabulary and grammar as the two covariates
in the first block of this analysis.
References
Allison, P. D. (1999). Multiple regression: A primer. Thousand Oaks, CA: Pine Forge Press.
Bernhardt, E. B., & Kamil, M. L. (1995). Interpreting relationships between first language
and second language reading: Consolidating the Linguistic Threshold and the Linguis-
tic Interdependence Hypotheses. Applied Linguistics, 16, 15–34.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70, 426–443.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Howell, D.C. (2012). Statistical methods for psychology (8th ed.). Pacific Grove, CA: Wads-
worth Publishing.
Jeon, E. (2012). Oral reading fluency in second language reading. Reading in a Foreign Lan-
guage, 24(2), 186–208.
Jeon, E., & Yamashita, J. (2014). L2 reading comprehension and its correlates: A meta-analysis.
Language Learning, 64, 160–212.
Jiang, X., Sawaki, Y., & Sabatini, J. (2012). Word reading efficiency and oral reading fluency
in ESL reading comprehension. Reading Psychology, 33, 323–349.
Jin, T., & Mak, B. (2013). Distinguishing features in scoring L2 Chinese speaking perfor-
mance: How do they work? Language Testing, 30(1), 23–47.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Stevens, J. (1996). Applied multivariate statistics for the social sciences. Mahwah, NJ: Lawrence
Erlbaum Associates.
Tabachnick, B. G., & Fidell, L. S. (2012). Using multivariate statistics (6th ed.). Boston,
MA: Pearson.
8
MIXED EFFECTS MODELING
AND LONGITUDINAL DATA
ANALYSIS
Ian Cunnings and Ian Finlayson
Introduction
Consider a study that investigates two ESL teaching strategies. A researcher might
recruit participants from two schools and administer a course based on one teaching
strategy at each school. Participants’ proficiency would be tested at the
start and end of the course, and potentially a number of times in between, and the
relative increase in proficiency over time would be used as an indicator of which
strategy (if any) leads to a greater increase in proficiency. In a statistical analysis of
this type of study, the researcher will of course want to assess whether the influ-
ence of the independent variable, “teaching strategy,” on the English proficiency
of the participants sampled is likely to generalize to the wider population of
English language learners. The influence of “teaching strategy” on the dependent
variable, “English proficiency,” is modeled statistically as a fixed effect. A random
effect parameter in a statistical analysis models the random variance across the
participants tested. The researcher will want to assess whether the influence of the
fixed effect generalizes beyond the participants sampled to the wider population,
while taking into account any random variation observed. Simply put, a mixed
effects model is a statistical model that contains both fixed and random effects.
This hypothetical study is an example of a longitudinal design, as participants
are tested at multiple points in time. Longitudinal studies provide an important
tool to the second language (L2) researcher, as they provide the opportunity to
investigate how any number of factors may affect L2 acquisition over time. In
this chapter, we provide an overview of how longitudinal data can be analyzed
using mixed effects models. Mixed effects models have a number of properties
that mark them as particularly useful for L2 researchers interested in longitudinal
analysis or other research designs. Mixed effects models can be used to analyze a
variety of types of data and offer an alternative to the near ubiquitous use of t-tests
and ANOVA in the field (Lazaraton, 2000; Norris & Ortega, 2000; Plonsky, 2013;
Plonsky & Gass, 2011). We first discuss how mixed effects models might benefit
L2 researchers, before providing a practical example of how longitudinal mixed
effects data analysis can be conducted.
participants, but also potential random variation arising from the way students are
clustered into classes.
Observations can also cluster in a nonnested fashion. For example, students
within the same ESL class might come from different first language (L1) back-
grounds, and students with the same L1s might be spread across different classes in
a school. In this case, although students are hierarchically nested into both classes
and L1s, classes and L1s are not nested. Rather, classes and L1s are crossed at the
same level of sampling. In addition to nested random effects, mixed effects models
can also include crossed random effects to model factors that are crossed, as in
classes and L1s in this example, at the same level of sampling (Raudenbush, 1993).
The ability to model nested and crossed random effects provides a new solu-
tion to an old problem in language research, namely Clark’s (1973) “language-as-
fixed-effect fallacy.” Clark argued that in language research, just as participants are
sampled from a wider population, so too the linguistic materials or target features
tested are also sampled from a wider population of materials or features that share
the same properties. As language researchers usually want to test if results general-
ize both to the wider population of people and the wider population of linguistic
materials, Clark argued both sources of random variance need to be taken into
account. A long-standing solution to this issue has been to conduct two separate
analyses of a given data set, one in which the data are averaged over the sampled
subjects (the F1 analysis) and a second in which they are averaged over the sampled
linguistic items (the F2 analysis).
A result is considered significant if it is reliable by both subjects and items.
However, conducting separate subjects and items analyses is not a full solution
to Clark’s problem: Although the subjects analysis takes into account random sub-
ject variance and the items analysis random item variance, neither analysis takes
both sources of random variance into account at the same time. On a practical
level, it is also difficult to interpret a result that is reliable in one analysis but not
the other. Mixed effects models offer an alternative solution. In language research,
the subjects sampled are tested on a series of linguistic items, and the same lin-
guistic items are tested across subjects. In this way, subjects and items are crossed
at the same level of sampling. Mixed effects models with crossed random effects
for subjects and items allow both subject and item variance to be accounted
for in a single analysis and thus provide a better solution to Clark’s language-as-
fixed-effect fallacy than separate F1 and F2 analyses (Baayen, Davidson & Bates,
2008; Locker, Hoffman, & Bovaird, 2007).
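In the lme4 formula notation introduced later in this chapter, crossed random effects are simply two separate random-intercept terms. The sketch below only constructs the formula; rt, condition, subject, and item are hypothetical variable names:

```r
# Crossed random intercepts for subjects and items in one model
f <- rt ~ condition + (1 | subject) + (1 | item)
class(f)      # a formula object, ready to pass to lme4::lmer()
all.vars(f)   # the variables a data frame would need to supply
```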
Another advantage of mixed effects models over ANOVA is their flexibility in
the types of independent variables that may be considered. Like other types of
regression analyses, mixed effects models allow us to model variance due to
continuous as well as categorical predictors. In a longitudinal study, where changes
over time are of particular interest, this makes mixed effects models an attractive
option for data analysis. Returning to our earlier example, we may wish to take
multiple measurements during the length of our study to explore how proficiency
improves over time. With ANOVA, this could be analyzed as differences between
& Nilsson (2013) and Meunier & Littre (2013; see Sample Study 1) have used
mixed effects analysis of longitudinal L2 data. In the next section, we discuss how
such analysis can be carried out. While our analysis involves a fictional longitu-
dinal study taking place over a matter of months, different types of longitudinal
effects can be analyzed with mixed effects models. This can include, for exam-
ple, effects relating to how participants perform over the course of an individual
experiment, as well as investigations of change over longer periods of time.
SAMPLE STUDY 1
Meunier & Littre (2013). Tracking learners’ progress: Adopting a dual ‘corpus cum
experimental data’ approach. The Modern Language Journal, 97, 61–76.
Background
The acquisition of tense and aspect marking in L2 English has been
well-researched in second-language acquisition (SLA). Meunier and Littre
conducted a longitudinal corpus-based analysis to investigate which prop-
erties of tense and aspect marking remain difficult to master even after a
number of years of exposure to English.
Results
The results showed that tense and aspect error rates reduced over time.
Certain properties of the English progressive, however, continued to present
considerable difficulties. Meunier and Littre used the results of the mixed
effects corpus analysis to inform construction of an experimental gram-
maticality judgment task testing the acquisition of specifically those struc-
tures that were found particularly difficult to acquire. This type of combined
approach to the study of L2 acquisition, facilitated by mixed effects analy-
sis of longitudinal data, thus provides an opportunity to gain an in-depth
understanding of developmental patterns in L2 acquisition that is not pos-
sible with traditional analyses that solely rely on cross-sectional designs.
Practical Example
The example data set we discuss in this section is longitudinal, although the issues
raised are of general relevance to mixed effects analysis. The example uses the R
software package (R Development Core Team, 2014). Mixed effects analyses can also be
conducted in SPSS, SAS and STATA. R is a command-line driven application that
readers used to the menu system of SPSS might initially find taxing. It is beyond the
scope of this chapter to provide a comprehensive introduction to R syntax, but the
reader is directed to the Further Reading section for some recommended reading.
In addition to the functionality of the basic installation of R, additional
packages can be downloaded to perform specific analyses. This chapter will mainly
employ the lme4 package (Bates, 2005), which provides an up-to-date
implementation of linear mixed effects models. Our analysis uses lme4 version
1.1–7. Different versions may display slightly different results. We also note useful
functions in the psych (Revelle, 2014) and car (Fox & Weisberg, 2011) packages.
Consider again our fictional study that tests two English language teaching
strategies (Strategy A and Strategy B). To test the strategies, one group of L2
English learners is taught using Strategy A and a second using Strategy B.
The two groups’ English proficiency is assessed at the start of the course and at
four additional points over the course of instruction. English proficiency is used as
the dependent variable to assess the relative effectiveness of each teaching strategy.
A simulated data set for this study can be found in the Longitudinal.RData
supplementary file, available on the companion website for this book (http://oak.
ucc.nau.edu/ldp3/AQMSLR.html). Longitudinal.RData contains a data frame
called scores which contains the longitudinal data. A data frame is an R object
that contains a table of rows, each containing an individual observation, and col-
umns, which each contain a different variable. To display the first six rows we can
use the function head().
> head(scores)
student class time course gender L1 age exp prof
1 1 1 0 A M J 27 3 12
2 1 1 6 A M J 27 3 22
3 1 1 12 A M J 27 3 27
4 1 1 18 A M J 27 3 36
5 1 1 24 A M J 27 3 36
6 2 1 0 A F J 31 4 15
The first column, Student, identifies the study’s 156 participants. The Class
column groups these students into six classes. Proficiency is graded at five points
in time from the start of the course onwards in the Time column (0 months, 6
months, 12 months, 18 months, and 24 months), which is why the data for Student
1, for example, occupy five rows. The data include cells missing at random to
simulate students missing particular tests (e.g., Student 13 was tested only four times).
The Course column identifies the main independent variable, “teaching strat-
egy” (A or B). In this between-groups design, Classes 1–3, comprising Students
1–74, were tested on Strategy A and Classes 4–6, comprising Students 75–156,
were tested on Strategy B. The next three columns contain additional informa-
tion about the participants, including their gender, L1, and age. The Exp column
provides a measure of previous exposure to English, in terms of the number of
months that each participant has spent in an English-speaking country. Finally,
“prof,” the dependent variable, provides the proficiency score for each student at
each of the five test points.
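The long format described above, one row per student per test point, can be sketched with a tiny hypothetical data frame (values invented for illustration; the real scores data frame is on the companion website):

```r
toy <- data.frame(
  student = c(1, 1, 2, 2),
  class   = c(1, 1, 1, 1),
  time    = c(0, 6, 0, 6),
  course  = c("A", "A", "A", "A"),
  gender  = c("M", "M", "F", "F"),
  L1      = c("J", "J", "J", "J"),
  age     = c(27, 27, 31, 31),
  exp     = c(3, 3, 4, 4),
  prof    = c(12, 22, 15, 24)
)
head(toy)   # same column layout as head(scores) above
```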
Before analyzing the data we use the describeBy() function in the psych
package to calculate descriptive statistics. This function provides descriptive
statistics for the Prof column of the scores data frame as grouped by the Time and
Course columns. Note that describeBy(), which is similar to the Explore function
in SPSS, computes additional statistics, but the output shown next has been
edited to save space.
> describeBy(scores$prof, group = list(scores$time, scores$course))
Time  Strategy   n   Mean     SD  Median
   0         A  71  18.10   7.59      18
   0         B  78  20.17   8.68      20
   6         A  72  25.49   9.62      25
   6         B  79  37.95  14.79      38
  12         A  72  37.68  13.94      38
  12         B  81  57.40  15.34      56
  18         A  71  48.86  16.65      50
  18         B  77  73.91  13.88      76
  24         A  70  58.81  17.01      58
  24         B  78  83.56  11.54      86
The data show the average proficiency scores for Strategy A and B at five test
points. While proficiency is similar at month 0 (18 and 20 for Strategy A and B
respectively), by month 24 the average proficiency for Strategy B (84) is higher
than Strategy A (59), suggesting Strategy B is more effective. Due to limitations of
space, we do not discuss how these data could be visualized in detail. The sources
mentioned in the Further Reading section provide detail on how data can be
visualized in R (see also Hudson, Chapter 5 in this volume, for general discussion
of data visualization).
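If the psych package is unavailable, comparable group means can be computed with base R's aggregate(). This sketch uses a small simulated stand-in for the scores data frame:

```r
set.seed(4)
sim <- data.frame(
  course = rep(c("A", "B"), each = 10),
  time   = rep(c(0, 24), times = 10),
  prof   = round(runif(20, min = 10, max = 90))
)
# Mean proficiency for each time-by-strategy cell
aggregate(prof ~ time + course, data = sim, FUN = mean)
```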
To test for differences between teaching strategies we use the lmer() function
in the lme4 package to fit a mixed effects model to the data. Before fitting the
statistical model, first consider the steps required in this analysis. The first step is
to consider the distribution of the dependent variable and decide which type of
model to fit. In this study, assume that the researcher had access to the students’
proficiency scores as graded by the class teacher, and as such we use a linear mixed
effects model. We first check whether the dependent variable follows a normal
distribution. We visually check the distribution using the qqnorm() function.
> qqnorm(scores$prof)
This function plots the proficiency scores as in Figure 8.1 (left panel), which,
if the scores were normally distributed, would form a straight line. We can see that
this is not the case. We thus transform the variable to more closely resemble a
normal distribution. As in standard analyses, there are different ways to transform
variables. As the grades were
out of 100, we perform the logit transformation. We transform the variable in the
Prof column using the logit() function from the car package and create a new
column called “l_prof.”

> scores$l_prof <- logit(scores$prof, percents = TRUE)
FIGURE 8.1 Q-Q plots for untransformed (left) and transformed (right) proficiency
scores
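The transformation itself is simple log-odds arithmetic. The base-R sketch below assumes scores strictly between 0 and 100; car's logit() additionally adjusts values at the boundaries (0 or 100), which this simplified version does not handle:

```r
to_logit <- function(pct) {
  p <- pct / 100      # percentage to proportion
  log(p / (1 - p))    # log odds; equivalent to qlogis(p)
}
to_logit(c(20, 50, 80))   # symmetric around 0
```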
two levels and you want to compare each level to a baseline condition. Treatment
coding is, however, different from the coding scheme of standard ANOVA, and does
not produce ANOVA-style main effects. To obtain main effects, sum coding is
used, which requires the two levels of our fixed effect “course” (A and B) to be
recoded as –0.5 and 0.5. We recode “course” into the sum coded column called
“s_course” as below.

> scores$s_course <- ifelse(scores$course == "A", -0.5, 0.5)

For further information on how different coding schemes alter the interpretation
of results in mixed effects models and regression analysis in general, see
Gillespie (2010) and Chen, Ender, Mitchell, and Wells (2003, Chapter 5).
The “time” variable enters the fixed effects as a continuous predictor rather than
a categorical factor. In this analysis, we assume the effect of “time” on l_prof is
linear. Mixed effects models can, however, also model the effect of time in a
nonlinear fashion. For further discussion of different ways to model time in longitudinal
analysis, see Mirman, Dixon, and Magnuson (2008), Mirman (2014) and Singer
and Willet (2003, Chapter 6).
When including a continuous predictor, it is useful to center each value around
the mean, as this helps reduce collinearity in the model (see Jaeger, 2010). Cen-
tering involves subtracting the mean value of the predictor from each individual
value. Below, we add a column called “c_time” that centers the values from the
Time column.

> scores$c_time <- scores$time - mean(scores$time, na.rm = TRUE)
The next step is to consider what random effects to include. We will need
random effects parameters to model all known sources of random variance amongst
the different participants in our study. As six different classes of students were
tested, we also need random effects parameters to model the variance across
classes. Finally, as students are hierarchically clustered within classes, the model
will need to include a nested random effects structure that specifies that students
are nested under classes. The syntax that follows will fit a mixed effects model to
our data taking these considerations into account.

> model.1 <- lmer(l_prof ~ s_course*c_time + (1|student) + (1|class), data = scores)
This fits a mixed effects model called “model.1” (note this name is arbitrary)
in which the dependent variable l_prof is analyzed in terms of the fixed effects
parameters s_course*c_time. This notation is a shorthand that specifies both
main effects and all possible interactions, but the notation s_course+c_time+s_
course:c_time could instead have been used, which explicitly specifies main
Mixed Effects Modeling and Data Analysis 169
effects for s_course and c_time and the s_course:c_time interaction. In a more
complex design with several higher order interactions, this flexibility in R syn-
tax allows the researcher to specify which interactions to include based on the
hypotheses being tested.
The next part of the syntax specifies the random effects. These are specified
with parentheses () to distinguish them from the fixed effects. The syntax (1|student) specifies a random intercept for students and (1|class) a random intercept for
classes (in R, 1 is used here to signify the presence of an intercept, while 0 could
be used to signify its absence). These random intercepts model how the overall
average proficiency scores for each student and each class vary randomly. The final
part of the syntax, data = scores, specifies which data frame is analyzed. Note that
we have not explicitly specified that the random effects are nested. As we coded
each student and each class with unique identifiers, the model is able to “work
out” the nested structure automatically. This would not be the case if the variables
were coded differently. If the three classes taught with Strategy A were coded as
1–3 and also the three classes with Strategy B as 1–3 (rather than 1–6), the nested
structure would need to be explicitly stated. We suggest adopting a similar coding
scheme to that used here so as to avoid this issue. A summary of the model (i.e.,
output) is obtained as follows.
> summary(model.1)
Scaled residuals:
Min 1Q Median 3Q Max
-3.2484 -0.5629 -0.0208 0.6198 4.8181
Random effects:
Groups Name Variance Std.Dev.
student (Intercept) 0.2334 0.4831
class (Intercept) 0.1090 0.3301
Residual 0.2203 0.4694
Number of obs: 749, groups: student, 156; class, 6
Fixed effects:
Estimate Std. Error t value
(Intercept) -0.146509 0.141280 -1.04
170 Ian Cunnings and Ian Finlayson
This tells us that the mixed effects model is fit using a restricted maximum
likelihood technique (REML). The model formula (syntax) is then given, followed by the REML criterion. This is a measure of how much the model deviates
from a “saturated” model (a model with a parameter for each observation). This
is known as deviance, and gives an indication of how well the model fits the data.
A number closer to 0 indicates a well-fitted model. Note that the absolute value
here is difficult to interpret, but the difference in values between two different
models fit to the same data can be used to assess which model provides a better fit
(see page 172). The scaled residuals provide a summary of the distribution of the
per-observation error (i.e., how the observed data differ from the values predicted
by the model). These values should be approximately symmetrical if the assump-
tion of normality has been met.
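One informal way to check this, sketched here under the assumption that a fitted model object named model.1 exists, is to inspect the residuals graphically in base R:

```r
# Extract per-observation residuals from the fitted model and plot them
r = residuals(model.1)
hist(r)       # should look roughly symmetric around zero if normality holds
qqnorm(r)     # points close to the reference line suggest normality
qqline(r)
```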
The summary then provides information about the random effects. The sum-
mary shows that we have included random intercepts for “student” and “class”
and provides information about the variance associated with each. The summary
then shows the residual variance, which is the amount of variance that is not
explained by the model. This is followed by information about the number of
observations and how they are grouped. Finally, we get information about the
fixed effects, including model estimates, standard errors, and t values. Note that p
values are not shown. We discuss this in more detail next.
The model estimates also provide an estimate of the size of the effects. That
the estimate of the main effect of “c_time” is positive indicates that proficiency
increased over time. The estimate for “c_time” indicates that for every one unit
increment in “c_time,” the (logit transformed) proficiency scores increased by
0.116 units. Note that these absolute values are perhaps difficult to interpret in
this instance as the dependent variable was transformed. In other analyses, the
estimates may be more easily interpretable. For example, if the dependent mea-
sure was a reaction time in milliseconds, the estimates would indicate differences
between conditions in milliseconds. The estimates could then be used to gauge
the magnitude of an effect in order to understand the extent to which the two
conditions differed.
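As an aside on the transformation itself: a logit transformation maps a proportion p (strictly between 0 and 1) onto an unbounded scale via log(p / (1 − p)). A base R sketch with invented values:

```r
# Logit transform proportions: log(p / (1 - p)), equivalently qlogis(p)
p = c(0.2, 0.5, 0.8)
l = log(p / (1 - p))
```

A proportion of 0.5 maps to 0 on the logit scale, and proportions equidistant from 0.5 map to values symmetric around 0.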
It is important to emphasize again that the random effects as specified in this
model are random intercepts. This allows the average proficiency score of each
“student” and “class” to vary and will model, for example, that some students
might on average score less than others, while some classes might on average score
higher than others. In a between-groups design, the random variance across con-
ditions can be modeled with random intercepts. In a repeated measures design,
however, it is important to consider not only random intercepts but also random
slopes.
In this example study, “course” varies between students and classes. That is,
each student and class is tested on either Strategy A or Strategy B but not both.
In other words, a student or class is only tested on one level of the independent
variable “teaching strategy” (A or B). However, whereas “course” varies between
students and classes, “time” varies within them, as each student and each class
was tested at multiple points in time. As such, students and classes may not only
differ in overall average proficiency, but also in their sensitivity to the change
in proficiency over time. Some students (and classes) may greatly increase over
time, while others may only increase slightly. Currently, this type of variance is
not modeled in model.1. The random intercepts that this model includes only
model variance in average scores across students and classes, not variance in the rate
of change over time. Random slopes are required to model this type of variance.
Random slopes can be included for any repeated measures variable. It is impera-
tive that random slopes are included when required, as not including a random
slope for a repeated measures variable when there is considerable random slope
variance can lead to overconfident estimates of fixed effects and spurious results
(Barr, Levy, Scheepers, & Tily, 2013; Schielzeth & Forstmeier, 2009).
We add random slopes as follows. We first create a second model with a
random slope of “c_time” varying by student and then a third that addition-
ally includes the random slope of “c_time” varying by class. As “s_course” is not
repeated within participants and classes (no participant or class was tested on both
strategies A and B), it is not necessary to include the main effect of “s_course” or
the s_course:c_time interaction in the random slope terms. Random slope inter-
actions would be needed for any interaction that involves only repeated measures
variables (Barr, 2013).
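Sketched in the same style as the model.5 call shown later in the chapter, the two random slope models and their comparison might be specified as follows (assuming model.1 and the scores data frame from earlier):

```r
# model.2: random slope of c_time varying by student
model.2 = lmer(l_prof ~ s_course*c_time + (1+c_time|student) + (1|class),
               data = scores)

# model.3: additionally, random slope of c_time varying by class
model.3 = lmer(l_prof ~ s_course*c_time + (1+c_time|student) +
                 (1+c_time|class), data = scores)

# Compare the REML fits; refit = FALSE prevents refitting with ML
anova(model.1, model.2, model.3, refit = FALSE)
```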
Whether these random slopes improve the fit can be assessed by comparing the models with the anova() function. Note that we have used the anova() function here specifying refit = FALSE. The reason for this will be discussed in more detail next (the output here has been edited for space).
The results in the final column show that model.2 provides a significantly
improved fit over model.1, and model.3 provides a significantly improved fit
over model.2, indicating that the random slopes are accounting for a significant
amount of the random variance. Indeed, the summary for model.3 in the next
code sample (edited to save space) shows that the REML criterion value for
model.3 (1164) is lower than model.1 (1307), indicating that model.3 provides a
better fit. The addition of the random slopes for “c_time” in model.3 has also led
to an increase in the standard errors for the fixed effects in model.3 compared
to model.1, indicating that the random intercept only model was providing an
overconfident estimate of these parameters.
> summary(model.3)
Random effects:
Groups Name Variance Std.Dev. Corr
student (Intercept) 0.1947303 0.44128
time 0.0006193 0.02489 -0.13
class (Intercept) 0.0083835 0.09156
time 0.0004230 0.02057 1.00
Residual 0.1412285 0.37580
Fixed effects:
Estimate Std. Error t value
(Intercept) -0.142194 0.144181 -0.986
s_course 0.873302 0.288362 3.028
c_time 0.116553 0.008789 13.261
s_course:c_time 0.058102 0.017578 3.305
There has been some debate in the literature regarding how one should decide
on whether or not a random slope should be included in the analysis. Some
researchers have adopted a data-driven approach (e.g., Baayen et al., 2008) in
which random slopes are included only if they significantly improve model fit
(as shown earlier). Data-driven approaches are ideal for exploratory research. For
example, large corpora may have many independent variables. In such cases, it
may be unrealistic to include all fixed and random effects at once, and as such it
may make sense to adopt a data-driven approach.
However, in confirmatory research, the researcher designs a study to test a specific set of hypotheses. Barr et al. (2013) argued that in this case, the random
effects should reflect the design of the study and the hypotheses being tested
such that random intercepts and slopes should be included for all theoretically
relevant fixed effects. They dubbed this the maximal model. Our example study
here was devised to examine how two teaching strategies influence proficiency
over time. In the design used, the random effects structure in model.3 contains
all the theoretically relevant random intercepts and slopes for the included fixed
effects to test these aims, and thus would be the maximal model. We suggest
researchers follow Barr et al. in use of maximal models in confirmatory research.
Note that the term confirmatory here is not intended to mean that random slopes
are needed only in replication research (i.e., research that attempts to confirm
existing results), rather it relates to research that tries to confirm (or reject) spe-
cific hypotheses.
Another issue that arises even in confirmatory research is whether to include
random slopes for control predictors (i.e., a predictor that is not of prime theo-
retical interest but which may affect the results; see Barr et al., 2013). There is
little consensus in this case. Given that a model may become overly complex if all
possible fixed and random effects for control predictors are included by default, a
data-driven approach might be appropriate in such cases to decide whether such
parameters should be included.
Recall that t values are reported in the model summaries, but not p values. The
calculation of exact p values for mixed effects models is not straightforward as it is
not obvious how the degrees of freedom should be counted (Baayen et al., 2008;
Bates, 2006). There are different ways to estimate p values and determine statistical
significance, although here too there is no current consensus on which method to
use. One way is to estimate p values from the t distribution as shown next (from
Baayen, 2008, p. 248):
2 * (1 - pt(abs(3.028), 749 - 4))
[1] 0.002546729
Note that this p value can be overly liberal for small data sets (Baayen, 2008;
Baayen et al., 2008). The degrees of freedom are estimated by subtracting the
number of fixed effects from the number of observations. Consequently, when
a data set is small, subtracting the number of fixed effects from the number of
observations can have a large impact on the p value. However, in the case of the
current example study, the difference between 749 and 749 – 4 is largely incon-
sequential. For further discussion of ways to assess statistical significance in mixed
models, see Baayen (2008, pp. 247–248), Baayen et al. (2008, pp. 396–399) and
Barr et al. (2013, pp. 276–277).
Although this hypothetical study was designed to examine the effects of two
teaching strategies, the researcher may want to consider if potentially confound-
ing variables are influencing the data. As mentioned earlier, one may not want
to include such control predictors in the analysis by default, as including too
many variables can lead to a model that is overly complex and difficult to inter-
pret. As we compared different random effects structures, we can also use model
comparisons to test whether the inclusion of fixed effects for control variables
significantly improves model fit. The researcher can then include or remove a
control predictor based on whether or not it provides a better fit to the data. As
an example, we create a model with a fixed main effect of gender to test for any
differences between male and female participants (note gender is first sum coded
into s_gender as above with course). We then compare model.4 to model.3 using
the anova() function (note refit = FALSE is not specified this time) to see if the
inclusion of s_gender improves model fit.
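A sketch of this comparison, assuming a sum coded s_gender column has been added to the scores data frame:

```r
# model.4: model.3 plus a fixed main effect of (sum coded) gender
model.4 = lmer(l_prof ~ s_course*c_time + s_gender + (1+c_time|student) +
                 (1+c_time|class), data = scores)

# anova() will refit both models with ML before comparing the fixed effects
anova(model.3, model.4)
```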
Here, the results of the model comparison in the last column suggest that
model.4 does not provide a significantly improved fit to the data (p = .229) com-
pared to model.3, and as such we do not need to include the fixed effect for
gender.
Note that when a model is fit using REML, as here, model comparisons are
appropriate only when comparing models with different random effects (Pin-
heiro & Bates, 2000). To compare two models that differ in fixed effects, mod-
els should be fit using maximum likelihood. The anova() function refits the
model with maximum likelihood to allow comparison of models differing in
fixed effects. The output shown earlier illustrates this by stating refitting model(s)
with ML (instead of REML). When we compared different random effects for
model.1, model.2 and model.3 earlier we explicitly specified refit = FALSE to
ensure that the anova() function did not refit these models using maximum
likelihood (ML), as comparing models with different random effects can be con-
ducted on models fit using REML. To compare models with different fixed
effects, however, ML should be used. Although the anova() function can automatically do this, it is also possible to compare the same models fit with ML.
Next we suppress the default option of lmer() to fit models by REML with the
code REML=F.
> model.5 = lmer(l_prof ~ s_course*c_time + (1+c_time|student) + (1+c_time|class), data = scores, REML=F)
Note that this time the anova() function does not give the refitting model(s)
with ML warning. The comparison here is still nonsignificant and has the same
chi-square and p values as before. For conciseness, we suggest researchers fit mod-
els using ML (REML=F) when comparing models with different fixed effects,
rather than relying on refitting using the anova() function.
At this point, we are ready to report our results. The results could be reported
as follows:
The data were analyzed using a linear mixed effects model with fixed effects for course, time, and the course by time interaction. The fixed effect factor “course” was sum coded while the
continuous fixed predictor “time” was centered. The dependent variable
(“proficiency”) was transformed using a logit transformation. Students and
classes were treated as random effects with students nested under classes.
Random intercepts for subjects and classes were included, as were random
slopes for time varying by both students and classes, using a maximal ran-
dom effects structure. Statistical significance was assessed by calculating p
values from the t distribution.
This model revealed a significant main effect of course (estimate = 0.87,
SE = 0.29, t = 3.03, p = .003), with those taught with Strategy B dem-
onstrating a higher average proficiency than those taught with Strategy A.
There was also a significant main effect of time (estimate = 0.12, SE = 0.01,
t = 13.26, p < .001), with the positive estimate indicating that the average
proficiency across both groups increased at a rate of 0.12 points on the logit
scale for every one unit increment in time (i.e., every month). Importantly,
these main effects were qualified by a significant course by time interaction
(estimate = 0.06, SE = 0.02, t = 3.31, p < .001), indicating that the increase
in proficiency over time was reliably larger for teaching Strategy B than
Strategy A, suggesting Strategy B is more effective. Indeed, although both
groups started with similar proficiency scores, after 24 months of teaching,
students taught with Strategy B had a proficiency score 25 points higher
(1.46 on the logit scale) than those taught with Strategy A. The addition of
a fixed main effect for gender did not lead to an improvement in model fit
compared to the model without (χ2 [1] = 1.45, p = .229), suggesting gender
did not affect proficiency in this study.
Models with random slopes also estimate a correlation between the random intercepts and random slopes (e.g., a student with a higher than average proficiency may learn faster than average over
time). The summary of model.3 indicates the correlation between the random
intercept and random slope for class is very high. High correlations can often
occur with models that fail to converge. If this were the case, the model could
be simplified by removing the correlation parameter with the syntax (1|class)
+ (0+c_time|class). Unfortunately, there is little consensus in best practice
when dealing with convergence errors (see Barr et al., 2013, pp. 275–276 for
discussion).
• Whenever possible, consider making raw data sets, and the R scripts
used to prepare and analyze them, available for reanalysis by other
researchers.
Further Reading
We are aware of four introductions to R with an emphasis on language data, all of which provide a strong foundation in both linguistic/quantitative analysis and the use of the R statistical package. Field (2013, Chapter 25) provides a practical
introduction to using mixed effects models in SPSS.
Further discussion of mixed effects models can be found in the 2008 special
edition of the Journal of Memory and Language on emerging data analyses (Baayen et al., 2008; Barr, 2008; Dixon, 2008; Jaeger, 2008; Mirman et al., 2008; Quené & van den Bergh, 2008). Cunnings (2012) and Linck and Cunnings (in press) provide additional introductions aimed at L2 researchers. Existing L2 studies using
mixed effects models for longitudinal analysis include Ljungberg et al. (2013) and
Meunier and Littre (2013; see Sample Study 1).
Discussion Questions
1. Think about the variables of a study you have read about or that you are
conducting. Would a mixed effects model be appropriate? If not, why not? If
appropriate, which factors would you consider to be fixed vs. random? Why?
2. In the analysis in this chapter, the main effect of gender did not improve
model fit. Other potentially confounding variables in the study are L1, age,
and length of exposure. Consider whether these variables should be included
while bearing the following questions in mind.
a) How should these variables be coded?
b) Do any of these variables lead to a significant improvement in model fit?
c) Should you only include main effects of each of these variables, or could
they potentially interact with other independent variables?
d) If any of these variables do provide a significantly improved model fit,
should you also consider including random slopes? If so, what random
slopes should be included?
3. The Categorical data frame in the supplementary data file (Longitudinal.
RData), available on this book’s companion website (http://oak.ucc.nau.
edu/ldp3/AQMSLR.html), contains a similar set of data with a different
dependent variable. Imagine participants took part in a formal test at each
point in time. With 50 questions per test, this equates to students answer-
ing 250 questions in total over the course of teaching. Responses to each
References
Baayen, H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.
Baayen, H., Davidson, D., & Bates, D. (2008). Mixed-effects modeling with crossed random
effects for subjects and items. Journal of Memory and Language, 59, 390–412.
Barr, D. (2008). Analysing “visual world” eyetracking data using multilevel logistic regres-
sion. Journal of Memory and Language, 59, 457–474.
Barr, D. (2013). Random effects structure for testing interactions in linear mixed-effects
models. Frontiers in Psychology, 4, 328. doi: 10.3389/fpsyg.2013.00328
Barr, D., Levy, R., Scheepers, C., & Tily, H. (2013). Random-effects structure for con-
firmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68,
255–278.
Bates, D. (2005). Fitting linear models in R: Using the lme4 package. R News, 5, 27–30.
Bates, D. (2006). Post to the R-help mailing list, 19 May 2006. https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html
Boyle, M., & Willms, J. (2001). Multilevel modelling of hierarchical data in developmental
studies. Journal of Child Psychology and Psychiatry and Applied Disciplines, 42, 141–162.
Chen, X., Ender, P., Mitchell, M., & Wells, C. (2003). Regression with SAS. http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm
Ortega, L., & Byrnes, H. (2008). The longitudinal study of advanced L2 capacities. New York:
Routledge.
Ortega, L., & Iberri-Shea, G. (2005). Longitudinal research in second language acquisition:
Recent trends and future directions. Annual Review of Applied Linguistics, 25, 26–45.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-PLUS. New York:
Springer.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L., Egbert, J., & LaFlair, G. T. (in press). Bootstrapping in applied linguistics: Assess-
ing its potential using shared data. Applied Linguistics.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
the case of interaction research. Language Learning, 61, 325–366.
Quené, H., & van den Bergh, H. (2008). Examples of mixed-effects modelling with crossed
random effects and with binomial data. Journal of Memory and Language, 59, 413–425.
R Development Core Team. (2014). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing.
Raudenbush, S. (1993). A crossed random effects model for unbalanced data with appli-
cations in cross-sectional and longitudinal research. Journal of Educational Statistics, 18,
321–349.
Raudenbush, S. (2001). Comparing personal trajectories and drawing causal inferences
from longitudinal data. Annual Review of Psychology, 52, 501–525.
Raudenbush, S., & Bryk, A. (2002). Hierarchical linear models: Applications and data analysis
methods (2nd ed.). Thousand Oaks, CA: Sage.
Revelle, W. (2014). psych: Procedures for personality and psychological research (Version 1.4.8). Evanston, IL: Northwestern University. http://CRAN.R-project.org/package=psych
Schielzeth, H., & Forstmeier, W. (2009). Conclusions beyond support: Overconfident esti-
mates in mixed models. Behavioral Ecology, 20, 416–420.
Singer, J. (1998). Using SAS PROC MIXED to fit multilevel models, hierarchical models,
and residual growth curve models. Journal of Educational and Behavioral Statistics, 23,
323–355.
Singer, J., & Willett, J. (2003). Applied longitudinal data analysis: Modeling change and event
occurrence. New York: Oxford University Press.
Snijders, T., & Bosker, R. (1999). Multilevel analysis. London: Sage.
Turner, J. L. (2014). Using statistics in small-scale language education research. New York:
Routledge.
9
EXPLORATORY FACTOR
ANALYSIS AND PRINCIPAL
COMPONENTS ANALYSIS
Shawn Loewen and Talip Gonulal
Proponents feel that factor analysis is the greatest invention since the double
bed whereas its detractors feel it is a useless procedure that can be used to
support nearly any desired interpretation of the data. The truth, as is usually
the case, lies somewhere in between. Used properly, factor analysis can yield
much useful information; when applied blindly and without regard for its
limitations, it is about as useful and informative as tarot cards. (p. 144)
Conceptual Motivation
As in many other disciplines, L2 researchers often explore large data sets. For
instance, researchers interested in teachers’ and students’ beliefs about grammar
instruction may collect data from numerous participants using a survey with many
individual questions. Alternatively, researchers might investigate the occurrence of
various linguistic structures in different discourse types in L1 and/or L2 corpora. In
such research studies, a frequent objective is to reduce the initial data set by identi-
fying variables, such as the survey questions or linguistic structures mentioned ear-
lier, that behave similarly. Factor analysis can be used to investigate the correlations
present in the data and to consolidate variables in a principled manner.
Factor analysis is not a single statistical method but a series of complex structure-analyzing procedures, like structural equation modeling (see Schoonen, Chapter 10 in this volume), that investigate the potentially unobserved relationships amongst variables in a data set; as such, factor analysis can be used for a variety of purposes.
One common use is to explore the underlying relationships in a set of variables
by deriving a more parsimonious number of related variables, referred to as factors
or components (Gorsuch, 1983; Kline, 2002; Tabachnick & Fidell, 2013; Thompson,
2004). These factors are argued to represent underlying constructs (also known as
latent variables) in the data. For example, Loewen et al. (2009) used factor analysis
to group 37 questionnaire items into six conceptually related factors. Additionally,
factor analysis can be applied to data sets with large numbers of items or variables
in order to reduce the data to a more manageable size (Field, 2009; Gorsuch, 1983,
1990). For instance, Asención-Delaney and Collentine (2011) used factor analysis
to investigate how 78 different linguistic structures in a written L2 Spanish corpus
grouped into different discourse types. Moreover, factor analysis can be used for
conducting item analysis to strengthen tests or questionnaires by identifying items
that are relatively unrelated to the overall test (see Gorsuch, 1983; Kline, 2002).
Finally, as explained later, the factors generated from a factor analysis can be used
in subsequent analyses such as ANOVA and regression.
(a) reading and writing tests, (b) effective administration, (c) impacts on cur-
riculum and learning, (d) speaking test, and (e) listening test. These factors would
have been difficult to identify simply looking at the 40 items in the questionnaire.
CFA, however, is used when researchers have specific expectations regarding
the underlying structure of the data. For example, Mizumoto and Takeuchi (2012) used CFA in their adaptation and validation study of Tseng, Dörnyei, and Schmitt’s (2006) self-report questionnaire investigating the self-regulating capacity in vocabulary learning in a Japanese English as a foreign language setting. Because the researchers were basing their analysis on Tseng et al.’s (2006) previously conducted analysis, Mizumoto and Takeuchi had clear expectations regarding which and how many factors would underlie the questionnaire. Consequently, it was appropriate for them to conduct a CFA.
As seen from the previous examples, the selection between EFA and CFA
depends primarily on whether researchers have specific theoretical expectations
regarding the number and nature of factors present in the data. (See Thompson,
2004 for more detail on the differences between EFA and CFA.)
One practical difference between EFA and CFA lies in the software programs
used for statistical analyses. When conducting an EFA, more common statistical
computer software packages (e.g., SPSS, R, and SAS) are used, whereas in CFA,
more recent (and less common) statistical packages (e.g., AMOS, LISREL, and
EQS) are used. Considering the different assumptions and purposes of CFA, and
due to limited space, this chapter will focus exclusively on EFA. See Harrington
(2009) for more details on conducting CFA. Nevertheless, conceptual knowledge
of EFA is helpful in understanding CFA.
In addition to the differences between CFA and EFA, there is some ambiguity
in the terminology used within EFA itself because it is often used as an umbrella
term covering both principal components analysis (PCA) and EFA. However,
there are two schools of thought on the differences between EFA and PCA (Hen-
son & Roberts, 2006). Some statisticians view EFA and PCA as completely dif-
ferent types of analyses, whereas other statisticians treat PCA as a type of EFA that
differs only in its method of factor extraction.
In conceptual terms, the difference between PCA and EFA lies in how they
treat the variance that is present in the data; PCA analyzes variance whereas EFA
analyzes covariance (Tabachnick & Fidell, 2013). That is to say, PCA includes all
variance (i.e., the variability or spread within a data set) including (a) variance
unique to each variable, (b) variance common among variables, and (c) error
variance (Gorsuch, 1983; Kline, 2002; Tabachnick & Fidell, 2013). In contrast,
EFA includes only the variance in the correlation coefficients (i.e., the variance
common among variables), whereas the error variance and the variance unique to
each variable are excluded from the analysis. In sum, PCA does not differentiate
between common and unique variance, but EFA does.
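In R, for example, this distinction corresponds to two separate functions in the psych package; the sketch below assumes psych is installed and that dat is a hypothetical data frame of item responses:

```r
library(psych)

# PCA: all variance (unique, common, and error) enters the analysis
pca_fit = principal(dat, nfactors = 2, rotate = "none")

# EFA with principal axis factoring: only common variance is modeled
efa_fit = fa(dat, nfactors = 2, fm = "pa", rotate = "none")
```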
The importance of the distinction between EFA and PCA is controversial
(Field, 2009). Often PCA results may be very similar to EFA results; however, in
some instances, there may be meaningful and substantial differences between the
two (Conway & Huffcutt, 2003). For instance, in PCA the weight with which
variables load on to factors may be too high, whereas EFA loadings are more
accurate when the data meet the assumptions of EFA (Widaman, 1993). Fabri-
gar, Wegener, MacCallum, and Strahan (1999) investigated several data sets and
showed that there were a number of cases in which EFA and PCA solutions were
different. Gorsuch (1990) argued that it is better to use EFA because it produces
better solutions some of the time and similar results the rest of the time. Conway
and Huffcutt (2003) note that:
FIGURE 9.2 Overview of the steps in a factor analysis (adapted from Rietveld & Van Hout, 1993, p. 291)
Factor Analysis 187
In many cases, the software programs used for conducting EFA contain default
settings; however, overdependence on such settings, as is sometimes seen in fac-
tor analytic L2 research (Plonsky & Gonulal, 2015), may not provide the most
accurate analyses. Therefore, it is crucial for researchers to be informed about the
various options in conducting an EFA and to follow a decision pathway to obtain
the best results. The flow diagram (Figure 9.2) adapted from Rietveld and Van
Hout (1993, p. 291) illustrates the necessary steps to conduct an EFA. The next
sections will discuss these steps in order, and important decision points will be
explained. Throughout the steps, examples from the LearnerBeliefsData.sav file
(available on the companion website, http://oak.ucc.nau.edu/ldp3/AQMSLR.
html), which was subjected to a FA with principal components analysis extraction
in SPSS (version 21), will be provided. Note that different versions of SPSS may
differ somewhat in their format and output.
1. Factorability of Data
The first step in conducting an EFA is to consider if the data are appropriate
for factor analysis. As in other statistical methods, researchers should check the
assumptions of EFA. Specifically, EFA can be used for interval data, including Lik-
ert scale items. Further, the variables used in EFA should be linearly related and
moderately correlated. In addition, sample size should be taken into consideration
because correlations are highly sensitive to N. There are several rules of thumb
regarding the appropriate sample size for factor analysis. In some cases, research-
ers propose minimum sample sizes such as 100 (Hair, Anderson, Tatham, & Black,
1995), 300 (Tabachnick & Fidell, 2013), or 500 (Comrey & Lee, 1992). Alterna-
tively, recommendations regarding sample size relate to the specific number of
subjects or items per variable. The exact number required is disputed, with esti-
mates ranging from 3 to 20 subjects or items per variable (Gorsuch, 1983, 1990,
2003; Pett, Lackey, & Sullivan, 2003; Tabachnick & Fidell, 2013; Thompson, 2004).
That being said, 10 to 15 is the most common suggestion (Field, 2009). However,
following a rule of thumb can sometimes be misleading because a large sample
size is not always necessary for accurate factor solutions or correlations. According
to MacCallum, Widaman, Zhang, and Hong (1999), “when communalities are
high (greater than .60) and each factor is defined by several items, sample sizes can
actually be relatively small” (p. 402).
Because the suggested sample size for factor analysis varies considerably (for
further detail on sample size in factor analysis see Gorsuch, 1983, 1990, 2003;
MacCallum et al., 1999), one additional approach is to conduct a post hoc analysis
to investigate the appropriateness of a given sample for a specific analysis. One
such method is the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy.
KMO values range from 0 to 1, with higher values representing better sampling
adequacy (see Figure 9.3). More specifically, “values between 0.5 and 0.7 are
mediocre, values between 0.7 and 0.8 are good, values between 0.8 and 0.9 are
great and values above 0.9 are superb” (Field, 2009, p. 679). Thus, the KMO
value of 0.897 in Figure 9.3 represents a very good sample size for the specific
study (which had 754 participants and 24 variables, or roughly 30 participants per
variable).
188 Shawn Loewen and Talip Gonulal
FIGURE 9.3 Example of KMO measure of sampling adequacy and Bartlett’s Test of
Sphericity (SPSS output)
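The arithmetic behind the KMO statistic can be sketched directly: it compares the squared observed correlations to the squared partial correlations obtained from the inverse of the correlation matrix. The following pure-Python sketch uses an invented three-item correlation matrix, not the chapter's data:

```python
import math

def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        # Pivot on the largest absolute value for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]

def kmo(R):
    """Kaiser-Meyer-Olkin measure: sum of squared correlations over
    sum of squared correlations plus squared partial correlations."""
    inv = invert(R)
    n = len(R)
    r2 = q2 = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                r2 += R[i][j] ** 2
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                q2 += partial ** 2
    return r2 / (r2 + q2)

# Three moderately correlated items (r = .50 throughout)
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
kmo_value = kmo(R)  # ~0.69, "mediocre" on Field's (2009) scale
```

For real data sets, dedicated software (SPSS here) reports the same quantity in the KMO and Bartlett's Test output table.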
Although there is no suggested sample size for L2 research, Plonsky and
Gonulal (2015) reported that in L2 factor analytic research EFA was used for a
median of 24 variables, with a median of 252 participants. The median participant-to-variable ratio was 12.
In addition to determining the appropriate sample size, researchers need to
examine the correlations and communalities among the variables entered into
the EFA. There might be two possible problems here: (a) correlations can be
quite low (or even nonexistent), or (b) correlations can be quite high. Neither
situation is desirable because both indicate a lack of variation in the data. To
test for undesirably low correlations, researchers can employ Bartlett’s Test of
Sphericity, which tests the hypothesis that the correlation matrix is an identity
matrix, meaning that all correlation coefficients are close to 0 (Field, 2009).
Such a scenario is undesirable because if no variables are correlated, then it
is not possible to find clusters of related variables. Therefore, Bartlett’s Test
indicates whether the correlations between variables are significantly different
from 0 (Field, 2009), and a significant result with p < .05 indicates that the
variables are correlated and thus suitable for EFA, as is seen in the Sig. value of
.000 in Figure 9.3.
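Bartlett's statistic itself is easy to reproduce: it scales the log determinant of the correlation matrix as chi2 = -(n - 1 - (2p + 5)/6) * ln|R|, with p(p - 1)/2 degrees of freedom. A sketch with an invented three-item matrix (the chi-square p-value lookup is left to a table or a statistics package):

```python
import math

def det(m):
    """Determinant by cofactor expansion (fine for small matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += ((-1) ** j) * m[0][j] * det(minor)
    return total

def bartlett(R, n_subjects):
    """Bartlett's test of sphericity:
    chi2 = -(n - 1 - (2p + 5)/6) * ln|R|, df = p(p - 1)/2.
    Compare chi2 against the chi-square distribution with df degrees
    of freedom to obtain the significance level."""
    p = len(R)
    chi2 = -(n_subjects - 1 - (2 * p + 5) / 6) * math.log(det(R))
    df = p * (p - 1) // 2
    return chi2, df

R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
chi2, df = bartlett(R, n_subjects=100)  # chi2 well beyond the df = 3 critical value
```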
In addition to low correlations, another potential problem is multicollinearity,
which is the presence of variables that are too highly correlated, with a correla-
tion coefficient around ±.90. A simple solution to check for multicollinearity
is to inspect the correlation matrix (or R-matrix) and the determinant of the
R-matrix for highly correlated variables. Correlation coefficients beyond ±.90
indicate that the two variables are essentially identical and measure the same
thing, thereby adversely affecting the computation of the EFA. The determinant
of the R-matrix should be greater than 0.0001 (Field, 2009); thus, the determi-
nant of .001 in Figure 9.4 indicates that multicollinearity is not a problem for
this data set. If, however, multicollinearity is a problem, it is advisable to remove
one of the highly correlated variables from the analysis. Experimenting with the
removal of different variables will help determine which variable is having the
largest negative impact (Field, 2009).
FIGURE 9.4 Correlation matrix (R-matrix) for the 24 questionnaire items, with Determinant = .001 (SPSS output)
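This trial-and-error check can be automated: scan the R-matrix for correlations beyond ±.90, and recompute the determinant after dropping a candidate variable. A pure-Python sketch with an invented three-variable matrix:

```python
def det(m):
    """Determinant by cofactor expansion (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum(((-1) ** j) * m[0][j] *
               det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def drop(R, k):
    """Remove variable k from a correlation matrix."""
    return [[v for j, v in enumerate(row) if j != k]
            for i, row in enumerate(R) if i != k]

# Variables 0 and 1 are nearly identical (r = .99999)
R = [[1.0, 0.99999, 0.3],
     [0.99999, 1.0, 0.3],
     [0.3, 0.3, 1.0]]

# Flag pairs whose correlation exceeds the +-.90 threshold
high_pairs = [(i, j) for i in range(len(R)) for j in range(i + 1, len(R))
              if abs(R[i][j]) >= 0.9]

full_det = det(R)             # falls below Field's 0.0001 floor
reduced_det = det(drop(R, 0)) # dropping one of the pair restores it
```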
Finally, examining the communalities (h2) can provide an indication of the
relationship of each variable to the entire data set. Communalities represent
the amount of common variance in a variable that is accounted for by all of
the extracted factors. For example, in Figure 9.5 the communality for Q1
(h2 = .482) indicates that the six extracted factors in Loewen et al.’s (2009) study
explain 48.2% of the variance in the variable. High communalities are desired
because they indicate that the EFA results perform well in accounting for vari-
ance within the variables. Researchers may wish to exclude variables with low
communalities since one purpose of factor analysis is to investigate the common
underlying relationships in a data set.
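Arithmetically, a communality is simply the sum of an item's squared loadings across the extracted factors. A sketch with invented loadings on six factors:

```python
# Rotated loadings for one item on six extracted factors (illustrative values)
loadings = [0.62, 0.10, 0.05, 0.28, 0.12, 0.08]

# Communality: the share of the item's variance the factors account for
h2 = sum(l ** 2 for l in loadings)
percent_explained = 100 * h2  # reportable as "the factors explain ...% of the variance"
```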
Figures 9.6, 9.7, and 9.8 illustrate the initial steps for conducting an EFA
in SPSS.
Item Initial Extraction
Q1 1.00 .482
Q2 1.00 .549
Q3 1.00 .670
Q4 1.00 .472
Q6 1.00 .520
Q7 1.00 .568
Q11 1.00 .573
Q12 1.00 .542
Q13 1.00 .448
Q16 1.00 .527
Q17 1.00 .665
Q18 1.00 .551
Q21 1.00 .597
Q22 1.00 .720
Q23 1.00 .519
Q26 1.00 .558
Q27 1.00 .494
Q31 1.00 .466
Q32 1.00 .578
Q33 1.00 .704
Q36 1.00 .510
Q37 1.00 .464
RVQ8 1.00 .609
RVQ28 1.00 .682
FIGURE 9.5 Communalities (SPSS output)
Start by selecting Analyze > Dimension Reduction > Factor, which will
bring up the main dialogue box for factor analysis.
Select the variables of interest from the main dialogue box and move them
into the Variables dialogue box, then click the Descriptives button.
selecting Fixed number of factors and then entering the desired number of factors.
In most cases, the default value of 25 for Maximum Iterations for Convergence is ade-
quate, although a larger value might be needed for larger data sets (Field, 2009).
[Scree plot: eigenvalues plotted against component number (1–24)]
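The eigenvalues behind a scree plot also feed the Kaiser criterion (retain components with eigenvalues greater than 1) and the total variance explained. A sketch with invented eigenvalues for a 24-variable analysis:

```python
# Eigenvalues from a hypothetical 24-variable analysis (illustrative values;
# the remaining eigenvalues, all below 1, are omitted)
eigenvalues = [6.1, 2.3, 1.8, 1.4, 1.2, 1.05, 0.9, 0.8]

# Kaiser criterion: retain components with eigenvalues greater than 1
retained = [e for e in eigenvalues if e > 1]
n_factors = len(retained)

# Proportion of total variance the retained components explain
# (each standardized variable contributes 1 unit of variance, 24 in total)
variance_explained = sum(retained) / 24
```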
In the main Factor Analysis dialogue box, click on Rotation and select Direct
Oblimin for an oblique rotation. The default Delta value of 0 is recommended
(Field, 2009). In the Display section, select the Rotated solution in order to produce
the rotated factor-loading matrix (Figure 9.12). The Maximum Iterations for Convergence option specifies how many times SPSS will attempt to find a solution for the data set. The default value of 25 is usually adequate; however, for large data sets it is possible to increase the number of iterations, as done here for the N of 750.
In addition to the steps mentioned earlier, there are several additional options
when conducting an EFA (Figure 9.13). The first addresses missing data,
and it allows researchers to Exclude cases listwise, which means that any case with
missing data for any variable is excluded from the entire analysis. Alternatively,
Exclude cases pairwise includes all cases, even if they have missing scores on one
or two variables. The missing scores for each case are simply eliminated from the
analysis, while the remaining scores are included in the analysis. Because factor
analysis is based on correlations across the data set, it is recommended to eliminate
listwise rather than pairwise; however, listwise elimination may result in substan-
tial data loss if numerous cases have missing scores.
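The difference between the two options can be illustrated with a small invented response matrix, where None marks a missing answer:

```python
# Questionnaire responses; None marks a missing answer
cases = [
    [4, 5, 3, 4],
    [5, None, 4, 4],     # missing Q2
    [3, 4, 4, 5],
    [None, 2, 3, None],  # missing Q1 and Q4
]

# Listwise deletion: drop any case with at least one missing value
listwise = [row for row in cases if None not in row]

# Pairwise deletion: each correlation uses whichever cases answered
# both items, so different correlations rest on different subsamples
def complete_pairs(data, i, j):
    return [(row[i], row[j]) for row in data
            if row[i] is not None and row[j] is not None]

q1_q3 = complete_pairs(cases, 0, 2)  # retains three cases, not two
```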
Other options in the dialogue box include sorting variables according to the
size of their loadings on each factor, with the highest absolute scores placed first
on the list. Finally, SPSS also allows the suppression of absolute values less than a
specified value, for example .30. This option aids in factor interpretation because
it identifies only the variables that contribute substantially to the factor.
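The suppression option amounts to blanking out small loadings; a sketch with invented loadings and the conventional .30 cutoff:

```python
def suppress(loadings, cutoff=0.30):
    """Blank out loadings whose absolute value falls below the cutoff,
    mirroring SPSS's 'Suppress absolute values less than' option."""
    return [[l if abs(l) >= cutoff else None for l in row]
            for row in loadings]

# Loadings for three items on two factors (illustrative values)
matrix = [[0.62, 0.15],
          [0.28, 0.55],
          [-0.41, 0.08]]
cleaned = suppress(matrix)
# -> [[0.62, None], [None, 0.55], [-0.41, None]]
```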
Once all the desired options have been chosen, click OK in the main Factor
Analysis dialogue box.
5. Results
5.1 Factor Loadings
The next step after conducting the factor rotation and producing the rotated
component matrix (i.e., after producing the SPSS output for your factor analysis)
is to examine the factor loadings, which indicate the strength of the association
between each variable and each factor. Ideally, each variable should have a high
loading on only one factor, with small loadings on the remaining factors. Of
course, the interpretation of what constitutes a high loading is subjective, and
not surprisingly, there are different opinions about the optimal factor loading
score. One suggestion is to consider all loadings greater than .30 as important
(Comrey & Lee, 1992; Field, 2009); however, a cutoff score of .40 has also been
proposed (Pett et al., 2003). Finally, Stevens (2009) offers different guidelines for
evaluating factor loadings depending on the sample size. For instance, for a sample
size of 300, loadings should be larger than .298 whereas for a sample size of 600,
a loading of .21 is considered important (see Stevens, 2009, for further detail).
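Stevens's sample-size-dependent cutoffs follow from doubling the .01-level two-tailed critical value of a correlation coefficient; assuming that formulation (5.152 / sqrt(N - 2), which reproduces the .298 and .21 figures above), the rule can be sketched as:

```python
import math

def stevens_cutoff(n):
    """Loading cutoff as twice the .01-level two-tailed critical value
    of a correlation: 2 * 2.576 / sqrt(n - 2). This formulation is an
    assumption here; consult Stevens (2009) for the full guidelines."""
    return 5.152 / math.sqrt(n - 2)

cutoff_300 = stevens_cutoff(300)  # ~0.298
cutoff_600 = stevens_cutoff(600)  # ~0.21
```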
Once the factor loading cutoff level has been determined, the variables with
high loadings can be inspected. One useful aid in this process is the option in
SPSS that suppresses factor loadings lower than a specified cutoff point. As seen
in Figure 9.14, Loewen et al. (2009) suppressed factor loadings from –.29 to .29,
meaning that loadings beyond .30, such as Item 1 on Factor 1, are visible, while
loadings less than .30, such as Item 1 on Factor 2, are hidden. It is possible for
a variable to have low loadings on all factors, indicating that the variable is not
strongly associated with the other variables. In such cases, it is often desirable to
exclude the variable and rerun the analysis, keeping in mind that when an item
is excluded from a subsequent analysis, the factor loadings of the remaining items
will change. It is therefore important to exclude one item at a time and check the
new factor loadings accordingly.
In addition to variables that do not have high loadings on any factors, it is also
possible to have complex variables that have high loadings on more than one
factor, making interpretation difficult. For example, in Figure 9.15, Item 33 has
a loading of –.566 on Factor 4 and .661 on Factor 5. There are several suggested
solutions to this problem (Field, 2009; Henson & Roberts, 2006). One suggestion
is to simply assign the item to the factor that it loads most highly on. Another
option is to try different extraction and rotation methods to see if a stronger
differentiation of loadings across factors can be obtained.
[Component matrix: loadings of the 24 items on six extracted components, with values between –.29 and .29 suppressed (SPSS output)]
FIGURE 9.15 Rotated factor loadings (pattern matrix) (adapted from Loewen et al., 2009)
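Screening a loading matrix for weak and complex variables can be mechanized; the sketch below uses invented loadings (the item labels are illustrative) and the .30 cutoff:

```python
def classify_items(loadings, cutoff=0.30):
    """Sort items into 'complex' (high loadings on two or more factors),
    'clean' (exactly one high loading), and 'weak' (no high loading)."""
    out = {"complex": [], "clean": [], "weak": []}
    for item, row in loadings.items():
        high = sum(1 for l in row if abs(l) >= cutoff)
        key = "complex" if high > 1 else "clean" if high == 1 else "weak"
        out[key].append(item)
    return out

# Illustrative rotated loadings on six factors
loadings = {
    "Q33": [0.05, 0.10, 0.02, -0.57, 0.66, 0.08],  # loads on two factors
    "Q18": [0.12, 0.08, 0.05, 0.10, 0.04, 0.32],   # loads on one factor
    "Q22": [0.11, 0.02, 0.21, 0.05, 0.18, 0.09],   # candidate for removal
}
groups = classify_items(loadings)
```

Weak items flagged this way would then be removed one at a time, rerunning the analysis after each removal, as described above.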
each of the nine items on Factor 1; however, it is possible to combine the nine
scores into one factor score, which then provides a single numeric value for the
individual’s position on the factor. Thus, participants with higher variable scores
will have higher factor scores, while individuals with lower variable scores will
have lower factor scores (Rietveld & Van Hout, 1993).
There are several different statistical methods for computing factor scores. The
simplest is to sum or average each individual’s score on the variables that comprise
the factor, but such a method does not take into account the fact that variables
load on multiple factors. For example, Item 27 in Figure 9.15 has loadings above
.30 on factors 1 and 4. Rather than counting the item twice, or omitting it from
one factor, it is possible to calculate factor scores that reflect the weight of loadings
across the factors. In SPSS there are three primary methods of calculating factor
scores: the Regression method, the Bartlett method, and the Anderson-Rubin
method (see Figure 9.16). These three methods generally produce similar factor
scores; however, they differ slightly in their mathematical calculations. (See Field,
2009 and Thompson, 2004 for further details.) Click Scores from the main Fac-
tor Analysis dialogue box (Figure 9.16) and select Save as variables. Select which
method to use to calculate the factor scores, which will appear as variables in the
data view section of SPSS.
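The simplest of these approaches, averaging each participant's responses to the items on a factor, can be sketched with invented data (the regression, Bartlett, and Anderson-Rubin weightings are left to SPSS):

```python
# Item-to-factor assignment from a completed EFA (illustrative)
factor_items = {1: ["Q1", "Q2"], 2: ["Q3"]}

# Two participants' item responses (illustrative)
participants = [
    {"Q1": 4, "Q2": 5, "Q3": 2},
    {"Q1": 2, "Q2": 3, "Q3": 5},
]

# Mean-based factor score: average the responses to each factor's items
scores = [
    {f: sum(p[i] for i in items) / len(items)
     for f, items in factor_items.items()}
    for p in participants
]
# -> [{1: 4.5, 2: 2.0}, {1: 2.5, 2: 5.0}]
```

Note that this simple method ignores cross-loadings, which is exactly the limitation the weighted methods address.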
Rietveld and Van Hout (1993) list several situations in which factor scores can
be very useful, most notably their use in subsequent analyses (cf. Figure 9.2).
An example of factor score use comes from Loewen et al. (2009), who fol-
lowed their EFA with a discriminant function analysis. Loewen et al. (2009) used
factor scores to examine differences in L2 learners’ beliefs about grammar instruc-
tion and error correction according to the target languages that they were studying. Thus, rather than relying on 37 item responses for each individual, the analysis
incorporated only the factor scores for the six factors produced by the EFA.
[Figure 9.15, continued: rotated factor loadings and communalities (h2) for selected items]
33. I feel cheated if a teacher does not correct the written work I hand in. (–.57 on Factor 4, .66 on Factor 5; h2 = .70)
36. Second language writing is not good if it has a lot of grammar mistakes. (.71; h2 = .51)
VI. Negative Attitudes to Grammar Instruction
18. I like it when my teacher explains grammar rules. (.32; h2 = .55)
23. When I have a problem during conversation activities, it helps me to have my teacher explain grammar rules. (–.52; h2 = .52)
26. There should be more formal study of grammar in my second language class. (–.75; h2 = .56)
7. What to Report?
Given the number of options and subjective decisions involved in each step of
an EFA, readers must be able to assess researchers’ processes and results (Comrey & Lee, 1992; Conway & Huffcutt, 2003; Field, 2009; Ford, MacCallum, &
Tait, 1986; Pett et al., 2003). However, a great majority of L2 factor analytic
studies fail to provide sufficient information regarding their factor analytic pro-
cedures and results (Plonsky & Gonulal, 2015). In addition, some researchers are
advised by journal reviewers and editors not to provide too much statistical detail
(e.g., Loewen et al., 2014). This issue is symptomatic of more general problems
related to reporting practices and transparency in L2 research (e.g., Plonsky, 2013;
Larson-Hall & Plonsky, 2015).
Fortunately, there are guidelines regarding what to report for a factor analysis.
Pett et al. (2003), for example, offer a comprehensive set of guidelines that can be
used by researchers, reviewers, and editors who wish to evaluate the quality of a
published factor analysis study. Many of their recommended reporting items have
been exemplified throughout this chapter.
Conclusion
EFA has several important uses and has the potential to greatly inform L2 theory
and practice. Conducting an EFA, however, poses various challenges due in part
to (a) its complex nature, (b) researchers’ limited experiences with EFA, and
(c) the realities of conducting L2 research. Throughout this chapter we have
attempted to provide some useful insights and have presented a step-by-step
treatment of EFA.
We end our discussion now with three principles that we hope will guide
researchers employing this technique: First, each data set should be treated sepa-
rately, with researchers evaluating which EFA options are most appropriate for
the data in question. Second, it is always useful to try out different factor extrac-
tion, retention, and rotation methods to see which ones account for the largest
percentage of variance and provide the most interpretable solutions. Researchers
could begin with the default SPSS settings, and then alter procedures according
to the guidelines discussed throughout this chapter. Conducting multiple analyses
will not only strengthen the results, it will also help provide researchers with a
better understanding of the implications of selecting various EFA options. Third,
it is essential that factor analysts report sufficient information to allow for replica-
tion, evaluation, and accumulation of knowledge. Following these guidelines will
help researchers use EFA to its full potential in investigating various aspects of L2
learning and teaching.
SAMPLE STUDY 1¹
Loewen, S., Li, S., Fei, F., Thompson, A., Nakatsukasa, K., Ahn, S., & Chen, X.
(2009). Second language learners’ beliefs about grammar instruction and error
correction. The Modern Language Journal, 93(1), 91–104.
Background
The role of grammar instruction and error correction in the L2 classroom
has been a topic of considerable debate, centering in large part around the
feasibility and efficacy of meaning-focused instruction versus form-focused
instruction. Although previous studies have taken into consideration both
teachers’ and students’ beliefs on this issue, learner beliefs have received less
attention than teacher beliefs, even though such beliefs may influence the
effectiveness of classroom instruction. It is therefore important to investi-
gate, in detail, L2 learners’ perspectives on this issue.
Research Questions
• What underlying constructs are present in L2 learners’ responses to a
questionnaire regarding their beliefs about grammar instruction and er-
ror correction?
• To what extent can the underlying constructs of learners’ beliefs distin-
guish L2 learners studying different target languages?
Method
A questionnaire consisting of 37 Likert-scale questions regarding beliefs
about L2 grammar instruction and error correction was used.
Statistical Tools
An EFA was chosen because the researchers had no a priori expectations
regarding the number and nature of underlying factors. PCA was selected
for factor extraction and direct oblimin was used for factor rotation. The fac-
tor scores calculated from the EFA were used in the subsequent discriminant
function analysis to determine if students studying different L2s varied in
their responses to the factors.
Results
The EFA produced six factors with eigenvalues greater than 1. These factors
accounted for 55% of the total variance. After examining the content of the
items loading above .30 on each factor, Factor 1 was labeled “Efficacy of
Grammar” and included items such as “Knowing a lot about grammar helps
my reading” and “I usually keep grammar rules in mind when I write in a
second language.” The remaining five factors were labeled (2) “Negative
Attitudes to Error Correction,” (3) “Priority of Communication,” (4) “Impor-
tance of Grammar,” (5) “Importance of Grammatical Accuracy,” and (6)
“Negative Attitudes to Grammar Instruction.”
SAMPLE STUDY 2
Vandergrift, L., Goh, C.C.M., Mareschal, C. J., & Tafaghodtari, M. H. (2006). The
metacognitive awareness listening questionnaire: Development and validation.
Language Learning, 56(3), 431–462.
Background
The metacognitive awareness listening questionnaire (MALQ) is used to
examine the extent to which language learners are conscious of and can
adjust the L2 listening comprehension process. However, developing a valid
instrument that can address language learners’ awareness of the L2 listening
process is not easy and has potential shortcomings, such as being too long
or not comprehensive enough. This study examines the development and
validation of a listening questionnaire aiming to assess L2 listeners’ metacog-
nitive awareness and perceived use of strategies while listening to oral texts.
Method
Vandergrift et al. (2006) examined the relevant and recent literature on meta-
cognition, listening comprehension, and self-regulation. Based on previous
instruments, a comprehensive list of questionnaire items was formed and
then subjected to expert judgment for redundancy, content validity, clarity,
and readability. After this initial fine-tuning, the instrument was piloted with
a few students and revised again for clarity of the items. Finally, a question-
naire of 51 items was adopted.
Statistical Tools
Vandergrift et al. (2006) employed an EFA to determine the emerging fac-
tors, followed by a confirmatory factor analysis to validate the items retained.
Principal axis factoring was selected for the factor extraction method with
promax rotation with Kaiser Normalization. Maximum likelihood was
employed for confirmatory factor analysis. Finally, the reliability of each fac-
tor was calculated using Cronbach’s alpha.
Results
The EFA produced a 13-factor solution with eigenvalues larger than 1. How-
ever, after examining the scree plot, five factors were retained, thus increas-
ing the interpretability of the results. These five factors explained 44.5% of
the total variance. The items loading on each factor were carefully exam-
ined, and the factors were labeled as (1) “Person Knowledge,” (2) “Mental
Translation,” (3) “Directed Attention/Concentration,” (4) “Planning,” and
(5) “Problem-Solving.” Based on the results of EFA, a subsequent CFA was
conducted with separate data collected from a different sample. The three
models (i.e., four-factor solution, five-factor solution, and six-factor solution)
were tested using maximum likelihood estimation. The CFA results showed
that the five-factor model was a better overall fit. Based on these analyses,
the MALQ was considered to have robust psychometric properties as a mea-
sure of listening awareness.
Further Reading
• Discovering statistics using SPSS (Field, 2009)
• Exploratory factor analysis (Fabrigar & Wegener, 2012)
• Making sense of factor analysis: The use of factor analysis for instrument development
in health care research (Pett, Lackey, & Sullivan, 2003)
• Statistical techniques for the study of language and language behavior (Rietveld &
Van Hout, 1993)
• Exploratory factor analysis: A five-step guide for novices (Williams, Onsman, &
Brown, 2010)
Discussion Questions
1. In which kinds of L2 research do you think exploratory factor analysis can
be of importance?
2. Describe the differences between EFA and PCA.
3. What kinds of criteria can be used to ensure that the appropriate numbers of
factors are extracted? Why is it preferable to employ multiple factor retention
criteria?
4. What are some of the advantages and disadvantages of using rules of thumb
to check the factorability of the data?
5. Imagine that you carried out an EFA. Due to the page limitations of your
target journal, however, you are not able to justify your decisions or to report
all the results. Which results would you report?
6. Factor analysis is often contrasted with cluster analysis (see Staples & Biber,
Chapter 11 in this volume). In what ways are these two procedures similar?
In what ways are they different?
7. What are the advantages of performing an EFA rather than conducting mul-
tiple correlations?
8. Using the data provided on this book’s companion website (http://oak.ucc.
nau.edu/ldp3/AQMSLR.html), attempt to replicate the results from Loewen
et al. (2009). How do the results change if you alter some of the EFA options?
Note
1. The SPSS outputs of this study were used throughout this chapter.
References
Asención-Delaney, Y., & Collentine, J. (2011). A multidimensional analysis of a written L2
Spanish corpus. Applied Linguistics, 32, 299–322.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245–276.
Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ:
Lawrence Erlbaum.
Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor
analysis practices in organizational research. Organizational Research Methods, 6(2), 147–168.
Costello, A., & Osborne, J. (2005). Best practices in exploratory factor analysis: Four rec-
ommendations for getting the most from your analysis. Practical Assessment, Research &
Evaluation, 10(7), 1–9.
Fabrigar, L. R., & Wegener, D. T. (2012). Exploratory factor analysis. New York: Oxford Uni-
versity Press.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the
use of exploratory factor analysis in psychological research. Psychological Methods, 4,
272–299.
Field, A. (2009). Discovering statistics using SPSS. London: Sage.
Ford, J. K., MacCallum, R. C., & Tait, M. (1986). The application of exploratory factor
analysis in applied psychology: A critical review and analysis. Personnel Psychology, 39,
291–314.
Glorfeld, L. W. (1995). An improvement on Horn’s parallel analysis methodology for
selecting the correct number of factors to retain. Educational and Psychological Measurement,
55, 377–393.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Gorsuch, R. L. (1990). Common factor-analysis versus component analysis: Some well and
little known facts. Multivariate Behavioral Research, 25(1), 33–39.
Gorsuch, R. L. (2003). Factor analysis. In A. Schinka & W. F. Velicer (Vol. Eds.), Handbook
of psychology:Vol. 2. Research methods in psychology (pp. 143–164). Hoboken, NJ: Wiley.
Hair, J., Anderson, R. E., Tatham, R. L., & Black, W. C. (1995). Multivariate data analysis (4th
ed.). Upper Saddle River, NJ: Prentice Hall.
Harrington, D. (2009). Confirmatory factor analysis. Oxford: Oxford University Press.
Harshman, R. A., & Reddon, J. R. (1983). Determining the number of factors by compar-
ing real with random data: A serious flaw and some possible corrections. Proceedings of
the Classification Society of North America at Philadelphia, 14–15.
Hayton, J. C., Allen, D. G., & Scarpello, V. (2004). Factor retention decisions in exploratory
factor analysis: A tutorial on parallel analysis. Organizational Research Methods, 7(2),
191–205.
Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published
research: Common errors and some comment on improved practice. Educational and
Psychological Measurement, 66(3), 393–416.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational
and Psychological Measurement, 20, 141–151.
Kline, P. (2002). An easy guide to factor analysis. London: Routledge.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research
findings: What gets reported and recommendations for the field. Language Learning, 65,
Supp. 1, 125–157.
Loewen, S., & Gass, S. (2009). Research timeline: The use of statistics in L2 acquisition
research. Language Teaching, 42(2), 181–196.
Loewen, S., Li, S., Fei, F., Thompson, A., Nakatsukasa, K., Ahn, S., & Chen, X. (2009). Sec-
ond language learners’ beliefs about grammar instruction and error correction. Modern
Language Journal, 93, 91–104.
Loewen, S., Lavolette, B., Spino, L. A., Papi, M., Schmidtke, J., Sterling, S., & Wolff, D.
(2014). Statistical literacy among applied linguists and second language acquisition
researchers. TESOL Quarterly, 48, 360–388.
MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor
analysis. Psychological Methods, 4, 84–99.
Mizumoto, A., & Takeuchi, O. (2012). Adaptation and validation of self-regulating capacity
in vocabulary learning scale. Applied Linguistics, 33(1), 83–91.
Norman, G. R., & Streiner, D. L. (2003). PDQ statistics (3rd ed.). Hamilton: BC Decker.
Pett, M. A., Lackey, N. R., & Sullivan, J. J. (2003). Making sense of factor analysis: The use of
factor analysis for instrument development in health care research. Thousand Oaks, CA: Sage.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and report-
ing practices in quantitative L2 research. Studies in Second Language Acquisition, 35,
655–687.
Plonsky, L., & Gonulal, T. (2015). Methodological synthesis in quantitative L2 research:
A review of reviews and a case study of exploratory factor analysis. Language Learning,
65, Supp. 1, 9–35.
Rietveld, T., & Van Hout, R. (1993). Statistical techniques for the study of language and language
behavior. New York: Mouton de Gruyter.
Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York:
Routledge.
Tabachnick, B., & Fidell, L. (2013). Using multivariate statistics (6th ed.). Boston: Pearson
Education.
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understanding concepts and
applications. Washington, DC: American Psychological Association.
Tseng, W. T., Dörnyei, Z., & Schmitt, N. (2006). A new approach to assessing strategic
learning: The case of self-regulation in vocabulary acquisition. Applied Linguistics, 27,
78–102.
Velicer, W. F., Eaton, C. A., & Fava, J. L. (2000). Construct explication through factor or
component analysis: A review and evaluation of alternative procedures for determining the
number of factors or components. In R. D. Goffin & E. Helmes (Eds.), Problems and
solutions in human assessment: Honoring Douglas N. Jackson at seventy (pp. 41–71). Norwell,
MA: Kluwer Academic.
Widaman, K. F. (1993). Common factor analysis versus principal component analysis: Dif-
ferential bias in representing model parameters. Multivariate Behavioral Research, 28(3),
263–311.
Williams, B., Onsman, A., & Brown,T. (2010). Exploratory factor analysis: A five-step guide
for novices. Australasian Journal of Paramedicine, 8(3), n.p.
Winke, P. (2011). Evaluating the validity of a high-stakes ESL test: Why teachers’ percep-
tions matter. TESOL Quarterly, 45(4), 628–660.
Wittenborn, J. R., & Larsen, R. P. (1944). A factorial study of achievement in college Ger-
man. Journal of Educational Psychology, 35(1), 39.
10
STRUCTURAL EQUATION
MODELING IN L2 RESEARCH
Rob Schoonen
If there is one thing that we know in second language (L2) research, it is that there
are many factors involved in L2 learning and use. These factors are found in very
complex relationships, which may even change with increasing language profi-
ciency. These relationships are far more complex than what we can describe with
the computation of a series of simple bivariate correlations. L2 researchers have
to be able to deal with multivariate analyses of data. Structural equation model-
ing provides a framework to investigate these complex multivariate relationships.
Conceptual Motivation
Structural equation modeling (SEM), also known as causal modeling, covari-
ance structure analysis, or LISREL analysis, has as its distinguishing feature that
it requires some sort of modeling. Modeling implies that researchers need to be
explicit about the relationships they envisage between measured variables and
underlying constructs (i.e., latent variables) and between the constructs them-
selves. Therefore, a researcher has to think carefully about the hypothesized rela-
tionships before embarking on a SEM enterprise. SEM provides the researcher
with a toolbox that can uncover complex relationships that go well beyond the
bivariate relations as expressed in a correlation or a simple regression, but also
beyond the multivariate relationships that are usually addressed in a multiple
regression analysis (see Jeon, Chapter 7 in this volume).
SEM can be used at various stages of theory development, ranging from con-
firmatory testing to exploration. More specifically, Jöreskog and Sörbom (1996)
mention three situations for fitting and testing models. First is a strictly confirmatory situation, where there is a single model that is put to the test with empirical
data. The model is either accepted or rejected. Second is testing alternative or
competing models, when a researcher wants to choose between two or three con-
current models on the basis of a single data set. A third use is a model-generating
situation, when a researcher starts off with an initial model and then tries to
improve it on the basis of (mis)fit results (Jöreskog & Sörbom, 1996, p. 115).
The result of a model-generating situation should not be taken as a real statistical
testing of the (final) model, and the process of model improvement should not
only be guided by statistical outcomes but also by substantive theoretical consid-
erations. The resulting model should then be put to the test anew with different
data (creating a new, confirmatory situation).
The possibilities in a SEM analysis seem to be unlimited (see Hancock &
Schoonen, 2015), and the flexibility of the approach to address them makes SEM
a very attractive analytic framework, leading to an increase in recent years in the
use of SEM in L2 research (Plonsky, 2014). However, it is not difficult to imagine
that these options also carry the risk of using the technique uncritically (see the “Pitfalls” section in this chapter). Therefore, it is crucial that the user has theoretical
guidance with respect to the research questions he or she wants to investigate and
the analytic choices that need to be made. Lewin’s well-known quote that there
is “nothing so practical as a good theory” applies here for sure.
SEM is a collection of analyses that can be used to answer many research ques-
tions in L2 research. Prominent is the use of SEM to predict (or “explain”) complex
constructs, such as reading and writing proficiency, or the development of these
complex proficiencies, on the basis of scores on component skills. Other studies
investigate the complex relations between related constructs, such as motivation
and attitude toward foreign languages. At the initial stage of modeling these kinds of relationships, a researcher could start by drawing graphs depicting how constructs influence each other, or how they are related, using unidirectional or bidirectional arrows, respectively, to connect the constructs. To make it more concrete,
the constructs could be connected to measured, observed or manifest variables.
Conventionally, underlying or latent variables are represented as circles or ovals, and observed variables as rectangles (see Figures 10.2 and 10.3). SEM is also highly flexible,
able to deal with multiple dependent variables and multiple independent variables.
These variables can be continuous, ordinal, or discrete, and they can be indicated as
observed variables (i.e., observed scores) or as latent variables (i.e., the underlying
factor of a set of observed variables) (Mueller & Hancock, 2008; Ullman, 2006).
Examples of complex models in L2 studies can be found in, for instance, Gu (2014),
Schoonen, Van Gelderen, Stoel, Hulstijn, and De Glopper (2011) or Tseng and Schmitt (2008). Which measured and latent variables, and which relations, to include in the SEM analysis is up to the researcher. We should keep in mind that statistical
techniques per se cannot make substantive decisions. As is the case with nearly all
analyses described in this volume, SEM requires a number of choices to be made by
the researcher, and these choices must be made on solid theoretical grounds.
In the remainder of this chapter a number of examples will be presented to
illustrate the possibilities of SEM. Furthermore, a more detailed sample analysis will
be provided using two different software packages, LISREL and AMOS. Readers
interested in other packages or more extensive introductions to the available soft-
ware are referred to the corresponding manuals or specialized introductions (Byrne,
1998, 2006, 2010, 2012). Readers who want to learn more about SEM than this
chapter can offer, or who want to know more about the theoretical underpinnings
of SEM, will find suggestions for further reading at the end of this chapter.
[Figures (captions not recovered): path diagrams showing latent Working Memory variables measured by observed indicators LE1–LE3 and LD1–LD3]
When investigating the relationship between linguistic ability and some other construct, a researcher has
to decide whether linguistic ability can be measured by vocabulary, grammatical
knowledge, and pragmatic knowledge together or whether these three domains
should be kept separate and should be measured each on their own. This latter
type of research question is what is often treated as a confirmatory factor analysis
(CFA) problem (see Ockey, 2014). In other words: Do the measures involved
measure a single construct or do they measure multiple constructs?
Underlying Factors
When one wants to investigate the underlying structure of a set of variables, for
example the subtests of a test battery, one can use SEM to actually test hypotheses about the number of underlying factors and about their interrelations. Key is the testing of hypotheses, which implies that one has a priori one
or a few (competing) expectations that can be put to the test. This is differ-
ent from, for example, exploratory factor analysis (EFA) or principal component
analysis (PCA), where in a data-driven way the number of underlying factors
(or components) is determined according to a statistical criterion (Ockey, 2014;
Loewen & Gonulal, Chapter 9 in this volume). Using SEM, one has to model the
relationship between the measured variables and the hypothesized factors (i.e.,
latent variables) and subsequently test the fit of the model to the empirical data.
This makes it a CFA. An advantage of the SEM framework is that the relations
between selected factors can be modeled in the structural part of the model.
Imagine, for example, a second-language ability test battery that consists
of nine tests: Grammaticality Judgments (V1), Resolution of Anaphors (V2),
Understanding of Conjunctions (V3), Vocabulary Size (V4), Depth of Vocabu-
lary Knowledge (V5), Knowledge of Metaphors (V6), Sentence Comprehen-
sion (V7), Use of Verb Inflection (V8), and Use of Agreement (V9). A researcher
could question, for example, whether the nine test scores are best described (or explained) by a single underlying factor or by multiple distinct factors.
Advantages of SEM
The example in Figure 10.3 largely deals with the way one defines and measures
the theoretical variables (cf. CFA) and as such is considered part of the measure-
ment model. One of the advantages of SEM is that one can test the fit of the
hypothesized model against one’s data, and one can also compare and test the dif-
ference in fit between the two competing models described later in this chapter.
There are at least two other advantages to using SEM in these kinds of analy-
ses. First, researchers are more or less forced to come up with hypotheses about
relationships between their measurements (observed scores) and underlying con-
structs or latent variables. Most hypotheses in L2 research involve variables that
are not directly observable, such as language proficiency, working memory capac-
ity, speaking proficiency, and so on. However, in the actual empirical investigation
researchers want to test the tenability of their claims about these latent underlying
variables. Putting forward a measurement model makes this part of studies more
explicit and thus more open for empirical scrutiny and discussion. In some cases
theoretically relevant variables can be measured more directly, such as age or
parental education. In such cases, the observed and latent variables coincide.
FIGURE 10.3 Two competing models: a one-factor model (left: an L2 General linguistic factor measured by V1–V9) and a three-factor model (right: a Metacognitive, a Lexical-semantic, and a Morpho-syntactic factor measured by V1–V3, V4–V7, and V8–V9, respectively); each observed variable has its own measurement error (e1–e9)

Another advantage of SEM pertains to the more substantive analyses in the structural part of the model. Once one has modeled the collected data in a well-fitting measurement model, one can test substantive hypotheses with latent variables that
are so-called error-free. From Figure 10.3 one can see that the latent variables are
determined by the covariance of the different measured variables (V1–V9 in the
left panel or V1–V3, V4–V7 and V8–V9, respectively, in the right panel) and thus
that the idiosyncrasies of the measurements, including measurement error (e1–e9),
are partialed out (excluded). This way an analysis of the relations of latent variables
in the structural model, not being attenuated by measurement error, can provide
a clearer picture of what these relations are (see Mueller & Hancock, 2008, for
an example). In the structural part of our three-factor model, the researcher can
investigate whether the three factors simply covary as depicted in Figure 10.3 or
show more specific relations. For example: Is metacognitive knowledge the
result of lexical-semantic and morpho-syntactic proficiency? To test such a hypoth-
esis the relationship between the three factors should be modeled as regressions
(with one-directional arrows) in which metacognitive knowledge is the dependent
variable and lexical-semantic and morpho-syntactic proficiency are the indepen-
dent variables (analogous to Figure 10.2; see also Jeon, Chapter 7 in this volume).
Alternatively, one could also claim that the three factors are unrelated. This would
lead to a model without any connections between the three factors, or—in other
words—covariances of 0. Comparison of the fit of the various models to the avail-
able data as described later in this chapter will suggest which model is most plausible.
The previous example is—for practical reasons—kept simple, but numerous
multiple regression models with single as well as multiple dependent variables
in all kinds of different configurations can be analyzed if there are good substan-
tive reasons to do so (see Tseng & Schmitt, 2008; Schoonen et al., 2003; Gu,
2014). One could say that SEM elegantly combines factor-analytic procedures
with regression-analytic ones (and many more, see Hancock & Schoonen, 2015,
for examples in the L2 field; in addition, Rovine & Molenaar, 2003, show all
kinds of variance-analytic applications of SEM). However, this flexibility requires
substantial sample sizes, data that meet certain requirements, and a clear plan for the analyses, because the number of possibilities is sometimes overwhelming. In the next section, we will go into more detail as we discuss SEM
analyses step by step. First, we will focus on general principles and considerations
at the successive stages in SEM analyses. Second, we will have a closer look at
what an analysis looks like in two of the available packages for SEM analyses (see
the next section): LISREL, being one of the earlier and well-developed packages,
and AMOS, being part of the IBM SPSS family of packages.
Data Preparation
The data for the SEM analysis have to meet certain requirements for a straightfor-
ward analysis. For the procedures to work well and for the testing and parameter
estimation to be reliable, the continuous variables should be multivariate normally
distributed. Among other things (see Kline, 2010), this means that the individual
variables are univariate normally distributed. So, initial screening of the data is relevant for a valid interpretation of the outcomes of a SEM analysis. This includes checks on skewness and kurtosis of variables; outliers, too, can affect an analysis
in a detrimental way. Bivariate plots for pairs of variables give a first impression of
possible violations of a multivariate normal distribution. For an overview of mul-
tivariate assumptions and data preparation, see Jeon (Chapter 7 in this volume). If
data violate assumptions for SEM, especially multivariate normality, the researcher
can resort to other estimation methods within the SEM framework or apply cor-
rections to the outcome statistic (χ²) and the standard errors for the estimated
parameters (Satorra-Bentler’s scaled version). See West, Finch, and Curran (1995)
or Finney and DiStefano (2013) for an extensive discussion about the assumptions
in SEM and possible alternatives in case these assumptions are violated.
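To make the screening step concrete, the following sketch (Python, purely for illustration; any statistics package reports these values) computes sample skewness and excess kurtosis and shows how a strongly skewed variable would be flagged:

```python
import math
import random

def skewness(xs):
    """Sample skewness (third standardized moment)."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum((x - m) ** 3 for x in xs) / (n * s ** 3)

def excess_kurtosis(xs):
    """Sample excess kurtosis (fourth standardized moment minus 3)."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * s2 ** 2) - 3.0

random.seed(1)
normal = [random.gauss(0, 1) for _ in range(5000)]
skewed = [math.exp(x) for x in normal]   # log-normal: strongly right-skewed

assert abs(skewness(normal)) < 0.2       # roughly symmetric, as expected
assert skewness(skewed) > 1.0            # would be flagged during screening
assert excess_kurtosis(skewed) > 1.0     # heavy-tailed as well
```

In practice one would inspect these values for every observed variable, alongside bivariate plots, before fitting a model.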
In L2 research, as in other empirical domains, data sets are seldom complete.
There are several ways to deal with missing data, such as listwise deletion of cases
with missing data or estimation of a missing score on the basis of available scores.
Listwise deletion avoids controversial imputation of estimated scores. This approach,
however, is advisable only in cases where (a) data are assumed to be missing com-
pletely at random and where (b) the sample is large enough to endure the resulting
loss of statistical power. Imputation of missing values can be a good alternative, but
has its drawbacks as well. For example, replacing the missing score by the sample
mean will reduce the score variance, an important source of information in model-
ing. Fortunately, there are more advanced procedures for dealing with missing data.
Most software packages for SEM have their own, often very sophisticated, provisions for handling missing data, so it might be wise to consider their options (Kline, 2010; for a more thorough discussion see Enders, 2013). Working with incomplete
data implies that one works with the raw data (including missing value codes), and
not with just a correlation or covariance matrix as input data. However, using a
correlation or covariance matrix as the input data for an analysis is a viable option
if one wants to replicate analyses from the literature and only a covariance matrix
or a correlation matrix (preferably with corresponding means and standard devia-
tions) is available (see the next section and Discussion Question 8).
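The trade-off between listwise deletion and simple mean imputation can be illustrated with a toy example (hypothetical scores, for illustration only): deletion costs cases, while mean imputation keeps the cases but shrinks the variance.

```python
def variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

scores = [4, 5, None, 7, 8, None, 10]                 # two cases missing
complete = [x for x in scores if x is not None]       # listwise deletion
mean = sum(complete) / len(complete)
imputed = [mean if x is None else x for x in scores]  # mean imputation

assert len(complete) == 5                      # two cases are lost
assert variance(imputed) < variance(complete)  # imputation shrinks variance
```

This is why the more advanced, model-based procedures built into SEM packages are usually preferable to either simple strategy.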
Designing a Model
After preparing the data, the most exciting part of the analysis begins: design-
ing the model. This process should be guided by theoretical considerations and
expectations, and can best be split into two stages (Mueller & Hancock, 2008).
The first stage involves testing the measurement model, which helps us deter-
mine whether the presumed latent variables are measured by the observed test
scores in the expected way. At this stage no constraints are implemented regard-
ing the relationships among the latent variables, so that any misfit of the model
is due to the way the latent and observed variables were presumed to be related
in the model.
Latent variables, being unobserved, do not have a scale of their own. To solve
this, one can either standardize the latent variable by fixing its variance at 1 (cf.
z-values) or equate the scale to that of one of the observed variables, a so-called
reference variable. In the latter case the regression weight for the observed vari-
able on the latent variable is fixed at a value of 1. Both solutions are equivalent.
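That the two scaling solutions are indeed equivalent can be verified numerically. The sketch below (made-up loadings and error variances, purely for illustration) builds the model-implied covariance matrix for a single factor with three indicators under both identifications:

```python
def implied_cov(loadings, factor_var, error_vars):
    """Model-implied covariance matrix of a one-factor model:
    factor_var * (outer product of loadings) plus diagonal error variances."""
    n = len(loadings)
    return [[factor_var * loadings[i] * loadings[j]
             + (error_vars[i] if i == j else 0.0)
             for j in range(n)] for i in range(n)]

errs = [0.5, 0.4, 0.6]
# Identification 1: standardize the latent variable (variance fixed at 1).
sigma1 = implied_cov([0.8, 0.7, 0.9], 1.0, errs)
# Identification 2: reference variable (first loading fixed at 1, variance free).
sigma2 = implied_cov([1.0, 0.7 / 0.8, 0.9 / 0.8], 0.8 ** 2, errs)

same = all(abs(sigma1[i][j] - sigma2[i][j]) < 1e-12
           for i in range(3) for j in range(3))
assert same  # both identifications imply the same covariance matrix
```

Because the implied covariance matrices are identical, the two choices fit the data equally well; they differ only in how the estimates are scaled.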
If the fit of the measurement model is satisfactory (that is, the model fits well)
and all observed measures can—to a reasonable extent—be explained by their
underlying variables, one can move on to the second stage: modeling the relation-
ships among the latent variables. However, if the measurement model does not
fit satisfactorily, the relations between the measured variables and the underlying variables need to be reconsidered. A variable might not be related to the underlying variable(s) in the expected way, or a variable may show only a weak relation to the underlying variable(s). Validity and/or reliability issues could be involved if
a measured variable does not fit the hypothesized relations.
At the second stage, when the structural model is developed, one can test the
substantive hypotheses about the theoretical constructs, either as a single model
or as competing models that can be compared to select the best model. There are
often many possibilities for modeling relationships between variables, especially in
complex data sets. Therefore it is wise to make a plan for the analyses beforehand
to avoid getting side-tracked or running the risk of “overfitting” (i.e., continuously adjusting the model to the data). There is a thin line between testing models
and exploring for new ones. One easily enters the phase of explorations in which
test statistics lose their original interpretation and outcomes require replication.
The building blocks of a model are its parameters, which basically consist of variances and covariances (and, derived from these, correlations and regression weights). When modeling a
parameter, a researcher has three options. The first option is to fix a parameter at
a certain value; for example, a covariance can be set at 0 when it is hypothesized
that there is no covariance between two variables and the parameter does not
need to be estimated, or a variance can be set at 1 when one wants to standardize
a latent variable. If one wants to equate a latent variable’s scale to that of a refer-
ence variable, the regression (“factor loading”) of that particular observed vari-
able on the latent variable can be set at 1 to achieve that. As a second option, the
researcher can model a parameter to be “free” and the program will estimate the
value of the parameter such that it fits the data best. This may be the case when,
for example, it is assumed that there is a relationship between latent variables (e.g.,
Metacognitive knowledge, Lexical-semantic knowledge, and Morpho-syntactic
knowledge in the earlier example), and we want an estimate of the size of the
covariance. In such cases, the covariance parameter will be modeled as a free
parameter. A third way in which a parameter can be modeled is to constrain it to
be equal to another parameter. One can postulate that covariances, regressions,
and/or variances are equal. These options for modeling parameters apply to the
structural and measurement part of a model alike. For example, in a test develop-
ment project a researcher could be interested in the question of whether tests
A and B are parallel in a psychometric sense. This—among other things—means
that the error variance in A and B and the regressions for A and B on the latent
variable are equal to each other, respectively (cf. Bollen, 1989; see Schoonen,
Vergeer, & Eiting, 1997 for an application).
Evaluating model fit is not a simple yes/no matter, because there are multiple ways of assessing the fit of a model: a statistical way and many descriptive ways. The
analysis gives a chi-square (or related) statistic with a corresponding p-value
and degrees of freedom (df ). In conventional null hypothesis testing, research-
ers usually want to reject the null hypothesis (e.g., p < .05). However in SEM
analyses, most of the time one does not want to reject the model. This raises
the question of whether p-values simply greater than .05 suffice. This issue is
further complicated by the fact that the chi-square in SEM analyses is sensi-
tive not only to sample size, but also to the number of parameters that had to
be estimated. Most researchers use the chi-square statistic as a more descrip-
tive indicator of model fit than as a serious statistical significance test. A ratio
of less than 2 for χ² / df is considered a good fit (Kline, 2010; Ullman, 2007).
The degrees of freedom are derived from the number of observed variables in
the input and the number of parameters estimated in the model, and as such
they are also a good check on the model specification. One should be able to
forecast the degrees of freedom for one’s model in a SEM analysis. If the data
set under investigation consists of m variables, the covariance matrix consists of
m (m + 1) / 2 elements. From this number, the number of parameters has to be
subtracted to get the degrees of freedom. Of course, two parameters set to be
equal count as a single estimated parameter. Predicting the degrees of freedom
of one’s model before actually running the analysis is thus a check of the correct
implementation of the model.
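This bookkeeping can be checked in a few lines of code. The parameter counts below are a reconstruction of the example models used later in this chapter (assuming the latent variances are fixed at 1), and they reproduce the reported degrees of freedom:

```python
def dof(m, n_free_params):
    """Degrees of freedom of a covariance-structure model: unique elements
    in the covariance matrix minus free parameters. Two parameters
    constrained to be equal count as one."""
    return m * (m + 1) // 2 - n_free_params

# Nine observed variables -> 9 * 10 / 2 = 45 unique (co)variances.
# One-factor model: 9 loadings + 9 error variances (factor variance fixed).
assert dof(9, 18) == 27
# Three-factor model: the same, plus 3 factor covariances.
assert dof(9, 21) == 24
# Adding one correlated error frees one further parameter.
assert dof(9, 22) == 23
```

Forecasting these numbers before running the analysis, as suggested above, is a cheap safeguard against misspecified syntax.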
In addition to a chi-square value, a SEM analysis will provide the researcher
with many more descriptive fit indices. Some are based on the differences (residu-
als) between the input covariance matrix and the reproduced covariance matrix
(e.g., standardized root mean square residual, or SRMR). Other indices take the
number of estimated parameters into account as well; the more parsimonious
the model is (i.e., the fewer estimated parameters), the better (e.g., the root mean
square error of approximation, or RMSEA). Others are based on a comparison
between the fit of the tested model and a basic or “null” model that assumes the
variables to be unrelated (e.g., the nonnormed fit index, or NNFI, also known
as the Tucker-Lewis index, and the comparative fit index, or CFI). Different fit
indices weight different aspects of the model (sample size, number of parameters,
residuals, etc.) differently (see Kline, 2010). For most of these fit indices both
lenient and strict cutoff criteria can be found in the literature (Hu & Bentler,
1999). As a rule of thumb, the SRMR should be lower than .08, the RMSEA
lower than .06, and the CFI higher than .95 (Hu & Bentler, 1999). As with
determining the number of factors in EFA or the number of clusters in a cluster
analysis (see Loewen & Gonulal, Chapter 9 in this volume, and Staples & Biber,
Chapter 11 in this volume), multiple fit indices should be taken into account to
avoid overprioritizing one particular criterion.
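Two of these indices can be computed directly from the chi-square values a program reports. The sketch below reproduces the CFI of the one-factor model fitted later in this chapter; the chapter does not report the sample size, so the RMSEA line uses a hypothetical n = 300, which happens to be consistent with the reported value:

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index: improvement of the target model (M) over
    the independence ("null") baseline model (B)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, 0.0)
    return 1.0 - d_m / max(d_m, d_b)

def rmsea(chi2, df, n):
    """Root mean square error of approximation for a sample of size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# One-factor model vs. its independence model (df = 36), values from Text Box 2:
assert round(cfi(1177.53, 27, 3764.64, 36), 2) == 0.69
# With the hypothetical n = 300, the formula matches the reported RMSEA:
assert round(rmsea(1177.53, 27, 300), 2) == 0.38
```

Both values fall far short of the cutoffs given above (CFI > .95, RMSEA < .06), anticipating the conclusion drawn from the sample analysis.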
A third (additional) evaluation of a model consists of the inspection of the
model parameters themselves and the residuals. It could well be the case that,
generally speaking, a complex model fits the data well, but that at the same time
some “local” misfit exists. Therefore, a check of the residuals and of the meaning-
fulness of individual parameter estimates is advisable. Eyeballing the standardized
residuals (i.e., the standardized differences between the observed covariances of
the input variables and the reproduced covariances) may show outlying residuals
that indicate local misspecifications. In a similar vein, parameter estimates that are
illogical (such as a negative variance or a correlation out of the –1 to 1 range)
could flag a local misfit as well.
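A simple automated screen for such illogical estimates might look as follows (illustrative only; the estimate names and values are made up):

```python
def sanity_check(estimates):
    """Flag parameter estimates that are illogical and may signal local
    misfit: negative variances or correlations outside [-1, 1]."""
    problems = []
    for name, kind, value in estimates:
        if kind == "variance" and value < 0:
            problems.append(name)
        elif kind == "correlation" and not -1.0 <= value <= 1.0:
            problems.append(name)
    return problems

ests = [("error variance e5", "variance", -0.12),      # a Heywood case
        ("factor correlation F1-F2", "correlation", 0.65)]
assert sanity_check(ests) == ["error variance e5"]
```

Any flagged estimate warrants going back to the model specification rather than simply reporting the global fit indices.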
Pitfalls
One of the risks of using SEM is that researchers endlessly tweak a model,
helped by the so-called modification indices that indicate how the chi-square
will change if a certain fixed parameter is set free (Lagrange Multiplier test) or
if a free parameter is set fixed (Wald test). It is very tempting to attune a model according to these indices and thereby strive for more acceptable fit statistics. However, this is also a risky enterprise, because researchers are often
inclined to include relationships that are not theoretically supported, and after
a number of modifications the significance testing can no longer be seen as
real hypothesis testing and p-values become meaningless. The researcher might
end up with a hybrid model that most likely will not be replicable. If analyses
cannot be replicated, the study “might as well be sent to the Journal of Irreproduc-
ible Results or to its successor, The Annals of Improbable Research,” according to
Boomsma (2000, p. 464).
A more interesting and useful approach is to compare two competing mod-
els, preferably representing two stances in a theoretical debate. A comparison of
the fit of the two models could point to the model and the theoretical stance
that deserves our support. Consider, for example, the unitary, holistic view of
language proficiency versus the componential view mentioned earlier. A SEM
analysis of test scores could show that a multiple-factor model fits the data much
better than a one-factor model, and that multiple latent variables (components)
should be distinguished, favoring the componential view. Models that are hierarchically nested (i.e., the parameters of one model, A, form a subset of the parameters of the other, B) can be compared statistically with the chi-square difference test. The difference between the two models’ chi-squares is itself a chi-square, with degrees of freedom equal to the difference in the two models’ dfs (Δχ² = χ²A – χ²B; Δdf = dfA – dfB).
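The test is easy to carry out by hand or in a few lines of code. The values below anticipate the one-factor and three-factor models fitted later in this chapter; the .001 critical value for a chi-square with 3 df is about 16.27.

```python
def chi2_difference(chi2_restricted, df_restricted, chi2_free, df_free):
    """Chi-square difference test for two hierarchically nested models
    (the restricted model A vs. the less restrictive model B)."""
    return chi2_restricted - chi2_free, df_restricted - df_free

# One-factor model (1177.53, df = 27) vs. three-factor model (120.28, df = 24):
d_chi2, d_df = chi2_difference(1177.53, 27, 120.28, 24)
assert d_df == 3
assert round(d_chi2, 2) == 1057.25
# Far beyond the .001 critical value for 3 df (about 16.27), so the
# less restrictive three-factor model fits significantly better.
assert d_chi2 > 16.27
```
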
In all cases, it is considered best practice to report the steps taken in the devel-
opment of the ultimate model, which parameters were set to be fixed at a certain
value, which ones were freely estimated, and which ones were constrained to be
equal to another parameter (Mueller & Hancock, 2008). A model’s replicability
is one of the points that is stressed by Boomsma (2000), quoting Steiger’s (1990)
adage: “An ounce of replication is worth a ton of inferential statistics” (p. 176).
In the command lines, the equals sign (=) can be read as “is determined by.”
The pre-final line in Text Box 1 will result in a path diagram that depicts the
hypothesized model and as such provides a nice check on the specification of
the model. By default the program will provide ML estimates. However, data
requirements such as multivariate normality need to be met to get trustworthy
estimates (Kline, 2010). The estimation procedure can be changed from ML to,
for example, GLS by adding an extra SIMPLIS command line, Method of Estimation: General Least Squares, just above or under Path diagram in Text Box 1, or by selecting Output > Simplis outputs. This leads us to options for the method of
estimation and other output features. Of course, there are many more options for
analyses and kinds of output LISREL can produce than can be demonstrated here
(see Jöreskog & Sörbom, 1996–2001 for more detailed descriptions).
The analysis is run by clicking the Run LISREL button in the top bar. If there
are no serious misspecifications or syntactical errors, the model will show the path
diagram with the estimates. One can switch to the output file with all the details
by means of the Window button. The LISREL output file that results from the
analysis echoes the command lines and the covariance matrix for reference. The
most important part of the outcomes consists of the parameter estimates with
their standard errors and the indices for model fit. In this example, fit indices as
reported in Text Box 2 indicate that the model should be rejected and does not
fit the data very well. None of the aforementioned fit indices that are reported
for the one-factor model comes close to the recommended cutoff for good fit.
Degrees of Freedom = 27
Minimum Fit Function Chi-Square = 1177.53 (P = 0.0)
(. . .)
Root Mean Square Error of Approximation (RMSEA) = 0.38
90 Percent Confidence Interval for RMSEA = (0.36 ; 0.40)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.00
(. . .)
Chi-Square for Independence Model with 36 Degrees of
Freedom = 3764.64
(. . .)
Normed Fit Index (NFI) = 0.69
Non-Normed Fit Index (NNFI) = 0.59
Parsimony Normed Fit Index (PNFI) = 0.52
Comparative Fit Index (CFI) = 0.69
Incremental Fit Index (IFI) = 0.69
(. . .)
In a similar way one can build a three-factor model; that is, one has to
replace the last six lines of the setup as represented in Text Box 1 and intro-
duce three latent variables (instead of one): Metacognition, Lexical-Semantic,
and Morpho-Syntactic Knowledge (see Text Box 3). Working with the LIS-
REL menu, one can add and rename labels for latent variables via Setup >
Variables as illustrated earlier, and then redesign the model accordingly in
the upper panel (see Figure 10.7). This model specification can be fitted to
the data by clicking the Run LISREL button in the top bar. The results show
that a three-factor model is far more realistic and that it fits the data much better, although still not well. The fit indices (see Text Box 4) come
close to the required level for good fit. Statistically speaking, the model has to
be rejected (χ² = 120.28, df = 24), but it constitutes an enormous improve-
ment compared to the first model (χ² = 1,177.53, df = 27). At the “cost” of
three extra estimated parameters (these are the covariances between the latent
variables), the reduction in chi-square is remarkable and statistically signifi-
cant (Δχ² = 1,057.25, Δdf = 3, p < .001), which means that the less restrictive
three-factor model is preferred. The RMSEA, which reduced from .38 to .12,
however, indicates that the model fit is still not satisfactory. The normed fit index (NFI) and the CFI both show a noticeable increase (from .69 to .97) and both are satisfactory. The SRMR dropped from .22 to .041, which is in the
range of acceptable models.
Degrees of Freedom = 24
Minimum Fit Function Chi-Square = 120.28 (P = 0.00)
( . . . )
Root Mean Square Error of Approximation (RMSEA) = 0.12
90 Percent Confidence Interval for RMSEA = (0.097 ; 0.14)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.00
( . . . )
Chi-Square for Independence Model with 36 Degrees of
Freedom = 3764.64
( . . . )
Normed Fit Index (NFI) = 0.97
Non-Normed Fit Index (NNFI) = 0.96
Parsimony Normed Fit Index (PNFI) = 0.65
Comparative Fit Index (CFI) = 0.97
Incremental Fit Index (IFI) = 0.97
( . . . )
Root Mean Square Residual (RMR) = 1.56
Standardized RMR = 0.041
( . . . )
It depends on the research context whether the researcher can defend additional theoretically supported model improvements, or whether he or she enters the phase of exploration.
For the sake of demonstration, let us assume that all test scores but two are derived from separate test administrations. The exceptions are V5 and V7, which are subtest scores derived from one and the same test. As a consequence, distur-
bances during that test will affect both scores. In other words, there might be
so-called correlated error. This phenomenon can be modeled by allowing covari-
ance between the two residuals concerned (e5 and e7); in other words, add the line
Let error covariance between V5 and V7 be free in the model specification. A final
analysis shows that this extra free parameter in the model substantially improves fit (χ² = 71.16, df = 23, RMSEA = .08, NFI = .98, CFI = .99, SRMR = .034). Not all
indices are completely satisfactory for this model (χ² / df > 2, RMSEA = .08) but
if there are no more plausible parameters to add, the researcher might want to stop
here and inspect the parameter estimates. When the parameter estimates are logi-
cal and within the normal ranges (for example, no negative estimates of variance),
then the researcher can start the substantive interpretation. In this simple model, the important result is that the nine observed variables are explained to a large extent by the three presumed latent variables. The coefficients of determination (R²) range from .62 to .93, which is reasonably good (see Text Box 5). From a theoretical
point of view the correlations between the latent variables are interesting: How
high are they? Are they different from 0 and—at the other end—sufficiently dif-
ferent from 1? In this case, LISREL reports .31 (.05), .65 (.04), and .63 (.04), with the corresponding standard errors in parentheses for use in confidence intervals and/or significance testing. When the standard errors are taken into account, it can be concluded that the estimates are (statistically) different from both 0 and 1. In this example, the focus was on the latent variables underlying the nine observed variables. In a next step, or when addressing different research questions, one could investigate
whether claims about “causal” relations between the three latent variables of the
kind illustrated in Figure 10.2 can be maintained. One may want to test whether
metacognitive knowledge is the result of lexical-semantic and morphosyntactic
knowledge. To address that question the regression of Metacognitive Knowledge
on the Lexical-Semantic and the Morphosyntactic factors should be specified.
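The reasoning about standard errors above can be made concrete with Wald-type statistics. The following sketch uses the reported estimates and standard errors under the usual normal-theory approximation; it is an illustration, not LISREL output:

```python
# Wald-type checks for the latent correlations reported above, given as
# (estimate, standard error) pairs. A normal-theory sketch; near the
# boundary of 1 this approximation is rough.

def wald_z(estimate, se, null_value):
    """z statistic for H0: parameter == null_value."""
    return (estimate - null_value) / se

def ci95(estimate, se):
    """Approximate 95% confidence interval."""
    return (estimate - 1.96 * se, estimate + 1.96 * se)

for est, se in [(0.31, 0.05), (0.65, 0.04), (0.63, 0.04)]:
    z0, z1 = wald_z(est, se, 0.0), wald_z(est, se, 1.0)
    lo, hi = ci95(est, se)
    print(f"r = {est}: z(vs 0) = {z0:.1f}, z(vs 1) = {z1:.1f}, "
          f"95% CI = ({lo:.2f}, {hi:.2f})")
```

The interval for the .65 correlation, for instance, runs from roughly .57 to .73, comfortably away from both 0 and 1.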
Degrees of Freedom = 23
Minimum Fit Function Chi-Square = 71.16 (P = 0.00)
(. . .)
Root Mean Square Error of Approximation (RMSEA) = 0.080
90 Percent Confidence Interval for RMSEA = (0.060 ; 0.10)
P-Value for Test of Close Fit (RMSEA < 0.05) = 0.0079
(. . .)
Chi-Square for Independence Model with 36 Degrees of
Freedom = 3764.63
(. . .)
Normed Fit Index (NFI) = 0.98
Non-Normed Fit Index (NNFI) = 0.98
Parsimony Normed Fit Index (PNFI) = 0.63
Comparative Fit Index (CFI) = 0.99
Incremental Fit Index (IFI) = 0.99
(. . .)
Root Mean Square Residual (RMR) = 1.32
Standardized RMR = 0.034
(. . .)
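Because the model with the free error covariance is nested in the model without it, the two chi-squares reported above can be compared with a likelihood-ratio difference test. A small sketch, using only the Python standard library (for one degree of freedom, the upper-tail chi-square probability has the closed form erfc(√(x/2))):

```python
# Chi-square difference test for the nested models above: without the free
# error covariance (chi2 = 120.28, df = 24) versus with it
# (chi2 = 71.16, df = 23).
import math

def chi2_sf_df1(x):
    """Upper-tail probability of the chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

delta_chi2 = 120.28 - 71.16   # difference in chi-square
delta_df = 24 - 23            # difference in degrees of freedom
p = chi2_sf_df1(delta_chi2)
print(f"delta chi2 = {delta_chi2:.2f}, df = {delta_df}, p = {p:.2e}")
```

Note that with Satorra-Bentler scaled chi-squares, such as those Gu (2014) reports, a simple difference of this kind is not appropriate; a scaled difference test is needed.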
The same analyses can be done in AMOS by drawing the required model
with the tools provided in the program. The opening screen of AMOS (Graph-
ics) consists of three parts, with the left-most panel showing a toolbox for model
drawing. Hovering the cursor over an icon in the toolbox shows its function. From this panel one can select the tools needed for drawing the model: circles and boxes for latent and measured variables, respectively; single- and double-headed arrows; a tool to add measured variables to a latent variable; and an eraser to delete parts of a model. Once a model is designed, one can import the data by clicking the corresponding toolbar button, selecting Filename, and then browsing the computer for the right data file (see Figure 10.8). By default this is an SPSS file, but other formats can be read as
well. All variables in the model need to be named, and the measured variables in
the model need to be linked to variables in the data file. Double-clicking circles
lets you key in names for latent variables. Note that the “errors” need to be named
as well, for example E1 through E9, because they are treated as latent variables in
AMOS. Desired features of the analysis or its outcomes, such as ML estimation and standardized parameter estimates, can be set in the Analysis Properties menu, which you can access from the toolbar. If the model is fully designed, the data and variables
234 Rob Schoonen
are included, and the features for the analysis are set, the Calculate button can
be clicked. The two top buttons in the middle panel now allow the researcher
to toggle between a representation of the model as designed (i.e., input) and a
representation of the model with parameters (i.e., output). However, the details of the analysis, such as fit indices, standard errors, and possible warnings, are provided in text form. Clicking View Text provides access to the text file, with a table of contents (navigation tree) at the left and the corresponding results at the right. Figure 10.9 shows the fit indices for our model with three factors and correlated error. The chi-square is identical to that of the LISREL analysis (71.16), as are the fit indices. In AMOS, fit indices are reported alongside those of an independence model and a saturated model. The model of interest is the Default model, labeled this way because we did not enter a name for it.
This has been only a brief introduction to the possibilities of AMOS and LISREL. Readers who wish to embark on SEM analyses are best advised to familiarize themselves with the software manual, which is usually embedded in the package under Help, or to consult more extensive introductions aimed at a particular package (see Byrne, 1998, 2010).
FIGURE 10.9 Output file three-factor model with correlated error in AMOS
In Text Box 6 we briefly present parts of a recent study that uses SEM in
various ways. Here we focus on the underlying structure of the TOEFL iBT that
Gu investigated as part of her doctoral dissertation. In the dissertation and the
article (Gu, 2014), a multigroup analysis was conducted to investigate whether
the underlying structure holds for two different groups, and whether level of
performance was related to studying abroad.
Background
Gu (2014) investigated the structure of scores on the Internet-based Test of English as a Foreign Language (TOEFL iBT). This study combines several
Research Questions
1) Is the factorial structure of academic language ability the same for stu-
dents who have studied abroad and students who have not done so (a
study-abroad group versus a home-country group)?
2) Do the two groups differ in their scores on the underlying factors (i.e.,
latent variables) of academic English?
3) Is there a relationship between length of study abroad and the level on
the underlying factors?
Here we focus on Research Question 1.
Method
The data consisted of the test scores and questionnaire responses of 1,000 and 370 test takers, respectively. The subsample that answered the questionnaire was split into two groups: (a) test takers who had never lived in an English-speaking environment (n = 124) and (b) test takers who had lived in such an environment (n = 246). Data for the present analysis were based on the test scores of 1,000 candidates for listening, reading, writing, and speaking. From the questionnaire data, Gu derived information about exposure to the English language and instruction.
Using the Mplus SEM package (Muthén & Muthén, 2010), Gu explicitly reports checks of relevant assumptions such as normality. Since some score distributions deviated from normality, Gu opted for an adjusted estimation of the chi-square, the derived indices, and the standard errors of the parameters (the Satorra-Bentler correction). The scale for each latent variable was determined by using a reference variable and fixing its loading on the latent variable to 1.
Results
Gu postulated three plausible models for the structure of the four skills.
The fit of these models and the comparison thereof was used to choose the
best model. Model 1 follows the scoring procedure of TOEFL-iBT and previ-
ous research. It consists of four factors representing the four skills and one
higher-order, overarching factor (“Language Ability”) that is supposed to
capture the correlations between the four skills. Model 2 is a straightforward
four-factor model with intercorrelated factors, one for each skill. Model 3
Structural Equation Modeling 237
consists of two factors: “Speaking” on the one hand and “Reading, Writing,
and Listening” on the other. This latter model is based on previous research,
but is theoretically speaking less transparent (see Gu’s Figure 4, reproduced
below).
Model fit was evaluated in several ways, as it should be: overall fit (chi-square test, CFI, RMSEA, SRMR), evaluation of parameter estimates, and parsimony for equally well-fitting models. The SEM analyses showed reasonable fit for all three models, with Model 3 fitting somewhat less well.
[Gu's Figure 4, reproduced: path diagram of Model 3, with indicators L1, L2, L6, R3, and W2 loading on a combined Listening/Reading/Writing factor and S1, S2, S5, and S6 loading on a Speaking factor, each indicator with its own residual term. Reported fit: estimation = MLM; N = 1,000; χ²(118) = 530.73; CFI = .96; RMSEA = .06; SRMR = .04.]
In Sum
SEM is a flexible approach to data analysis, especially for larger data sets that represent more complex relationships. The possibilities for applying SEM are enormous, but the substantive interpretation of models and parameter estimates depends heavily on carefully conducted analyses that take into account data requirements and the risk of overfitting the model.
Further Reading
There are many different introductions and advanced volumes dealing with SEM.
A good starting point could be the manual of the software package that one
wants to use. The manual can provide a quick introduction to the theoretical
considerations, many of which are only touched upon here. Byrne (1998, 2006,
2010, 2012) wrote different introductions for different software packages (Mplus,
LISREL, EQS, AMOS). More general introductions include Raykov and Mar-
coulides (2006), Kline (2010), Mueller & Hancock (2008) and Ullman (2007).
These volumes also cover some of the more advanced applications, such as multi-
group analysis in which models are fitted simultaneously in two (or more) groups
(for example, boys and girls, L1 and L2 speakers, or study-abroad and study-home
as in Gu’s study), or latent growth modeling in which different curves of develop-
ment can be modeled and related to predictor variables. Hancock and Mueller
(2013) provide in their edited volume what they call a “second course,” that is, the
contributions take the applications a step further and deal with topics like missing
data, categorical data, power analysis, and so forth.
There is also a journal dedicated to structural equation modeling that pub-
lishes applications from all fields, discusses methodological issues, and has a
“teacher’s corner” that presents brief instructional articles on SEM-related issues:
Structural Equation Modeling: A Multidisciplinary Journal (ISSN 1070–5511 [Print],
1532–8007 [Online]).
There are also a number of introductions and applications in the field of applied
linguistics and language assessment; see Hancock and Schoonen (2015), Kunnan
(1998), Schoonen (2005), In’nami and Koizumi (2011, 2012), and Ockey (2014).
Discussion Questions
1. Select a study that uses SEM and read the abstract, introduction, and research
questions. On the basis of your reading, draw the model you expect the research-
ers to test. In what respect does your model diverge from the model actually
tested? To what extent can you understand the differences between your model
and the author’s? Are there any unexpected differences and are these motivated
(a priori or post hoc)? How logical are the unexpected differences?
2. Select a study that uses SEM and that postulates correlated error. Are these
parameters well explained in terms of the measurement procedures?
3. Select two SEM studies. What criteria do they use for model fit? Do they use criteria from different families of fit indices? Are there any other differences between the two studies? If you were to apply the criteria from one study to the other, would that affect the model selection (and conclusions) in the other study? How so?
4. It is claimed that the correlations between latent variables are not attenuated
by measurement error. Can you corroborate that on the basis of the data in
Text Box 1? What is the average correlation between the observed variables
for Metacognitive Knowledge (V1–V3) and observed variables for Morpho-
syntactic Knowledge (V8–V9)? How does that compare to the .65 reported
for the correlation between the latent variables?
5. Using the data set made available along with this chapter (http://oak.ucc.nau.
edu/ldp3/AQMSLR.html), explore whether another structural model for the
three latent variables in the sample analysis is plausible (e.g., Metacognitive
Knowledge as the result of the two latent linguistic variables). How plausible is
a model with Metacognitive Knowledge independent of the two latent linguis-
tic variables? Try to model these “hypotheses” and fit the models to the data.
6. How could you test whether the two latent linguistic variables coincide? In
other words, test a two-factor model with a metacognitive factor (V1–V3)
and a linguistic factor (V4–V9). How does this model compare to the
one-factor model? To the three-factor model?
7. SEM and factor analysis have a lot in common. What similarities and dif-
ferences between the two approaches can you think of ? When would one
approach be more appropriate or informative than the other?
8. Gu (2014) provides the correlation matrix of the measured variables involved in the models, as well as descriptive statistics. By doing so, the author allows you to replicate her analysis (consult the AMOS manual for importing a matrix). You can start a LISREL analysis with the setup provided in Text Box 1, and then continue by adjusting it. Choose your own title, define the observed variables (L1–W2), insert “correlation matrix” and replace the matrix with Gu's matrix, change the sample size, define your latent variables, and specify the relations (see also Text Box 3). As you probably know, correlations are standardized covariances, and the standardization is based on the standard deviations of the two variables involved (see Kline, 2010). LISREL can derive the covariances from the correlations on the basis of the standard deviations. So add another command, just above or below the correlation part, that starts with “Standard deviations” and then, on the next line, list all the standard deviations. Now replicate Models 2 and 3 from Gu's study (i.e., the correlated four- and two-factor models).1 What do you find? There will be small differences due to slightly different algorithms, but the overall outcome should be highly similar. The difference in chi-square is also due to a correction Gu applied to account for the slightly nonnormal data she had. It is beyond the scope of this chapter to go into the details.
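The conversion from correlations to covariances described in question 8 can be sketched in a few lines: each covariance is the correlation multiplied by the two standard deviations. The matrix and standard deviations below are made up for illustration and are not Gu's values:

```python
# Deriving a covariance matrix from a correlation matrix and standard
# deviations: cov[i][j] = r[i][j] * sd[i] * sd[j]. The 3 x 3 inputs here
# are invented for illustration (not Gu's data).

def corr_to_cov(corr, sds):
    n = len(sds)
    return [[corr[i][j] * sds[i] * sds[j] for j in range(n)]
            for i in range(n)]

corr = [[1.0, 0.5, 0.3],
        [0.5, 1.0, 0.4],
        [0.3, 0.4, 1.0]]
sds = [2.0, 3.0, 1.5]

cov = corr_to_cov(corr, sds)
for row in cov:
    print([round(v, 2) for v in row])  # diagonal holds the variances
```

With Gu's matrix and standard deviations substituted in, this is the arithmetic LISREL performs when the “Standard deviations” command is supplied.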
Note
1. If you work with LISREL's student version, then you are restricted to 16 observed variables, whereas Gu (2014) has 17. You could either delete the first variable, L1 for Listening, or resort to the 15-day trial version of LISREL. If you delete L1, your results will of course differ, as will the degrees of freedom. Can you predict df?
Acknowledgment
The author wishes to thank Jan Hulstijn, Camille Welie, Luke Plonsky, and two anonymous
reviewers for their helpful comments. All remaining errors are the author’s.
References
Arbuckle, J. L. (2012). IBM® SPSS® AMOS™ 21 User’s Guide. Chicago: IBM Software
Group.
Bentler, P.M. (2006). EQS 6 Structural Equations Program Manual. Encino, CA: Multivariate
Software.
Bollen, K. A. (1989). Structural equations with latent variables. New York: John Wiley & Sons.
Boomsma, A. (2000). Reporting analyses of covariance structures. Structural Equation Mod-
eling: A Multidisciplinary Journal, 7(3), 461–483.
Byrne, B. M. (1998). Structural equation modeling with LISREL, PRELIS, and SIMPLIS: Basic
concepts, applications, and programming. Mahwah, NJ: Lawrence Erlbaum.
Byrne, B. M. (2006). Structural equation modeling with EQS: Basic concepts, applications, and
programming (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Byrne, B. M. (2010). Structural equation modeling with AMOS: Basic concepts, applications, and
programming (2nd ed). New York: Taylor & Francis.
Byrne, B. M. (2012). Structural equation modeling with Mplus: Basic concepts, applications, and
programming. New York: Taylor & Francis.
Enders, C. K. (2013). Analyzing structural equation models with missing data. In G. R.
Hancock & R. O. Mueller (Eds.), Structural equation modeling. A second course (2nd ed.,
pp. 493–519). Charlotte, NC: Information Age Publishing.
Finney, S. J., & DiStefano, C. (2013). Nonnormal and categorical data in structural equation
modeling. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling. A sec-
ond course (2nd ed., pp. 439–492). Charlotte, NC: Information Age Publishing.
Fox, J. (2006). Structural equation modeling with the sem package in R. Structural Equation
Modeling, 13(3), 465–486.
Gu, L. (2014). At the interface between language testing and second language acquisition:
Language ability and context of learning. Language Testing, 31(1), 111–133.
Hancock, G. R., & Mueller, R. O. (Eds.) (2013). Structural equation modeling. A second course
(2nd ed.). Charlotte, NC: Information Age Publishing.
Hancock, G. R., & Schoonen, R. (2015). Structural equation modeling: Possibilities for
language learning researchers. Language Learning, 65: Suppl 1, 158–182.
Hu, L., & Bentler, P.M. (1999). Cutoff criteria for fit indexes in covariance structure anal-
ysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1),
1–55.
In’nami, Y., & Koizumi, R. (2011). Structural equation modeling in language testing and
learning research: A review. Language Assessment Quarterly, 8(3), 250–276.
In’nami, Y., & Koizumi, R. (2012). Factor structure of the revised TOEIC® test: A
multiple-sample analysis. Language Testing, 29(1), 131–152.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: Structural equation modeling with the SIM-
PLIS command language. Chicago: Scientific Software International.
Jöreskog, K. G., & Sörbom, D. (1996–2001). LISREL 8: User’s Reference Guide (2nd ed.).
Lincolnwood, IL: Scientific Software International.
Kline, R. B. (2010). Principles and practice of structural equation modeling (3rd ed.). New York:
The Guilford Press.
Kunnan, A. J. (1998). An introduction to structural equation modeling for language assess-
ment research. Language Testing, 15(3), 295–332.
Mueller, R. O., & Hancock, G. R. (2008). Best practices in structural equation modeling.
In J. Osborne (Ed.). Best practices in quantitative methods (pp. 488–508). Thousand Oaks,
CA: Sage.
Muthén, L. K., & Muthén, B. O. (2010). Mplus user’s guide. Statistical analysis with latent vari-
ables (6th ed.). Los Angeles: Muthén & Muthén.
Ockey, G. J. (2014). Exploratory factor analysis and structural equation modeling. In A. J.
Kunnan (Ed.), The companion to language assessment. Vol. III: Evaluation, Methodology,
and Interdisciplinary Themes (pp. 1224–1244, Part 10, Chapter 73). Malden, MA: John
Wiley & Sons.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodologi-
cal synthesis and call for reform. Modern Language Journal, 98, 450–470.
Raykov, T., & Marcoulides, G. A. (2006). A first course in structural equation modeling (2nd ed.).
Mahwah, NJ: Erlbaum.
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Rovine, M. J., & Molenaar, P.C.M. (2003). Estimating analysis of variance models as struc-
tural equation models. In B. H. Pugesek, A. Tomer, & A. von Eye (Eds.), Structural equa-
tion modeling: Applications in ecological and evolutionary biology (pp. 235–280). Cambridge:
Cambridge University Press.
Schoonen, R. (2005). Generalizability of writing scores. An application of structural equa-
tion modeling. Language Testing, 22 (1), 1–30.
Schoonen, R., Van Gelderen, A., De Glopper, K., Hulstijn, J., Simis, A., Snellings, P., &
Stevenson, M. (2003). First language and second language writing: the role of linguistic
fluency, linguistic knowledge and metacognitive knowledge. Language Learning, 53(1),
165–202.
Schoonen, R., Van Gelderen, A., Stoel, R., Hulstijn, J., & De Glopper, K. (2011). Model-
ing the development of L1 and EFL writing proficiency of secondary-school students.
Language Learning, 61, 31–79.
Schoonen, R., Vergeer, M., & Eiting, M. (1997). The assessment of writing ability: Expert
readers versus lay readers. Language Testing, 14(2), 157–184.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation
approach. Multivariate Behavioral Research, 25(2), 173–180.
Tseng, W.-T., & Schmitt, N. (2008). Toward a model of motivated vocabulary learning:
A structural equation modeling approach. Language Learning, 58(2), 357–400.
Ullman, J. B. (2006). Structural equation modeling: Reviewing the basics and moving for-
ward. Journal of Personality Assessment, 87(1), 35–50.
Ullman, J. B. (2007). Structural equation modeling. In B. G. Tabachnick & L. S. Fidell (Eds.),
Using multivariate statistics (5th ed., pp. 676–780). Boston: Pearson/Allyn and Bacon.
West, S. G., Finch, J. F., & Curran, P. J. (1995). Structural equation models with nonnormal
variables. Problems and remedies. In R. H. Hoyle (Ed.), Structural equation modeling. Con-
cepts, issues, and applications (pp. 56–75). Thousand Oaks, CA: Sage.
11
CLUSTER ANALYSIS
Shelley Staples and Douglas Biber
Conceptual Motivation
Research in applied linguistics typically involves comparisons among groups of
speakers. Those groups can be defined in terms of many different types of cat-
egorical variables, such as students from different first language (L1) backgrounds,
or students in a treatment group versus a control group. Those groups are then
usually compared with respect to quantitative (dependent) variables, such as per-
formance scores on a language test. It is often the case, though, that there is
considerable variation within these groups. For example, while there might be
significant differences in language test scores between a treatment group and a
control group, there will also often be considerable variation among students
within each of those groups. Cluster analysis can be useful for situations like
this, because it provides a bottom-up way to identify new groups that are better
defined with respect to target variables.
Cluster analysis is a multivariate exploratory procedure that is used to group
cases (e.g., participants or texts). Cluster analysis is useful in studies where there is
extensive variation among the individual cases within predefined categories. For
example, many researchers compare students across proficiency level categories,
defined by their performance on a test or holistic ratings. But a researcher might
later discover that there is extensive variation among the students within those
categories with respect to their use of linguistic features or with respect to attitu-
dinal or motivational variables. Cluster analysis provides a complementary way to
group students based directly on such variables. So, for example, cluster analysis
could be used to identify groups of students with positive attitudes and intrinsic
motivations; a group with positive attitudes and extrinsic motivations; a group
with negative attitudes and intrinsic motivations; and so on. Those new categories could then be described and compared with respect to a range of other linguistic
244 Shelley Staples and Douglas Biber
division of two clusters of L2 speakers: one cluster that primarily used formulaic
language to achieve fluency and the other that used a variety of other strategies,
including filled pauses, repetitions, and discourse markers. Jarvis, Grant, Bikowski, and Ferris (2003) and Friginal, Li, and Weigle (2014) are also innovative in their exploration of multiple linguistic profiles of high-scoring L2 writers.
Cluster analysis can also be used to investigate the linguistic development of
learners over time by determining how linguistic features cluster within texts
produced by learners at various points in time. Gries and Stoll (2009), for exam-
ple, focus on clustering individual performances by a single speaker based on
changes in one linguistic feature—mean length of utterance (MLU)—over time.
By identifying performances that cluster together, clear developmental stages can
be identified in the data. This method could also be applied to longitudinal studies
of L2 development and to additional variables (e.g., development of the lexicon).
Two other areas within L2 research where cluster analysis has been applied
are L2 assessment and language policy and planning. Eckes (2012) used cluster
analysis to determine rater types based on their behavior in rating a high-stakes
German foreign language test. Leung and Uchikoshi (2012) investigated the language planning and policy profiles (e.g., language use in the home) of parents of bilingual Cantonese- and English-speaking children in relation to the children's proficiency in each language.
Other studies outside the field of L2 research point to different ways in which
linguistic variables can be used to cluster texts (oral and written). First, studies of
register variation have been very fruitful in identifying text types, which are based
on the clustering of texts that are similar in their use of linguistic variables. For
example, Biber (1989), Biber and Finegan (1989), Biber (1995), and Biber (2008) have all investigated a wide range of lexico-grammatical features to determine clusters of texts that are similar in their linguistic characteristics, and then examined those groupings in relation to established text categories (e.g., scientific writing,
face-to-face conversation). Csomay (2002) used cluster analysis to identify different functional episode types within classroom discourse, that is, sections of text clustered on the basis of their similar linguistic features; four episode types were identified. Gries, Newman, and Shaul (2011) provide an example of how texts can be
grouped by their use of frequent lexical strings (i.e., n-grams or lexical bundles).
Text-linguistic applications of cluster analysis may also be useful to L2 researchers,
as they reveal information about the linguistic features used in particular domains
of language use. Such findings, similar to those for factor analysis, allow ESP and
EAP researchers, teachers, and materials developers to understand more about the
linguistic nature of particular registers of a language. This same approach could be
used in L2 research, clustering texts of learner production that are similar in their
linguistic characteristics, and then considering the relation of those clusters to a
priori categories such as task features or different proficiency levels. We explore
a study of this type in the next section, which documents the process used to
perform cluster analysis.
each of the predictor variables (e.g., test scores, Likert scale items) will be added
as columns. There may be other variables not included in the analysis (e.g., pro-
ficiency level) that will be included in the data set as columns but not added to
the HCA. The first step is to select Analyze > Classify > Hierarchical Cluster
Analysis, as shown in Figure 11.1.
of clusters. The Cluster Membership feature produces output that identifies the
cluster of each case. At this stage of the analysis, we are trying to determine the
optimal number of clusters, so you should choose None under Cluster Membership.
Cluster membership will be identified later (see Step 9) using the Save function.
Click Continue when finished.
clustering, and Ward’s method. We will use Ward’s method, but here provide a
short explanation of each of the other options. Based on a review of the literature,
Ward’s method is the most commonly used measure within HCA (see, e.g., Eckes,
2012; Gries et al., 2011; Leung & Uchikoshi, 2012).
The simplest method is the nearest neighbor (also known as single linkage)
method. In this method, cases are joined to existing clusters if at least one of the
members of the existing cluster is of the same level of similarity as the case under
consideration for inclusion (Aldenderfer & Blashfield, 1984, p. 38). The major
advantage of this method is that the results will not be affected by data transfor-
mations. The major disadvantage of this method is that it tends to form chains
of linkage within the data such that toward the end of the clustering, one large
cluster may eventually be formed with individual cases being added one by one.
Visual examination of the data is also not very helpful (Aldenderfer & Blashfield,
1984, pp. 39–40).
The furthest neighbor or complete linkage method indicates that the new case
is added to an existing cluster if it is within a certain level of similarity to all
members of the cluster (Aldenderfer & Blashfield, 1984, p. 40). This method
tends to produce the opposite of the single linkage, namely very tight clusters
with high within-group similarity. However, relatively similar objects may stay
in different clusters for a long time, creating the opposite problem from that of
single linkage.
The between-groups linkage or average linkage method was developed to find a
compromise between the two extremes of the single and complete linkage meth-
ods. It calculates the average of all possible distances between all pairs of cases in
Cluster A and all pairs of cases in Cluster B and combines the two clusters if a
given level of similarity is achieved (Aldenderfer & Blashfield, 1984, pp. 40–41;
Norušis, 2011, p. 373).
While between-groups linkage uses pairs of cases, within-groups linkage adds an
additional consideration of the average measure of all possible pairs of cases in a
resulting cluster (Norušis, 2011, p. 373).
The centroid method uses the distance between the centroid for the cases in
Cluster A and the centroid for cases in Cluster B to measure dissimilarity. The
distance between two clusters is the sum of distances between cluster means for
all of the variables. When a new cluster is formed, the new centroid is a weighted
combination of the two clusters that have been merged (Norušis, 2011, p. 373).
Median clustering is similar to the centroid method but there is no weighting of
the combination of centroids when clusters are merged (Norušis, 2011, p. 373).
Finally, Ward’s method measures dissimilarity between clusters in relation to
the “loss of information” or increase in the error sum of squares by joining two
clusters (Aldenderfer & Blashfield, 1984, p. 43). In practice, the choice of similar-
ity measure usually has only minor consequences for applications in applied lin-
guistics. As noted earlier,Ward’s method is most commonly used, and we illustrate
its application next.
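For readers who want to see the logic behind Ward's criterion rather than the software dialogs, it can be sketched in a few lines of Python. This is an illustrative toy implementation under invented data (real analyses should use SPSS, R, or a comparable package): at each stage, the two clusters whose union gives the smallest increase in the error sum of squares are merged.

```python
# Minimal agglomerative clustering with Ward's criterion: merge the pair
# of clusters whose union yields the smallest increase in within-cluster
# sum of squares. Illustrative only; O(n^3) and not for real data sets.

def sse(points):
    """Within-cluster sum of squared deviations from the centroid."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))

def ward_cluster(points, n_clusters):
    clusters = [[p] for p in points]          # start: every case is a cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                increase = (sse(clusters[i] + clusters[j])
                            - sse(clusters[i]) - sse(clusters[j]))
                if best is None or increase < best[0]:
                    best = (increase, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters

# Two well-separated groups of invented 2-D cases
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
for c in ward_cluster(data, 2):
    print(sorted(c))
```

The increase computed at each merge is exactly the quantity Ward's method minimizes, and its running total corresponds to the coefficient SPSS reports in the agglomeration schedule.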
Cluster Analysis 253
variables. As Figure 11.10 shows, there are a number of options for transforming
variables. Z scores are one common method of standardization. In our case study,
the variables have already been transformed (using z scores) for the factor analysis,
and thus we do not need to standardize.
Click Continue to exit this menu, and then OK to run the HCA function.
For this, we examine the agglomeration schedule in the output. The agglom-
eration schedule generally displays the cases or clusters combined at each stage,
the distances between the clusters being combined (the coefficients column, our
main focus), and the next stage at which the cluster joins another cluster. Note
that when using Ward’s method, the coefficient is actually the within-cluster sum
of squares at that step. That is why the values may be much larger than those
found for other measures. Figure 11.12 shows a truncated version of the agglom-
eration schedule from the SPSS output. Note that the total number of stages cor-
responds to one less than the number of cases in the data set.
The agglomeration schedule shows the step-by-step output for clustering
cases. As noted, the procedure begins with each case representing a separate clus-
ter. At Stage 1, two of these cases (Case 887 and Case 894) are clustered together.
The resulting within-cluster sum of squares is .035. Neither of the two cases has been previously clustered, so the “stage cluster first appears” value is 0 for both clusters.
FIGURE 11.12 Truncated agglomeration schedule for 947 cases in the data set
TABLE 11.1 Reformatted fusion coefficients for final six clusters formed
[Graph: distance between fusion coefficients (y-axis, 0 to 60,000) plotted against the number of clusters (x-axis, 1 to 6).]
The graph in Figure 11.13 can be interpreted much like a scree plot used in
factor analysis, in that we are looking for the number of clusters where the dif-
ference in coefficients starts to flatten out. As discussed earlier, this flattening out
indicates that not much new information is gained by adding more clusters. In
the present study, this flattening out occurs at the point at which three clusters
are created. However, this measure is only one indication of the optimal num-
ber of clusters. The next step is thus to investigate the information gained by a
four-cluster solution and the information lost by a two-cluster solution, to deter-
mine the optimal number of clusters for interpretation.
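The “flattening out” heuristic can be sketched numerically: take the last few fusion coefficients from the linkage matrix and look for the largest jump between successive values. A toy example with deterministic, made-up data containing three well-separated groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: three well-separated groups of ten cases each,
# scored on four dimensions, with a little deterministic jitter.
X = np.array([[c + 0.01 * j] * 4 for c in (-3.0, 0.0, 3.0) for j in range(10)])

Z = linkage(X, method="ward")

# The last six merge distances correspond to moving from seven clusters
# down to one; the largest jump between successive coefficients marks
# the merge at which distinct groups start being forced together.
coef = Z[-6:, 2]
diffs = np.diff(coef)
n_clusters = len(coef) - int(np.argmax(diffs))
print(n_clusters)  # 3 for this toy data
```

This is only a rough automation of the visual inspection described above; as the chapter stresses, the jump in coefficients is one piece of evidence, not a decision rule.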
columns called CLU4_1, CLU3_1, and CLU2_1. Each column presents data for
a different cluster solution, providing the cluster membership for each of the cases
in that solution. For example, the column CLU4_1 identifies cluster membership
for the four-cluster solution.
260 Shelley Staples and Douglas Biber
Notice the case highlighted by the arrow in Figure 11.16. We can see that,
depending on the cluster solution, a particular case may fall into different clus-
ter memberships. Reading left to right, in the four-cluster solution (CLU4_1),
the case was placed into Cluster 4; in the three-cluster solution (CLU3_1), it
was grouped into Cluster 3; and in the two-cluster solution (CLU2_1) it is in
Cluster 1.
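The membership columns that SPSS saves can be reproduced by cutting the clustering tree at different numbers of clusters. A sketch with made-up scores, using SciPy’s `fcluster` to mirror the CLU4_1, CLU3_1, and CLU2_1 variables:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 4))  # hypothetical dimension scores

Z = linkage(X, method="ward")

# One membership column per cluster solution, mirroring SPSS's
# CLU4_1, CLU3_1, and CLU2_1 variables.
members = pd.DataFrame(
    {f"CLU{k}_1": fcluster(Z, t=k, criterion="maxclust") for k in (4, 3, 2)}
)
print(members.head())
```

Because all three solutions come from cutting the same tree, reading a row left to right shows how a given case’s membership changes as clusters are merged, just as in the SPSS output discussed above.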
Note that SPSS automatically adds a label for each of these three additional
variables, and all three are labeled “Ward Method” (since we used that method for
all three solutions). However, this labeling will be confusing in our output, so we
recommend renaming the labels to reflect the new variable names. We relabeled
our variables “2-cluster solution,” “3-cluster solution,” and “4-cluster solution”
(see Figure 11.17).
convenient way to achieve these goals is to run a one-way ANOVA for each of
the cluster solutions.
In SPSS, select Analyze > Compare Means > One-Way ANOVA. To
analyze the linguistic characteristics of the clusters in the two-cluster solution,
choose the four factors for the Dependent List, and choose 2-cluster solution as the
independent categorical variable (the Factor in the ANOVA). This will allow us
to compare the mean scores of the four factor scores for each of the two clusters
(see Figure 11.18).
Under Options, select Descriptives so we can see the mean differences in the
factor scores (the dependent variables) according to the cluster categories (see
Figure 11.19).
Click Continue, then OK.
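The same comparison can be run outside SPSS. A minimal sketch using `scipy.stats.f_oneway` with made-up Factor 1 scores for the texts in each of two hypothetical clusters:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Hypothetical Factor 1 scores, grouped by two-cluster membership.
factor1_cluster1 = rng.normal(0.5, 1.0, size=40)
factor1_cluster2 = rng.normal(-0.5, 1.0, size=40)

# One-way ANOVA with cluster membership as the grouping factor,
# paralleling SPSS's Analyze > Compare Means > One-Way ANOVA.
F, p = f_oneway(factor1_cluster1, factor1_cluster2)
print(F, p)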
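The same comparison can be run outside SPSS. A minimal sketch using `scipy.stats.f_oneway` with made-up Factor 1 scores for the texts in each of two hypothetical clusters:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Hypothetical Factor 1 scores, grouped by two-cluster membership.
factor1_cluster1 = rng.normal(0.5, 1.0, size=40)
factor1_cluster2 = rng.normal(-0.5, 1.0, size=40)

# One-way ANOVA with cluster membership as the grouping factor,
# paralleling SPSS's Analyze > Compare Means > One-Way ANOVA.
F, p = f_oneway(factor1_cluster1, factor1_cluster2)
print(F, p)
```

For each of the four factors, one such ANOVA is run per cluster solution; with only two groups, the F statistic is equivalent to the square of an independent-samples t statistic.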
In Table 11.2, we see that there are significant mean differences for all four of
the factors in the two-cluster solution.
The descriptive statistics and means plots (not shown) also indicate that Fac-
tors 1 and 2 are significantly higher for Cluster 1 than for Cluster 2 while Factors
3 and 4 are significantly higher for Cluster 2 than for Cluster 1. The interpreta-
tion of this trend can be found in Case Study 2, which provides a summary of the
linguistic findings from the study conducted for this analysis.
The same procedure is followed for the three- and four-cluster solutions. The
only change in the procedure is to select the variable 3-cluster solution for the
three-cluster solution and the variable 4-cluster solution for the four-cluster solution.
For the three-cluster solution, we again find that the mean differences in Fac-
tor scores are significantly different, as shown in Table 11.3. The table also shows
that the mean scores are different for each of the three clusters, except for Factor 3,
for which Clusters 1 and 2 show similar scores. The specific details and interpre-
tation of these trends based on the linguistic variables in the particular factors is
discussed in Case Study 2. However, we can see that the three-cluster solution
differentiates the cases further than found in the two-cluster solution.
TABLE 11.2 Means and standard deviations for the two-cluster solution
Factor Cluster N M SD
TABLE 11.3 Means and standard deviations for the three-cluster solution
Factor Cluster N M SD
Similarly, for the four-cluster solution, the mean differences in Factor scores
are significantly different for the four clusters, as shown in Table 11.4. That table
also shows that the mean scores are different for each of the four clusters. The
specific details and interpretation of these trends based on the linguistic variables
within the particular factors is discussed in Case Study 2. However, we can see
that the four-cluster solution differentiates the cases further than found in the
three-cluster solution.
TABLE 11.4 Means and standard deviations for the four-cluster solution
Factor Cluster N M SD
that this process is not straightforward but relies on the interpretation of the
researcher. To do this, we can use any of the a priori categorical variables available
in the data set (the outside criterion variables), to see how they correspond to the
new categories determined by the cluster analysis. Thus, in the present study, we
can investigate the correspondence between cluster membership and the original
categorical variables of task type and proficiency score level.
We will first look at the relation between cluster membership and task type
(independent or integrated). Using the Crosstabs function, you should select
Analyze > Descriptive Statistics > Crosstabs and select 2-cluster solution as the row and
task type as the column.
The resulting output for the two-cluster solution shows us that the indepen-
dent tasks (ind) are fairly evenly split between the two clusters while Cluster 2
contains predominantly integrated tasks (int) (see Figure 11.21).
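The same crosstabulation can be reproduced outside SPSS with `pandas.crosstab`. A small sketch with made-up memberships and task labels for eight texts:

```python
import pandas as pd

# Hypothetical cluster memberships and task types for eight texts.
df = pd.DataFrame({
    "cluster2": [1, 1, 1, 2, 2, 2, 1, 2],
    "task":     ["ind", "ind", "int", "int", "int", "int", "ind", "int"],
})

# Equivalent of SPSS's Crosstabs: rows = cluster, columns = task type,
# with row/column totals added as margins.
table = pd.crosstab(df["cluster2"], df["task"], margins=True)
print(table)
```

Reading across a row shows how the texts in one cluster distribute over the task types, which is exactly the comparison made in Figures 11.21 through 11.23.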
In the three-cluster solution (Figure 11.22), independent tasks are still divided
up in the same way between the first two clusters. However, the integrated tasks
that had been grouped on the first cluster (in the two-cluster solution) have now
been split, with 145 of those texts now comprising the new third cluster.
The four-cluster solution (see Figure 11.23) mostly affects the composition of
the first cluster in the three-cluster solution. The integrated tasks on that cluster
are now split, so that 229 of those texts comprise the new Cluster 3. In the result-
ing solution, there are now two clusters consisting mostly of two different types of
independent task texts, and two clusters consisting mostly of two different types
of integrated task texts.
FIGURE 11.21 Cluster membership by task type for the two-cluster solution
FIGURE 11.22 Cluster membership by task type for the three-cluster solution
FIGURE 11.23 Cluster membership by task type for the four-cluster solution
The same techniques can be used to investigate the relationship between clus-
ter membership and proficiency level (measured in terms of TOEFL iBT score
levels: 1.0–5.0 with .5 increments).
Figure 11.24 shows that there is no clear relationship between score level and
cluster membership for the two-cluster solution. Both the lower scorers (1.0–1.5)
and the higher scorers (3.5–5.0) are grouped primarily in Cluster 1 (in fact, every
score level falls predominantly into Cluster 1). The low-to-middle score levels
show a somewhat greater proportion of membership in Cluster 2, but the difference
is not very meaningful. The same lack of pattern can be seen for the three- and
four-cluster solutions, as Figures 11.25 and 11.26 show.
In the final analysis, you should keep in mind that HCA is an exploratory
technique. Plotting the fusion coefficients is the first step to determining the
number of clusters that you will select for interpretation. However, the goal of the
analysis is to uncover groups and patterns that had not been previously anticipated
(rather than hypothesis testing). For this reason, it is important to investigate a
range of cluster solutions, choosing the one that is most informative. Two types
of descriptive information are especially useful for this purpose: investigating the
composition of each cluster (i.e., the cases that have been grouped into the clus-
ters), and investigating the mean scores of the predictor variables for each cluster
(in this example study, the means of the linguistic dimension scores).
In some cases, the composition of two clusters might appear to be very similar
with respect to external criteria, but the cluster analysis shows that they are dis-
tinct groups in terms of the predictor variables. For example, in the four-cluster
solution examined earlier, Cluster 2 and Cluster 3 are fairly similar in their com-
position based on outside criterion variables—see Figure 11.23 and Figure 11.26.
However, it turns out that these two clusters are distinct in terms of their perfor-
mance on particular predictor variables: Cluster 2 had quite low scores on Factor 1
while Cluster 3 had more moderate scores (see Table 11.4). Thus, the four-cluster
solution might be the most informative one for our exploratory purposes, even
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Total
2-cluster solution 1 38 27 58 70 106 111 98 88 89 685
2 2 7 29 36 83 38 30 27 10 262
Total 40 34 87 106 189 149 128 115 99 947
FIGURE 11.24 Cluster membership by score level for the two-cluster solution
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Total
3-cluster solution 1 32 23 45 58 50 94 71 67 60 540
2 2 7 29 36 83 38 30 27 10 262
3 6 4 13 12 16 17 27 21 29 145
Total 40 34 87 106 189 149 128 115 99 947
FIGURE 11.25 Cluster membership by score level for the three-cluster solution
1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 Total
4-cluster solution 1 29 22 32 31 47 45 28 26 23 283
2 2 7 29 36 83 38 30 27 10 262
3 3 1 13 27 43 49 43 41 37 257
4 6 4 13 12 16 17 27 21 29 145
Total 40 34 87 106 189 149 128 115 99 947
FIGURE 11.26 Cluster membership by score level for the four-cluster solution
SAMPLE STUDY 1
Yamamori, K., Isoda, T., Hiromori, T., & Oxford, R. (2003). Using cluster analysis to
uncover L2 learner differences in strategy use, will to learn, and achievement over
time. International Review of Applied Linguistics, 41, 381–409.
Background
Yamamori et al. (2003) investigate the strategies and motivational profiles of
groups of learners in relation to their language achievement. As they indicate
in their study, the motivation to investigate achievement in this way stems
from previous research suggesting “there can be more than one route to
success in L2 learning” (p. 382).
Method
The data in this study consisted of survey and achievement test scores from
81 Japanese beginning learners of English as a foreign language, all in seventh
grade. A total of nine measures (three measures, each collected at the end of
three consecutive terms) were used in a cluster analysis to group participants.
The three measures included (1) a strategy inventory consisting
of five Likert scale survey items (e.g., “I use the dictionary”); (2) a measure of
the will to learn captured by four Likert scale survey items (e.g., “I want to be
good at English”); and (3) end-of-term achievement test scores.
SAMPLE STUDY 2
Examining linguistic profiles of L2 writing based on task type and proficiency
level
Background
Previous research on L2 writing has examined the relationship of profi-
ciency level and task type with linguistic characteristics used by L2 writers
using ANOVA and mixed factorial models, among other statistical analyses
(e.g., Biber & Gray, 2013; Cumming et al., 2006; Way, Joiner, & Seaman,
2000). While relationships have been shown between proficiency level and
linguistic features as well as task type and linguistic features, a great deal
of variation in the use of linguistic features within proficiency level in par-
ticular has been noted (e.g., Biber & Gray, 2013). In addition, Jarvis et al.
(2003) show that high proficiency learners may have multiple linguistic
profiles. Thus, cluster analysis was identified as a useful approach to explore
variability in the use of linguistic features in relation to proficiency level and
task type.
Method
This study uses the same data described in the “Procedures for Conducting
Cluster Analysis.” We examined data from 947 responses to writing prompts
on the TOEFL iBT. This data had previously been analyzed for relationships
between linguistic features, task type (independent vs. integrated), and
score level on the iBT (see Biber & Gray, 2013; Biber, Gray, & Staples, 2014).
Although relationships were found between both task type and proficiency
level and linguistic features, it was also revealed that there was variation in
the use of linguistic features across these two domains. Because linguistic
features are known to co-occur and correlate statistically with each other, a
wide range of linguistic features (e.g., personal pronouns, dependent clause
types) were first subjected to a factor analysis to identify underlying dimen-
sions of language use (see Biber & Gray, 2013 for a description of the linguis-
tic features included in the factor analysis).
Four dimensions of language use were identified from the factor analy-
sis: (1) literate versus oral responses (e.g., higher use of nouns vs. higher
use of verbs); (2) information source: text vs. personal experience (e.g.,
third-person pronouns vs. first- and second-person pronouns); (3) abstract
opinion vs. concrete description/summary (e.g., nominalizations vs. concrete
nouns); and (4) personal narration (e.g., higher use of past-tense verbs).
The standardized dimension scores for each of these four dimensions were
used to cluster the texts.
Tools and Resources
Crawley, M. J. (2007). Tree models. In The R book. Chichester, UK: John Wiley &
Sons, Ltd.
Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster analysis (4th ed.). Chichester, UK: John
Wiley & Sons, Ltd.
Gries, S. Th. (2006). Exploring variability within and between corpora: Some methodo-
logical considerations. Corpora, 1(2), 109–151.
Hair, J. F., & Black, W. C. (2000). Cluster analysis. In L. G. Grimm & P. R. Yarnold, Reading
and understanding more multivariate statistics. Washington, DC: American Psychological
Association.
Johnson, R. A. & Wichern, D. W. (2007). Applied multivariate statistical analysis (6th ed.).
Chapter 12: Clustering, distance methods, and ordination. Upper Saddle River, NJ:
Pearson Education.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding groups in data: An introduction to cluster
analysis. New York: John Wiley & Sons.
Lorr, M. (1983). Cluster analysis for the social sciences. San Francisco, CA: Jossey-Bass.
Further Reading
Gayle, G. (1984). Effective second-language teaching styles. The Canadian Modern Lan-
guage Review, 40(4), 525–541.
Hayes, E. (1989). Hispanic adults and ESL programs: Barriers to participation. TESOL
Quarterly, 23(1), 47–63.
Hill, D. (1992). Cluster analysis and the interlanguage lexicon. Edinburgh Working Papers in
Applied Linguistics, 3, 67–77.
Huang, H. T. (2010). How does second language vocabulary grow over time? A multi-methodological
study of incremental vocabulary knowledge development. Unpublished dissertation. Univer-
sity of Hawai’i, Manoa, HI.
Kang, O., Rubin, D., & Pickering, L. (2010). Suprasegmental measures of accentedness
and judgments of language learner proficiency in oral English. The Modern Language
Journal, 94(4), 554–566.
Lee, J. (2012). The implications of choosing a type of quantitative analysis in interlan-
guage research. Linguistic Research, 29(1), 157–172.
Philp, J. (2009). Pathways to proficiency: Learning experiences and attainment in implicit
and explicit knowledge of English as a Second Language. In R. Ellis, S. Loewen, C.
Elder, R. Erlam, J. Philp, & H. Reinders, Implicit and explicit knowledge in second language
learning, testing, and teaching (pp. 194–215). Tonawanda, NY: Multilingual Matters.
Ranta, L. (2002). The role of learners’ language analytic ability in the communica-
tive classroom. In P. Robinson, Individual differences and instructed language learning
(pp. 159–180). Philadelphia: John Benjamins.
Rysiewicz, J. (2008). Cognitive profiles of (un)successful FL learners: A cluster analytical
study. The Modern Language Journal, 92(1), 87–99.
Shochi, T., Rillard, A., Auberge, V., & Erickson, D. (2009). Intercultural perception of
English, French, and Japanese social affective prosody. In S. Hancil, The role of prosody
in affective speech (pp. 31–60). New York: Peter Lang.
Uchikoshi, U., & Marinova-Todd, S. (2012). Language proficiency and early literacy skills
of Cantonese-speaking English language learners in the U.S. and Canada. Reading and
Writing: An Interdisciplinary Journal, 25, 2107–2129.
Yashima, T. & Zenuk-Nishide, L. (2008). The impact of learning contexts on proficiency,
attitudes, and L2 communication: Creating an imagined international community.
System, 36, 566–585.
Discussion Questions
1. We have emphasized the importance of researcher expertise in making sense
of cluster analytic output and results. Is cluster analysis unique in this regard?
Why? Why not?
2. This chapter has shown that the process of carrying out a cluster analysis
often involves the use of other statistical analyses. Explain in your own words
how the following analyses might be used in conjunction with cluster analy-
sis: ANOVA, data transformation, cross-tabs (or chi-square), factor analysis,
discriminant function analysis, correlation. Now examine a few of the cluster
analytic studies listed under Further Reading. Which analyses did they use
along with their cluster analysis? To what ends?
3. Other than the example studies described in this chapter, what types of
research questions or situations can you think of in which cluster analysis
might be a useful approach?
4. Cluster analysis is often contrasted with both factor analysis (see Loewen &
Gonulal, Chapter 9 in this volume) and discriminant function analysis (see
Norris, Chapter 13 in this volume). In what ways are these two procedures
similar to cluster analysis? In what ways are they different?
5. The authors of Sample Study 1 recommend different types of strategy
instruction based on the four distinct learner profiles or clusters obtained
through their analysis. What kinds of interventions do you think might be
most effective with each group? Can you think of other cases where the
results of a cluster analysis could inform L2 pedagogy, assessment, or policy?
Notes
1. Another approach involves combining HCA and K-means clustering. First, HCA is
used on a smaller sample of the data, to determine the optimal number of clusters, and
then the researcher runs a K-means analysis on the full data set, specifying that number
of clusters.
2. Cluster analysis can also be used to group together variables instead of cases; examples
of this type include Kang, Rubin, & Pickering (2010) and Lee (2012).
3. In SAS, it is also possible to produce goodness of fit measures, which can be used in the
process of deciding on a “stopping” point for cluster solutions.
References
Aldenderfer, M. S. & Blashfield, R. K. (1984). Cluster analysis. Thousand Oaks, CA: Sage.
Biber, D. (1989). A typology of English texts. Linguistics, 27, 3–43.
Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Chapter 9: Reg-
isters and text types in English and Somali. Cambridge: Cambridge University Press.
Biber, D. (2008). Corpus-based analyses of discourse: Dimensions of variation in con-
versation. In V. K. Bhatia, J. Flowerdew, & R.H. Jones. Advances in discourse studies
(pp. 100–114). New York: Routledge.
Biber, D., & Finegan, E. (1989). Styles of stance in English: Lexical and grammatical mark-
ing of evidentiality and affect. Text, 9, 93–124.
Biber, D., & Gray, B. (2013). Discourse characteristics of writing and speaking responses on the
TOEFL iBT. Princeton, NJ: Educational Testing Service.
Biber, D., Gray, B., & Staples, S. (2014, advance access). Predicting patterns of grammatical
complexity across textual task types and proficiency levels. Applied Linguistics.
Csizer, K., & Dörnyei, Z. (2005). Language learners’ motivational profiles and their moti-
vated learning behavior. Language Learning, 55(4), 613–659.
Introduction
The use of Rasch measurement in second language (L2) research has grown sig-
nificantly in the past decade, in particular in the area of language testing (see, e.g.,
McNamara & Knoch, 2012). The current chapter introduces the basic concepts
of Rasch analysis. It will start by providing the conceptual motivation for using
techniques from the Rasch family of models and then provide a guide on how to
use four different Rasch models: the simple Rasch model, the rating scale model,
the partial credit model and the many-facet Rasch model. Readers will learn how
to choose the most appropriate model and how to interpret key output tables
from Rasch analyses. At the end of the chapter, we will describe some of the tools
and resources available as well as further readings on the topic of Rasch analysis.
Background
Why Rasch?
The Rasch family of models, a subset of a larger group of models known as item
response theory (IRT), is becoming more popular as a way of analyzing data col-
lected in L2 research. Rasch analysis found its way into L2 research through its
gradual adoption by language testers (McNamara & Knoch 2012) and has since
spread into other areas of the field. For language testers in particular, this approach
to measurement has provided a powerful new way of generalizing from a person’s
performance on a test to statements of underlying ability.
There are several reasons why Rasch analysis is appealing to researchers
involved in L2 research. For language testers, for example, Rasch provides a more
powerful way of analyzing test data than can be achieved by using more traditional
276 Ute Knoch and Tim McNamara
techniques such as those provided by classical test theory (CTT) (see, e.g.,
Eckes, 2011; Wright, 1992). Both sets of techniques (CTT and IRT) are used to
analyze test data to gain a thorough understanding of the performance of test
items, the ability of test takers, and the performance of the measurement instru-
ment as a whole. The data used to analyze the test or instrument commonly come
from a specific population of learners from a certain context. In the case of a CTT
analysis, results might differ if the test or instrument is administered to a different
group of learners and therefore need to be interpreted differently. However, IRT
and Rasch analyses take this sample dependency into account. The models enable
an estimate of test takers’ underlying ability on the basis of their performance on
a particular set of items by making allowance for the difficulty of items and how
well they are matched with the candidates’ ability. The crucial element here is
how items are related to candidate ability, which is not the case in CTT. This dif-
ference between CTT and Rasch (and all IRT models) has been compared to the
difference between descriptive and inferential statistics (e.g., McNamara, 1996) as
the results from a Rasch analysis can be generalized beyond the sample.
Another benefit lies in the fact that the Rasch model can be applied to a wide
variety of data types. While the simple Rasch model could only be used to analyze
dichotomously scored items (e.g., multiple-choice items), extensions of this model
developed in the late 1970s and early 1980s could also handle data from polytomous items,
semantic differential scales, rating scales, as well as data scored by human raters (e.g.,
in the assessment of speaking and writing). When writing or speaking assessment
data is analyzed using the Rasch model, the system can provide powerful estimates
of rater quality, which has been very helpful for language assessors, in particular since
the increased interest in collecting performance data following the communicative
movement in the early 1980s. A further data type that lends itself to a Rasch analy-
sis is that of questionnaires that are usually analyzed using more traditional meth-
ods (including reporting descriptive statistics and making use of factor analyses, see
Loewen & Gonulal, Chapter 9 in this volume). As we will see in this chapter, Rasch
analysis offers a powerful new way of analyzing such data. Further, for L2 researchers
interested in learner progress or development, Rasch analysis also makes it possible
to define the ability of learners on a single ability scale that links all tasks and learn-
ers. In this way, progress can be shown and different preexisting scales can be linked.
In sum, Rasch analysis offers a powerful, comprehensive way to analyze a
variety of data types and can be used to answer a variety of questions posed in L2
research. Rasch analysis is also rather forgiving in its data requirements and can
handle missing data relatively well, which is a major advantage in the real world
of research.
Likert scale questionnaires). As is the case with the rating scale model, the partial
credit model assumes that all raters are applying the scale in the same way. This
problem was addressed in a further development, the many-facet Rasch model.
TABLE 12.1 Data type, response formats, Rasch models, and programs
The data design for a Rasch analysis offers more flexibility than designs
required for a CTT analysis: Crossed, nested, and mixed designs as well as missing
data can be accommodated (see, e.g., Schumacker, 1999 for a detailed discussion).
A further useful discussion of the data requirements for a Rasch analysis in terms
of sample size can be found in Linacre (1994).
TABLE 12.2 Data input format for analyses not involving multiple raters
If each learner’s performance has been rated by more than one rater, the data
should be set out as in the example in Table 12.3. Here, the ratings for each rater
are listed in separate rows.
TABLE 12.3 Data input format for analyses involving multiple raters
Opening Winsteps
1. Select the Winsteps icon on your desktop.
2. Close the smaller Winsteps Welcome window.
3. Copy the person labels under the red heading “Person Label Variables” and
copy the item variables under the heading “Item Response Variables.” Your
window should now look something like this:
Rasch Analysis 283
4. Locate and double-click your Winsteps input file (usually a .txt file).
5. Press the Enter key twice.
Once a data file has been created, this can be read into Facets. Click the Facets
icon on the computer to start the program. Then, select Files > Specification
File Name? and choose the Facets input file from the location where it was saved.
Then click Open and OK.
Note: M = mean; S = one standard deviation from the mean; T = two standard deviations from the mean
The items are ordered from the most difficult item at the top (Item 9) to the
easiest at the bottom (Item 10). Candidates with more reading ability are located
near the top of the figure while less able test takers are shown near the bottom.
As the test takers and the reading test items are pictured on the same scale, the
logit scale, it is now possible to make direct comparisons. A test taker placed at
the same logit value as an item has a 50% chance of answering that item cor-
rectly. Test takers mapped higher than an item have a higher than 50% chance of
answering the item correctly. Those mapped lower have less chance of answering
the item correctly (see Wright and Linacre, 1991 for an exact logit to probability
conversion table). The logit scale has the further advantage that it is an interval
scale. Therefore, not only does it tell us that one item is harder than another or
that one candidate is more able than another, but it also gives us a measurement
of how much that difference is.
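The 50% relationship and the logit-to-probability conversion can be made concrete with the dichotomous Rasch model’s response function (a sketch; the ability and difficulty values are made up):

```python
import math

def p_correct(ability, difficulty):
    """Rasch probability of a correct response; both values in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(p_correct(0.0, 0.0))             # person at the item's level: 0.5
print(round(p_correct(1.0, 0.0), 2))   # one logit above the item: 0.73
print(round(p_correct(-1.0, 0.0), 2))  # one logit below the item: 0.27
```

Because only the difference between ability and difficulty enters the formula, persons and items can be placed on the same logit scale, which is exactly what the Wright map displays.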
Apart from descriptive observations about our measurement instrument
(including which students are the most and least able, and which items are the
most and least difficult), the Wright map can provide us with information about
(1) item coverage (i.e., whether there are sufficient items to match the different
ability levels of our students), (2) each individual student’s probability of success
on certain items, and (3) whether the overall difficulty of the items matches the
ability of the test takers and vice versa.
As we will see in Sample Study 1 (Malabonga, Kenyon, Carlo, August, & Lou-
guit, 2008), the authors used the Wright map to guide their evaluation of item
coverage across difficulty levels. Following the pilot administration of the cognate
awareness measure (CAT), they added a group of easier items to more adequately
match the students’ ability.
Person Statistics
The output from a Rasch analysis also provides us with estimates of person ability
and person fit as exemplified for our reading data set in Table 12.4.
The table lists all the test takers in order of ability (Catherine is the most able
and Tim the least able student). It also provides us with their raw score (total
score), the number of items they attempted (total count), their position on the
logit scale (measure), and the standard error associated with this measure (Model
S.E.). The standard error for our data set is large because it is based on a very small
sample (for both items and test takers).
A further feature of a Rasch analysis (which cannot be found in the output of
a CTT analysis) is fit statistics. Rasch analysis is based on a probabilistic model.
It proceeds by comparing expected and observed responses of test takers. Once
complete, the best estimates of person ability (as can be seen in Table 12.4) and
item difficulty (Table 12.5) are displayed. The extent to which the prediction and
observation match is shown in the fit statistics. For both test takers and items,
three types of fit can be found: (1) appropriate fit (the pattern identified by the
program is within a normal range, i.e., as expected), (2) misfit (the pattern does
not correspond to the expected pattern in that it is less predictable), and (3) over-
fit (the patterns found by the program are too predictable).
Appropriate fit values (expressed in Table 12.4 as MNSQ [mean-square])
generally range from 0.8 to 1.3 (McNamara, 1996). (These values can also be
expressed in terms of the normal distribution as z-statistics, where the acceptable
range is +2 to –2). Fit can be calculated in two ways: using all the data, including
outliers (Outfit); or using trimmed data, with the outliers removed (Infit). Infit is
usually preferred. Person fit provides us with the ability to examine whether the
ability of a learner can be defined in the same terms as the ability of others in
the group. If a person is identified as misfitting, it means that his or her ability
has not been captured well by our instrument. For an accessible description of
the differences between the different measures of fit, please refer to McNamara
(1996) or Green (2013); Eckes (2011) provides a discussion of fit in a many-facet
Rasch analysis.
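The distinction between outfit and infit can be sketched with the standard mean-square formulas for dichotomous data (the ability and difficulty values below are made up; real programs such as Winsteps estimate them from the responses):

```python
import numpy as np

def fit_mnsq(responses, abilities, difficulties):
    """Outfit and infit mean-squares per item for dichotomous Rasch data."""
    theta = np.asarray(abilities, dtype=float)[:, None]
    b = np.asarray(difficulties, dtype=float)[None, :]
    p = 1.0 / (1.0 + np.exp(-(theta - b)))        # expected scores
    var = p * (1.0 - p)                           # response variance
    resid2 = (np.asarray(responses, dtype=float) - p) ** 2
    outfit = (resid2 / var).mean(axis=0)          # unweighted: outlier-sensitive
    infit = resid2.sum(axis=0) / var.sum(axis=0)  # variance-weighted
    return outfit, infit

# Two persons (abilities +1 and -1 logits), two items (both at 0 logits).
# The responses match expectation exactly, so both statistics fall below
# the 0.8 lower bound: the data are too predictable (overfit).
outfit, infit = fit_mnsq([[1, 1], [0, 0]], [1.0, -1.0], [0.0, 0.0])
print(outfit.round(2), infit.round(2))
```

In this symmetric toy case the two statistics coincide; they diverge when a response set contains surprising answers far from a person’s level, which inflate outfit more than infit.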
Item Statistics
A further piece of output from a Rasch analysis is a table that provides estimates
of item difficulty and item fit, as can be seen in Table 12.5 for our data set. These
indices mirror the data in Table 12.4 but for items rather than test takers.
The items in this table are arranged according to their position on the
logit scale (measure column), which indicates the degree of difficulty of each
item. In the case of our data set, Item 9 is the most challenging and Item 10
the easiest. We can also see how many test takers answered each item correctly
(total score) and how many attempted each item (total count). As was the case
with the person statistics reported in Table 12.4, the item statistics table also
indicates the standard error relating to each item measure (Model S.E.). These
are unusually high because of the small sample size of this data set. There are
again two columns of fit statistics reported. As was the case with the person
statistics, these can be categorized into three groups: (1) those displaying very high positive values, which are therefore misfitting; (2) those in the middle range, showing appropriate fit values; and (3) those with very low values, which are therefore categorized as overfitting. Misfitting items are ones where the patterns of responses from the test takers do not follow predictions: some able students may have answered the item incorrectly even though they were predicted to answer it correctly, or some less able test takers may have answered it correctly. These items do not add much to our measurement instrument, as they create unwanted noise, and should be revised or discarded.
Item overfit is less of a concern. A detailed discussion of item fit statistics can
be found in McNamara (1996).
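The link between an item's logit measure and the scores test takers obtain can be illustrated with a small sketch. The difficulties and abilities below are hypothetical, chosen only to echo the pattern just described (Item 9 hardest, Item 10 easiest): the model-implied expected total score falls as the item's logit difficulty rises.

```python
import math

def p_correct(theta, b):
    """Dichotomous Rasch model: P(correct) for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical logit measures (higher = harder) and person abilities.
difficulties = {"Item 9": 1.8, "Item 5": 0.2, "Item 10": -2.1}
abilities = [-1.0, -0.3, 0.4, 1.1]

# Model-implied "total score" per item: the expected number of correct answers.
expected_total = {item: sum(p_correct(th, b) for th in abilities)
                  for item, b in difficulties.items()}

# Sorting by expected total puts the hardest item first and the easiest last.
ranked = sorted(expected_total, key=expected_total.get)
```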
SAMPLE STUDY 1
Malabonga, V., Kenyon, D. M., Carlo, M., August, D., & Louguit, M. (2008). Development of a cognate awareness measure for Spanish-speaking English language learners. Language Testing, 25(4), 495–519.
Results
(1) Pilot administration: The results of the Winsteps analysis following the pilot administration showed that the items were not perfectly matched to the children’s ability in that the mean difficulty of the
items was well above the mean ability of the test takers, although
there was an even spread of cognate and noncognate items along the
logit scale. The analysis also identified three items as misfitting. Item difficulties barely varied whether analyzed in the whole data set or in separate sections of cognates and noncognates. Following the analysis,
the authors deleted the misfitting items and added some easier items to
the test.
(2) Operational administration of CAT (fourth grade): The results from the
Winsteps analysis showed an even spread of the two item types along
the scale. The Wright map showed that the items were much better
matched to the children’s ability than the trial instrument but that the
mean difficulty of items was still higher than the mean ability of stu-
dents. The authors argued that this is acceptable as the CAT is designed
for both fourth and fifth graders. The overall fit of the items was accept-
able with 96% of the items fitting the Rasch model.
(3) Operational administration of CAT (fifth grade, one year later): The
Rasch analysis showed that the children’s knowledge of English vocabu-
lary and in particular cognates had increased. The Rasch map showed
that the mean ability of the children was slightly higher than the mean
difficulty level of the test items (reversing the situation in the previous
year). The findings showed that the CAT is of appropriate difficulty for
both fourth and fifth graders: 90.4% of the items fit the Rasch model
and items that were identified as misfitting were usually among the
most difficult.
An item analyzed with the rating scale or partial credit model has more than one possible score point.
An example of such an item can be seen in Table 12.6, which is an extract from
the data collected on a listening test with a number of testlets that attracted more
than one score point. Item 3 in this table has five score points, ranging from 0 to
4. We can see the number (and percentage) of test takers who scored each of these
points in the data count column. A powerful feature of a Rasch analysis is that it
also provides us with information about the average ability of the students at each
score point (i.e., the average location or measure of the students with a certain
score on this item). We expect students who score lower to have less ability and
that the average ability level advances with each score point. In our analysis, this
was generally the case (students at score point 0 for Item 3 were of the lowest
average ability, –1.10) and this ability slowly increased as the score point increased.
However, the one student achieving the highest score point, 4, was no more able
than those achieving 3. This is probably an artifact of the artificially small data
set we are using here. Items in which the average ability does not increase with
increasing score points might need revision.
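The average measure at each score point is simply the mean ability of the test takers who obtained that score. A minimal sketch, with hypothetical logit measures chosen to reproduce the kind of reversal just described:

```python
from collections import defaultdict

def average_measure_by_score(scores, measures):
    """Mean person measure (logit) at each observed score point; in a
    well-functioning item these should rise monotonically."""
    by_score = defaultdict(list)
    for s, m in zip(scores, measures):
        by_score[s].append(m)
    return {s: sum(ms) / len(ms) for s, ms in sorted(by_score.items())}

# Hypothetical data for one 0-4 testlet; the lone score-4 student sits below
# the score-3 students -- the kind of reversal a tiny sample can produce.
scores = [0, 0, 1, 2, 2, 3, 3, 4]
measures = [-1.3, -0.9, -0.4, 0.1, 0.3, 0.8, 1.1, 0.9]
averages = average_measure_by_score(scores, measures)
```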
The information in Table 12.6 is available for each individual item and for the
entire data set, as can be seen in Table 12.7. Here, not only the average ability of the test takers at each score point is shown, but also the Andrich thresholds (also known as step difficulties, tau, or delta). These indicate the point on the ability scale where it is equally
likely that someone of this ability would achieve either of the adjacent score
points. This information can be used in the process of rating scale development
and/or revision as it provides useful information about the width of different scale
categories that can be used when descriptors are refined or revised.
TABLE 12.6 Sample item measurement report for partial credit data

The information in Table 12.6 can also be represented visually as shown in Figure 12.2 (these are known as category characteristic curves). The x-axis indicates the average measure (ability) while the y-axis shows the probability of a response. It can be seen that as the candidate ability increases, the score increases. At the lowest end of ability, as the average measure increases, a score of 0 becomes less and less probable. A score of 1 is only likely at a very narrow band of ability, while a score of 2 is matched to a much broader band of ability. The higher peak of score 2 also indicates that this score is the most probable. The category characteristic curves show visually whether any of the rating scale categories are wider than others or whether any of them are never the most probable. This information might lead a test developer to revise the wording of the descriptors, for example, or to collapse (or expand) scale categories.
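The Andrich thresholds can be made concrete with a small partial-credit-model sketch (hypothetical thresholds for a 0–3 item). Threshold k is defined so that, at exactly that ability, score points k − 1 and k are equally probable:

```python
import math

def pcm_category_probs(theta, thresholds):
    """Partial credit model category probabilities for one item. The k-th
    Andrich threshold is the ability at which score points k-1 and k are
    equally likely."""
    logits, run = [0.0], 0.0
    for delta in thresholds:
        run += theta - delta
        logits.append(run)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical thresholds for a 0-3 item; evaluating at the second threshold
# (0.2) makes score points 1 and 2 equiprobable.
probs = pcm_category_probs(0.2, [-1.5, 0.2, 1.8])
```

Sweeping theta over a range of abilities and plotting each category's probability reproduces the category characteristic curves of Figure 12.2.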
There are two Rasch models that can be used to analyze data in partial credit
or rating scale format when only one rater/marker is involved. The rating scale
model can be used if all items have the same structure and number of score points,
while a partial credit model can be used in all other instances. We will now look
at a special case of rating scale analyses: questionnaires.
Questionnaire Analysis
One of the data types commonly used in L2 research comes from questionnaires. However, as mentioned earlier in this chapter, researchers often do not use
a Rasch analysis to analyze this type of data even though it offers more powerful
tools than other traditional analysis techniques. In this section, we briefly explain
what a Rasch analysis can offer to researchers administering questionnaires.
Imagine we are using a questionnaire to measure a certain construct such as
motivation to learn languages, L2 anxiety, or willingness to communicate (see for
example Sample Study 2). We use a questionnaire with Likert scale items and
administer this to a group of learners. A Rasch analysis can provide us with some
powerful information about our measure. The Wright map can show how well our items are able to tap into the construct (as a whole) or whether some of them are easier to endorse than others (i.e., whether respondents are more likely to select “strongly agree” or “agree” for certain items than for others). The fit statistics can show us whether any of the items are misfitting (i.e., not measuring the overall construct in line with the other items) or whether our items form a unidimensional measurement of the construct. Dimensionality can be established by examining the residuals of the data with a principal components analysis to determine whether there is a common factor that explains the residuals (and points to a multidimensional underlying latent measure; see Loewen & Gonulal, Chapter 9 in this volume) or whether the residuals are just random noise (Linacre, 1998). We can also gather information about the different step difficulties for each item (in the case of Likert scale questions, this can indicate the distance between different response categories—for example, whether the step between “strongly disagree” and “disagree” is much wider than that between two other adjacent categories). Finally, we can examine the category characteristic curves to see whether any of the Likert scale categories are not providing useful information for our measurement (e.g., it might be possible that the category “neutral” is subsumed under other categories). For a detailed
account of using Rasch analysis to analyze questionnaires, refer to Bond and Fox
(2007). Sample Study 2 is an example of how L2 researchers using questionnaires
can make use of Rasch techniques to investigate the quality of their instruments.
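The residual-based dimensionality check can be sketched as follows. This assumes numpy and uses random data purely for illustration; the logic (no dominant component in the residuals means they are essentially noise, while a large first eigenvalue hints at a secondary dimension) follows Linacre (1998), though the exact cut-off varies across sources.

```python
import numpy as np

def residual_pca_eigenvalues(observed, expected):
    """Eigenvalues of the item-by-item correlation matrix of Rasch residuals.

    Random-noise residuals give no dominant component; a large first
    eigenvalue suggests a secondary dimension in the instrument."""
    resid = observed - expected              # persons x items residual matrix
    corr = np.corrcoef(resid, rowvar=False)  # item x item residual correlations
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

# Purely illustrative: "observed" responses vs. a flat model expectation.
rng = np.random.default_rng(0)
observed = rng.random((50, 8))
expected = np.full((50, 8), 0.5)
eigenvalues = residual_pca_eigenvalues(observed, expected)
# The eigenvalues sum to the number of items (the trace of the correlation matrix).
```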
SAMPLE STUDY 2
Weaver, C. (2005). Using the Rasch model to develop a measure of second lan-
guage learners’ willingness to communicate within a language classroom. Journal
of Applied Measurement, 6(4), 396–415.
Background
This study set out to investigate the psychometric properties of a questionnaire designed for an L2 research project on willingness to communicate (WTC).
Rasch Analysis 295
Research Questions
1. How does the rating scale model differ from the partial credit model in
reflecting students’ responses to the WTC questionnaire?
2. How well do the questionnaire items define a useful range of students’
willingness to speak and write English inside the language classroom?
3. How well do the writing and speaking items perform to create a unidi-
mensional measure of students’ WTC in English?
4. How well does the questionnaire’s four-point Likert scale reflect meas-
ureable differences in students’ WTC in English?
Methods
A total of 500 students (232 first year and 268 second year university stu-
dents in an English as a foreign language environment) completed a 34-item
questionnaire designed to measure WTC in both speaking and writing.
Each item was designed in a four-point Likert scale format: 1. Definitely not
willing; 2. Probably not willing; 3. Probably willing; and 4. Definitely willing.
To answer the first research question, the author compared the results of the
analyses using the rating scale model and the partial credit model to evaluate
whether the item structure differed for the different items or whether they
could equally all be modeled together. To answer the second question, the
item fit statistics and the item difficulty of the questionnaire were scrutinized.
To answer the third question, the author investigated the unidimensionality
of the questionnaire by examining the residuals with a principal components
analysis. To answer the fourth question, Weaver undertook a variety of analy-
ses focused on rating scale functioning as outlined by Linacre (1999).
Results
The comparison of the rating scale and partial credit model analyses showed
that the category thresholds were largely consistent across the two models.
Therefore the use of the more parsimonious model, the rating scale model,
is supported. The questionnaire was also found to define a useful range of
students’ WTC. The two groups of items focusing on the respondents’ WTC
in speaking and writing could be distinguished by the analysis of the residu-
als, but Weaver was also able to show that they worked together to form
the larger construct of willingness to communicate. Finally, the monotoni-
cally increasing step difficulties of the four-point scale showed that the scale
worked well to define students’ WTC.
296 Ute Knoch and Tim McNamara
Total  Total  Obsvd  Fair-M                 Infit        Outfit       Estim.  Corr.  Exact Agree.
Score  Count  Avg.   Avg.   Measure  S.E.   MnSq  ZStd   MnSq  ZStd   Discrm  PtBis  Obs %  Exp %  N   Rater
 333    68    4.90   5.09    –.48    .25    1.11    .7   1.14    .8    .86    .48    44.1   56.8   1   1
 317    64    4.95   5.11    –.59    .26    1.23   1.3   1.23   1.2    .72    .41    45.3   56.5   2   2
 428    88    4.86   4.97     .11    .22     .98    .0    .96   –.1   1.03    .50    45.0   59.0   3   3
 434    88    4.93   4.93     .30    .22     .78  –1.5    .79  –1.3   1.25    .50    53.4   58.1   4   4
 384    76    5.05   5.10    –.51    .25    1.04    .2   1.01    .0    .96    .43    53.9   58.4   5   5
 528   104    5.04   4.94     .23    .21     .94   –.4    .98    .0   1.05    .45    59.5   58.3   6   6
 573   112    5.12   4.99    –.01    .20     .75  –2.1    .70  –2.3   1.31    .52    56.7   57.8   7   7
 441    88    5.01   4.94     .28    .22    1.08    .5   1.10    .6    .91    .32    44.3   57.4   8   8
 277    56    4.95   4.97     .11    .27    1.08    .4   1.12    .6    .89    .15    41.7   56.6   9   9
 280    56    4.92   4.88     .55    .29    1.20   1.0   1.23   1.1    .80    .35    40.4   55.4  10  10
 399.5  80.0  4.97   4.99     .00    .24    1.02    .0   1.03    .1           .41    Mean (Count: 10)
  95.4  18.3   .07    .08     .37    .03     .15   1.1    .17   1.1           .11    S.D. (Population)
 100.6  19.3   .08    .08     .39    .03     .16   1.1    .18   1.1           .11    S.D. (Sample)
raters. Raters with very high fit statistics (usually infit mean-square values above 1.3) are considered to be misfitting. That means that their rating patterns do not fall within the range that the program predicts. This usually points to raters
who are rating inconsistently. These raters are not adding meaningful informa-
tion to the measurement of these students and should therefore be required to
undergo standardization training. Raters with very low infit mean-square values
(usually with infit mean-square values below 0.8) are rating more predictably than
the program predicts. This could point to raters who are overusing certain band
levels on the rating scale and therefore not displaying the kind of expected varia-
tion across test takers. A detailed, accessible discussion of the influence of different raters on raw scores can be found in McNamara (1996, Chapter 5).
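The screening logic described above reduces to a simple classification by infit mean-square. The rater values below are hypothetical, loosely echoing a Facets rater measurement report:

```python
def classify_raters(infit_mnsq, low=0.8, high=1.3):
    """Group raters by infit mean-square: misfitting (inconsistent), fitting,
    or overfitting (too predictable, e.g., overusing central band levels)."""
    out = {"misfit": [], "fit": [], "overfit": []}
    for rater, mnsq in infit_mnsq.items():
        if mnsq > high:
            out["misfit"].append(rater)
        elif mnsq < low:
            out["overfit"].append(rater)
        else:
            out["fit"].append(rater)
    return out

# Hypothetical infit mean-squares for five raters.
report = {"R1": 1.11, "R2": 1.42, "R4": 0.78, "R7": 0.75, "R8": 1.08}
groups = classify_raters(report)
# R2 would be referred for standardization training; R4 and R7 rate too predictably.
```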
Facets also produces a reliability index as part of the rater measurement report.
It is important to note that this is not interpreted in the same manner as tradi-
tional rater reliability indices. The Rasch reliability index on the rater measure-
ment report is interpreted in the opposite way, in that low reliability indices are
desirable. These indicate that the raters are rating with very similar degrees of severity.
The many-facet Rasch model also reports a score for each test taker that
takes into account the different facets in an analysis. For example, if the analysis
has identified that a test taker was rated by harsh raters or encountered difficult
tasks, this is accounted for in the “Fair-M Average.”
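The idea behind the fair average can be sketched with a many-facet rating scale model: the expected rating is recomputed with the rater severity term set to zero (average severity). All logit values below are hypothetical, and this illustrates the principle rather than the Facets computation, which works on the estimated measures themselves:

```python
import math

def expected_score(theta, difficulty, severity, thresholds):
    """Expected rating under a many-facet rating scale model, where the logit
    for step k is theta - difficulty - severity - tau_k."""
    logits, run = [0.0], 0.0
    for tau in thresholds:
        run += theta - difficulty - severity - tau
        logits.append(run)
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return sum(k * e / total for k, e in enumerate(exps))

thresholds = [-1.0, 0.0, 1.0]   # hypothetical steps for a 0-3 rating scale
theta, task_difficulty = 0.5, 0.2

observed_like = expected_score(theta, task_difficulty, severity=0.8, thresholds=thresholds)
fair_like = expected_score(theta, task_difficulty, severity=0.0, thresholds=thresholds)
# A harsh rater (positive severity) depresses the expected rating; the "fair"
# value removes that effect, as the Fair-M Average does on the raw-score scale.
```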
Finally, the many-facet Rasch model also makes it possible to model interactions between different facets. For example, it is possible to explore whether particular raters have certain patterns of interaction with certain rating criteria (e.g.,
always rating more harshly than expected when assessing content) or whether the
background of the students influences the rater’s assessment. This is called a bias
analysis and was one aspect investigated by the authors in Sample Study 3 (Elder,
Knoch, Barkhuizen, & von Randow, 2005).
SAMPLE STUDY 3
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback
to enhance rater training: Does it work? Language Assessment Quarterly, 2(3),
175–196.
Background
This study set out to investigate whether providing raters with detailed indi-
vidualized feedback on their rating performance is effective. The purpose of
the feedback was to enhance the reliability of scores on a writing assessment
for undergraduate students in an English-medium university.
Research questions
1. Does individualized feedback reduce interrater differences in overall
severity?
2. Does individualized feedback make individual raters more internally
consistent in their judgments?
3. Does individualized feedback reduce individual biases in relation to the
scoring of particular categories on the rating scale?
Method
Eight experienced writing raters rated 50 writing samples each. The research-
ers then used a many-facet Rasch analysis to generate individualized feed-
back profiles, which included feedback on the raters’ (1) relative severity in
relation to the group of raters, (2) overall consistency, and (3) patterns of
bias in relation to particular rating scale categories. The raters then rated a
further 60 writing samples. A subsequent Rasch analysis was undertaken to
investigate whether the feedback helped raters to rate more consistently,
reduce any patterns of harshness or leniency, and reduce individual biases in
relation to scale criteria.
Results
The results showed that some raters were able to successfully take on the
feedback in their subsequent ratings but that there was large variation
among raters in terms of their receptiveness. The raters varied less in terms of
their severity in the post-feedback rating round, but this was at the expense
of the overall discrimination power of the test. The authors therefore argued
that the costs of implementing this rather labor-intensive feedback may outweigh the benefits.
Conclusion
Rasch analysis has enormous potential to be used in L2 research. It has a number
of strengths: Its estimates of the characteristics of subjects relevant to the research
are likely to be robust and stable, as they factor in the quality of the data on which
they are based; it allows the linking of separate measurement instruments (e.g.,
tests) so that “before” and “after” testing is not subject to the idiosyncrasies of
the tests used in each case, and test familiarity effects are avoided; and it allows
detailed analysis of the impact of the quality of judges or raters and other aspects
of the data-gathering setting on measures used in the research. The examples
presented in this chapter show some of the range of research questions that can be addressed.
For a full list of possible Rasch analyses software, please refer to the Rasch
Measurement Analysis Software Directory (http://www.rasch.org/software.htm).
The different options are listed in a helpful table that outlines where they can be
obtained, whether they are free, and which models they support. Many of the programs offer free trial or student versions.
Other Resources
Further useful information about Rasch analysis and answers to questions can be
obtained by joining a discussion list. The two most well-known listservs are the
Mathilda Bay Club (http://www2.wu-wien.ac.at/marketing/mbc/mbc.html)
and the Rasch listserv hosted by the Australian Council of Educational Research
(http://mailinglist.acer.edu.au/mailman/listinfo/rasch). A Facebook group aimed at Rasch measurement is also available (http://www.facebook.com/groups/
raschmeasurement). For up-to-date research articles using Rasch measurement,
we recommend the Rasch Measurement Transactions (http://www.rasch.org/
rmt/contents.htm), which is the official newsletter of the Rasch Measurement
Special Interest Group (http://www.raschsig.org/). The Institute of Objective
Measurement offers a useful website that summarizes Rasch-friendly journals (i.e.,
journals publishing research using Rasch analysis), upcoming conferences, book
titles and much more information relating to Rasch analysis (http://rasch.org/).
Further Reading
• Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement
in the human sciences. An accessible, detailed introduction to the Rasch model.
The book does not use examples from L2 research but covers a broad range
of issues useful for practitioners and researchers in our field.
• McNamara, T. (1996). Measuring second language performance. This book was
the first introduction of the Rasch model to L2 research. It is a very detailed
and accessible step-by-step guide on how to interpret the different aspects of
a Rasch analysis, although the main focus is on the many-facet model. This
book is now out of print, but a scanned copy can be obtained free of charge
on Tim McNamara’s website (http://languages-linguistics.unimelb.edu.au/
academic-staff/tim-mcnamara).
• Green, R. (2013). Statistical analyses for language testers. This book includes
screenshots and step-by-step instructions on how to conduct and interpret a
Rasch analysis. This book is very accessible to complete beginners and could
be used as a starting block for further reading about Rasch.
• Eckes, T. (2011). Introduction to many-facet Rasch measurement. This volume
focuses entirely on the many-facet Rasch model. It provides detailed infor-
mation on how to interpret the output tables and also covers some more
advanced topics.
Discussion Questions
1. Choose a number of L2 research studies that have made use of Rasch analysis.
a. Is it clear which Rasch model was used in the analysis?
b. Are the research questions clearly stated and answerable?
c. Are the analyses described in a clear, replicable manner?
d. Are the results presented clearly?
2. In some research designs, subjects have to be tested before and after treat-
ment. It is not usually advisable to use the same test again, because of test
familiarity effects. How does Rasch analysis help get around this problem?
3. One of the differences between CTT and Rasch analysis is that the latter
factors in the quality of the data used to estimate person characteristics, item
difficulties, rater qualities, and so on. In what aspects of the output is there
evidence of this feature of Rasch analysis?
4. Read the sample data set, which can be downloaded from this book’s companion website (http://oak.ucc.nau.edu/ldp3/AQMSLR.html), into Winsteps, using the procedures described in the chapter.
a. What information can you learn from the Wright map?
b. Is the spread of test items well suited to the test takers?
c. Are there any items that are misfitting or overfitting?
d. Is there any information given by the Rasch analysis that you could not
readily learn from an analysis using classical test theory?
Notes
1. A discussion of the two methods can be found in Linacre (1999).
References
Andrich, D. (1978a). A general form of Rasch’s extended logistic model for partial credit
scoring. Applied Measurement in Education, 4, 363–378.
Andrich, D. (1978b). A rating scale formulation for ordered response categories. Psy-
chometrika, 43, 561–573.
Bond, T., & Fox, C. (2007). Applying the Rasch model: Fundamental measurement in the human sciences. New York: Routledge.
Cunningham, T. H., & Graham, C. R. (2000). Increasing native English vocabulary rec-
ognition through Spanish: Cognate transfer from foreign to first language. Journal of
Educational Psychology, 92, 37–49.
Eckes, T. (2011). Introduction to many-facet Rasch measurement. Frankfurt: Peter Lang.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to
enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175–196.
Green, R. (2013). Statistical analyses for language testers. New York: Palgrave.
Iramaneerat, C., Smith, E. V., & Smith, R. M. (2008). An introduction to Rasch measurement. In J. Osborne (Ed.), Best practices in quantitative methods. Thousand Oaks, CA: Sage.
upon such findings by creating a means for estimating future group member-
ship as efficiently and reliably as possible on the basis of a set of measurable
phenomena.
Discrim is a unique application of the multivariate ANOVA (MANOVA)
family, falling within the general linear model approach to inferential statistics
(see Plonsky, Chapter 1 in this volume). It essentially turns MANOVA around
and treats the independent variable (a single grouping factor of some kind) as
the criterion or dependent variable, and the dependent variables (a set of interval
measures of various phenomena) as the predictor or independent variables. The
terminology can therefore become somewhat confusing when moving between
MANOVA and Discrim, so care must be taken to keep in mind precisely what is
being referred to by labels like “independent” or “dependent” variable. Discrim
is most meaningful when applied to naturally occurring groups that are mutually
exclusive and exhaustive of the phenomenon of interest (e.g., individuals that
belong to only one global proficiency level, or texts that can only be classified a
priori as one type of genre or another). Where groups are artificially or arbitrarily
created (e.g., by separating cases above and below the median value of a given
measure), Discrim is typically less effective and/or more difficult to interpret (and
alternative techniques that search for groups, rather than investigate predictability of group membership, may be more appropriate, such as factor analysis; see Loewen & Gonulal, Chapter 9 in this volume).
Where Discrim is used to distinguish cases belonging to a grouping factor
with only two levels, such as first-language (L1) versus second-language (L2)
speakers, it is mathematically identical to MANOVA, with the added benefit of
identifying proportions of cases accurately and inaccurately classified as one or the
other group as well as which variables are best able to do so. However, Discrim is
potentially much more interesting when analyzing membership of cases in more
than two groups. Here, Discrim works by combining the measured variables into
functions (similar to factors in factor analysis), which are essentially new latent
(unobservable) variables based on linear combinations of observable phenomena.
Functions are created mathematically by weighting the contribution of each pre-
dictor variable (based on its correlation with the grouping variable) in different
ways, and then looking at which combination of weighted values is the most dis-
criminatory between the different groups. With multiple groups (i.e., more than
two), there may be multiple ways of weighting and combining measured variables
in order to distinguish between different pairings/sets of groups: It is possible, even
likely, that different combinations of measures are more discriminatory between
certain groups than other groups. Thus, in the CEFR example earlier, it may
be that one function that more heavily weights certain measures (perhaps basic
syntax and pronunciation) is better at discriminating between lower proficiency
levels (A1, A2) whereas another function, emphasizing other measures (say, lexi-
cal variety, morphological accuracy, and fluency), may be better at discriminating
Discriminant Analysis 307
among higher proficiency levels (B2, C1, C2). Luckily, as the math involved can
become extremely complicated, with multiple measures and multiple groups (see
brief math demonstration in Tabachnick & Fidell, 2013), statistical software appli-
cations like SPSS automatically do all of the math for us, so the only real challenge
in applying Discrim is to make sure that it is set up and interpreted correctly.
Discrim applications also provide very useful tables and figures in the output that
help the researcher focus directly on the most important findings.
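The mechanics of combining weighted measures into group-separating functions can be sketched with a minimal hand-rolled linear discriminant classifier (numpy only; equal priors assumed; simulated data for three hypothetical proficiency groups). In practice one would rely on SPSS, R, or a library implementation rather than this sketch:

```python
import numpy as np

def lda_fit(X, y):
    """Minimal linear discriminant classifier: class means plus pooled
    within-group covariance (equal priors assumed)."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                 for c in classes) / (len(y) - len(classes))
    return classes, means, np.linalg.inv(pooled)

def lda_predict(model, X):
    """Assign each case to the group whose linear score function is highest."""
    classes, means, inv = model
    scores = np.column_stack([
        X @ inv @ means[c] - 0.5 * means[c] @ inv @ means[c] for c in classes
    ])
    return classes[np.argmax(scores, axis=1)]

# Three hypothetical proficiency groups measured on two predictors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, size=(30, 2)) for m in ([0, 0], [2, 0], [1, 2])])
y = np.repeat(["low", "mid", "high"], 30)

model = lda_fit(X, y)
accuracy = float((lda_predict(model, X) == y).mean())
```

The proportion of cases correctly reclassified (here, `accuracy`) is the kind of figure a Discrim output table reports for each group.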
Note that Discrim is also similar to logistic regression (LR) and cluster analysis
(CA). A major difference with LR is that Discrim adopts stricter assumptions
regarding the normality of variables in data sets and within the population of
interest, while LR makes no distributional assumptions at all about predictor
variables or the linearity of relationships with criterion variables. LR is therefore
much more flexible, but also less powerful, depending on the qualities of the data.
A major difference with CA is that, in Discrim, the number and definition of
groups into which membership is being predicted is known a priori, whereas in
CA the number of predictable “clusters” or groups is not known until the analysis
is completed.
Though one of the lesser utilized statistical approaches within applied linguis-
tics research, Discrim has featured sporadically across multiple domains of inquiry.
For example, in language testing research, Discrim has been used to investigate
the elements of rating scales and rubrics that distinguish between hierarchical
levels of oral test performance (e.g., Fulcher, 1996; Norris, 1996), to examine the
accuracy of criterion-referenced testing and pass-fail decisions (e.g., Robinson &
Ross, 1996), and to explore test-method effects (e.g., Zheng, Cheng, & Klinger,
2007). In L2 composition research, questions regarding which features of writing
performance (e.g., syntactic, lexical, discoursal) are best able to distinguish among
holistically rated higher or lower compositions have been addressed through
Discrim (e.g., Ferris, 1994; Homburg, 1984; Oh, 2006; Perkins, 1983). Reading
researchers have employed Discrim to investigate effects of lexical transfer and
reading processes on comprehension (e.g., Koda, 1989; Nassaji, 2003).
Perhaps the most frequent application of Discrim within applied linguistics
research has come from corpus linguistics. Within this broad domain, research-
ers have utilized Discrim to investigate writing quality (e.g., Crossley & McNa-
mara, 2009; McNamara, Crossley, & McCarthy, 2010), genre identification (e.g.,
Martín Martín, 2003), register variation (e.g., Biber, 2003), the deployment of
specific grammatical phenomena in language use (such as particle placement,
Gries, 2003), and L2 learner production (e.g., Collentine, 2004; Collentine &
Collentine, 2013), to name a few examples. Discrim has also featured from time to
time in research on L2 interactional strategies (e.g., Rost & Ross, 1991), anxiety
effects on learning (e.g., Ganschow & Sparks, 1996), mother tongue maintenance
(e.g., Okamura-Bichard, 1985), language impairment (e.g., Gutiérrez-Clellen &
Simon-Cereijido, 2006), motivation and personality relations with proficiency
308 John M. Norris
(e.g., Brown, Robson, & Rosenkjar, 2001), and the effectiveness of self-instruction
(e.g., Jones, 1998).
Variables
Discrim begins with the identification of grouping and predictor variables.
A grouping variable typically takes the form of a categorical factor of some kind,
often the causal variable that has been operationalized in a study and already ana-
lyzed within MANOVA as an independent variable. Grouping variables can have
two or more levels, and each case in the analysis must belong to one and only
one level of the grouping variable. For example, grouping variables might be text
types (argumentative, narrative, etc.), experimental groups (explicit, implicit, con-
trol, etc.), proficiency levels (low, medium, high, etc.), and so on. In our example
study, Davis (2012) identified three groups of foreign language (FL) programs
based on the extent to which they self-reported low, medium, or high degrees of
using and learning from outcomes assessment. Note that, although the programs
rated their degrees of use on a rating scale, the nature of the scale categories
was essentially categorical: respondents self-identified their programs according
to how much they utilized or learned from doing outcomes assessment, from low
to high (i.e., the distinctions between the levels of the grouping variable are by
no means arbitrary). “Level of learning from assessment use,” then, is the grouping
variable for this study, and membership in one of these three levels is what we will
try to predict on the basis of a set of measures.
Predictor variables consist of interval-scale measures of whatever phenomena
will be used in attempting to classify individual cases according to their mem-
bership in the levels of the grouping variable. Predictor variables typically come
from the measures that have been operationalized and analyzed in MANOVA as
dependent variables; the purpose of Discrim is to investigate how these measures predict membership in the levels of the grouping variable.
Assumptions
Prior to initiating Discrim, the standard assumptions for multivariate statistical
analyses should be checked in the full data set. Essentially, the assumptions for
Discrim are the same as they are for MANOVA: independence of observations
on each variable, univariate and multivariate normality, homogeneity of variance
and covariance, and no evidence of multicollinearity. As these assumptions are
discussed elsewhere (e.g., Jeon, Chapter 7 in this volume; Tabachnick & Fidell,
2013), I will not describe them in detail here. Suffice it to state that violation of
these assumptions will affect the quality and reliability of the Discrim analysis, so
they should be taken seriously. Where violations are encountered, steps should be
taken to select alternative appropriate analyses or adjust data such that the analysis
is not threatened. However, I will make mention of three assumptions that are
particularly important for Discrim in that they may have undue effect on the
outcomes of the analysis: (a) Outliers may exert considerable influence, especially
at lower sample sizes, so care should be taken to inspect (graphically) the distribu-
tions of cases on each predictor measure for each level of the grouping variable.
Where identified, serious outliers should be eliminated or their scores adjusted.
(b) Sufficient sample size is also critical for interpreting a Discrim analysis; as a
310 John M. Norris
rule of thumb, at a minimum the smallest group sample size must contain more
cases than the total number of predictor variables to even begin to trust the solu-
tion from the Discrim analysis; more typically, a criterion of 20 cases per predictor
is adopted to avoid problems of overfitting the model with too many predictors.
Finally, (c) multicollinearity should be considered carefully prior to entering pre-
dictor variables into the analysis; high correlations (a typical criterion is r > .70)
between predictor variables reduce the power of the analysis and may confuse the
determination of discriminant functions due to superfluous variables. Where high
correlations are identified, a single marker variable should be selected and other
correlated variables eliminated from the analysis (note that the issue of determin-
ing the magnitude of correlation that should be considered “too much” overlap
is far from resolved in black-and-white terms; see Tabachnick & Fidell, 2013, for
discussion).
In our example study, no severe violations of assumptions were found on the
data set of 90 FL program survey respondents. In particular, significant outliers
were not identified for any of the predictor measures within any of the three
groups, and none of the measures correlated with any other measure at higher
than r = .60. The smallest group sample size, n = 20 for the “low” assessment use
level, was higher than the total number of predictors (n = 9).
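These screening steps can be sketched in code. The following is a minimal Python sketch (not part of the chapter's SPSS workflow; the function name and return keys are illustrative assumptions) that checks the pairwise-correlation criterion and the two sample-size rules of thumb on a predictor matrix:

```python
import numpy as np

def screen_predictors(X, groups, r_max=0.70, cases_per_predictor=20):
    """Screen predictors against the Discrim assumptions discussed above.

    X      : (n_cases, n_predictors) array of interval-scale measures
    groups : (n_cases,) array of group labels
    """
    n_cases, n_pred = X.shape
    corr = np.corrcoef(X, rowvar=False)
    pairs = [(i, j, float(corr[i, j]))
             for i in range(n_pred) for j in range(i + 1, n_pred)]
    smallest = min(int(np.sum(groups == g)) for g in np.unique(groups))
    return {
        "max_abs_r": max(abs(r) for _, _, r in pairs),
        # Pairs exceeding the multicollinearity criterion (e.g., r > .70)
        "collinear_pairs": [p for p in pairs if abs(p[2]) > r_max],
        # Minimum requirement: smallest group exceeds the predictor count
        "min_rule_ok": smallest > n_pred,
        # Stricter rule of thumb: roughly 20 cases per predictor overall
        "strict_rule_ok": n_cases >= cases_per_predictor * n_pred,
    }
```

Applied to data shaped like the example study (90 cases, 9 predictors, groups of 20, 28, and 42), the minimum sample-size rule would pass, although the stricter 20-cases-per-predictor criterion would not.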
In the SPSS dialog, we select the measures that will be used to predict group membership and move them into the Independents window (see Figure 13.3).
Figure 13.3).
In SPSS we are also prompted at this point to choose a particular approach to
entering the data in the statistical analysis (also shown in Figure 13.3). Note that it
is possible to analyze the predictor measures in a particular stepwise or sequential
order (e.g., the most important or statistically strongest first, followed by others
after that one is factored out); however, we need a theoretical/logical reason to do
so. In this case, in the absence of any particular reason to look at the effects of one
measure first and others in a particular order, we are going to treat all measures
equally (and that is typically the case for Discrim in L2 research). That means we
will select Enter independents together to enter them all at once, with no particular
order. This approach is also known as Direct Discriminant Analysis.
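Outside of SPSS, a direct discriminant analysis of this kind can be approximated with scikit-learn's LinearDiscriminantAnalysis, which likewise enters all predictors at once. The sketch below uses simulated stand-in data (the survey data themselves are not reproduced here, and the group-shift construction is an illustrative assumption):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative stand-in for the 90-program survey data: X holds nine
# assessment-capacity measures, y the Processuse level (1, 2, 3).
rng = np.random.default_rng(42)
y = np.repeat([1, 2, 3], [20, 28, 42])
X = rng.normal(size=(90, 9)) + y[:, None] * 0.8  # group-shifted scores

# "Enter independents together": all predictors enter at once (direct Discrim)
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# With three groups, at most two discriminant functions are extracted
scores = lda.transform(X)
print(scores.shape)  # (90, 2)
# Proportion of between-group variance carried by each function
print(lda.explained_variance_ratio_)
```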
The next step is to select the statistics that we want to calculate for the Dis-
crim analysis. In SPSS, we click on the Statistics button and a new window pops
open to display a variety of possibilities (see Figure 13.4). What we choose here
depends somewhat on the nature of the data and our approach to analyzing it
(see discussion in Huberty & Olejnik, 2006), but a basic approach will suffice
for most situations. Several options here are quite useful. First, selecting Means
and Univariate ANOVAs will give us the descriptive statistics and ANOVAs for
each group (1, 2, 3) on each of the nine measures. Second, selecting Box’s M will
give us a test of the homogeneity of variance-covariance, which is helpful in
determining whether the MANOVA inferential test is trustworthy or not. Third,
selecting Fisher’s under Function Coefficient will provide us with the newly cal-
culated average value for each group on each measure based on the newly created
the same number of cases within the smallest group as we have predictor variables,
so we will not conduct a cross-validation on this data set. Also, if we are really
interested in how each specific case was classified (accurately or not), then we can
select Casewise results, which will produce a table with each of the individual cases
and the group that it was predicted to be in (based on the combined measures), as
well as the group that it was actually in (based on the grouping variable). Finally,
under Plots we can also ask for graphs that help us to conceptualize the overall
analysis. In our case we will select Combined-groups, which will show us on a single
graph how well the functions differentiate each group.
At this point, our Discrim is ready to be processed, so all we have to do is
click on OK. One additional note is in order at this point, prior to examining
the output of results: It is common practice to run multiple Discrim procedures
by varying the number of predictor variables entered into the analysis. While this
approach may increase Type I error overall, it is also quite useful for determining
how well different subsets of variables combine to predict group classifications.
I will not explore this approach in the current chapter, but I refer readers to
several of the resources in Further Reading for examples of L2 studies that have
done so.
Interpreting Output
Most statistical software applications, SPSS included, provide copious output for
Discrim in the form of tables and figures. Fundamentally, we are interested in
finding answers to questions such as: (a) How well was the set of measures, as a
whole, able to predict group membership? (b) Were some groups more highly
predictable than others? and (c) Which individual measures seem to be the best
predictors of group membership? While only portions of the output are particu-
larly useful for interpreting the main findings, it is helpful to understand the basics
of what all is calculated. The first substantive table produced in SPSS output is
called “Group Statistics.” This table shows the means and standard deviations for
each of the three groups (1, 2, 3) on each of the nine measures in our example
data set. We can skim through it and compare values on a given measure between
the three groups, and we should begin to see which measures may be the best at
predicting differences across the groups.
The next table shows the individual ANOVA results for each of the predictor
variables; in other words, it reports whether each measure on its own showed sta-
tistically significant differences across the levels of the grouping variable. Table 13.1
shows the output for our sample data. Here we see whether there is an overall
significant effect across the three groups (1, 2, and 3 on Processuse) for each of
the measures. Clearly, the answer is yes, as indicated by the very small p values in
the final column. Note that we do not know where the differences might be yet
(between which pairs of 1, 2, and 3), just that there is an overall significant effect
for each measure. Measures that do not show a significant effect here will not
contribute to the predictions later on in the analysis. Note here that Wilks’ lambda
and the F value are inversely related: the smaller the lambda, the higher the
F, and the greater the effect of the given measure. At this point, then, we should
be able to identify which of the measures is likely to be the strongest predictor of
group differences (i.e., the measure with the smallest lambda).
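A rough equivalent of this univariate screening table can be computed outside SPSS. The sketch below (illustrative, using scipy) runs a one-way ANOVA per predictor and also computes the univariate Wilks' lambda as within-group SS divided by total SS, so the inverse lambda-F relationship can be seen directly:

```python
import numpy as np
from scipy.stats import f_oneway

def univariate_screen(X, groups):
    """One-way ANOVA for each predictor across the grouping variable.

    Returns (F, p, lam), where lam is the univariate Wilks' lambda
    (within-group SS / total SS); smaller lambda goes with larger F.
    """
    labels = np.unique(groups)
    F, p, lam = [], [], []
    for col in X.T:
        parts = [col[groups == g] for g in labels]
        f_val, p_val = f_oneway(*parts)
        ss_within = sum(((part - part.mean()) ** 2).sum() for part in parts)
        ss_total = ((col - col.mean()) ** 2).sum()
        F.append(f_val); p.append(p_val); lam.append(ss_within / ss_total)
    return np.array(F), np.array(p), np.array(lam)
```

Because F is a monotone decreasing function of the univariate lambda, the predictor with the smallest lambda is always the one with the largest F.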
The next several tables of output help us to review the assumptions of the
multivariate family of analyses. A bivariate correlations table is provided, showing
the relationships between each pair of the nine predictor variables. In our exam-
ple data set, we find that the correlations are all positive and range from r = .14
to r = .60; thus, although there are obviously some overlapping relationships here,
none is strong enough to suggest multicollinearity, and hence all of the measures
can be included safely in the Discrim analysis. The next two tables show the Box's
M tests of equality of covariance matrices. The log determinants show a measure
of variability for the predictor measures combined in each group—here, we hope
that they are relatively similar, meaning that variability is not radically different on
the combined measures between each group. In Table 13.2, we see that the vari-
ability is indeed quite similar for our sample data. Box’s M then tests the signifi-
cance of that comparison of variability between the groups on the nine measures.
For our data, the test is not significant ( p = .248), and given the relatively similar
values in the log determinants output, it is safe to assume that covariance is not
overly heterogeneous.
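For readers working outside SPSS, Box's M can be computed by hand from the group covariance matrices. The sketch below implements the standard chi-square approximation (a simplification; SPSS reports an F approximation, so p values may differ slightly):

```python
import numpy as np
from scipy.stats import chi2

def box_m(X, groups):
    """Box's M test of equality of covariance matrices, with the standard
    chi-square approximation: per-group log determinants, the pooled log
    determinant, the M statistic, and an approximate p value."""
    labels = np.unique(groups)
    k, nvar = len(labels), X.shape[1]
    ns = np.array([np.sum(groups == g) for g in labels])
    covs = [np.cov(X[groups == g], rowvar=False) for g in labels]
    pooled = sum((n - 1) * S for n, S in zip(ns, covs)) / (ns.sum() - k)
    logdets = [np.linalg.slogdet(S)[1] for S in covs]
    pooled_logdet = np.linalg.slogdet(pooled)[1]
    # M compares the pooled log determinant with the group log determinants
    M = (ns.sum() - k) * pooled_logdet - sum(
        (n - 1) * ld for n, ld in zip(ns, logdets))
    # Correction factor for the chi-square approximation
    c = ((2 * nvar**2 + 3 * nvar - 1) / (6 * (nvar + 1) * (k - 1))
         * (np.sum(1.0 / (ns - 1)) - 1.0 / (ns.sum() - k)))
    df = nvar * (nvar + 1) * (k - 1) / 2
    return {"log_dets": logdets, "pooled_log_det": pooled_logdet,
            "M": float(M), "df": df, "p": float(chi2.sf(M * (1 - c), df))}
```

As in the SPSS output, similar per-group log determinants and a nonsignificant p value indicate that covariance heterogeneity is not a serious concern.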
With the subsequent output, the findings specific to Discrim begin in ear-
nest. First, we encounter information regarding the calculated discriminant func-
tions themselves. A function is a combination of measures used to predict group
membership (conceptually kind of like a factor in factor analysis, see Loewen &
Gonulal, Chapter 9 in this volume; and similar to a cluster in cluster analysis, see
Staples & Biber, Chapter 11 in this volume). Discrim tries to find the best linear
combination of measures to distinguish among groups. Table 13.3 shows that in
our example data set, since there are three groups, Discrim only tries to make
two distinctions (hence, two functions). Each function predicts a certain amount
of the total variance that can be accounted for in the three groups. For our data,
TABLE 13.2 Box's M output for testing homogeneity of covariance across three groups

Log Determinants

Processuse              Rank   Log Determinant
1                          9          –16.303
2                          9          –16.258
3                          9          –14.130
Pooled within-groups       9          –13.900

The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Test Results

Box's M          118.774
F    Approx.       1.097
     df1          90
     df2          11,575.999
     Sig.           .248

Tests null hypothesis of equal population covariance matrices.
TABLE 13.3 Eigenvalues and Wilks' lambda output for the discriminant functions

Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.441a       91.0             91.0          .768
2           .142a        9.0            100.0          .352

a. First two canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1 through 2           .359            85.059       18   .000
2                     .876            10.994        8   .202
the first function is doing the lion’s share of prediction (91%), with a correspond-
ingly large eigenvalue (a measure of the variance attributable to the function,
but not very interpretable in practical terms). The canonical correlation shows
that the first function is also quite highly correlated with the grouping variable
(Processuse).
Wilks’ lambda is a significance test of the overall ability of the functions to
identify differences between groups. Here again, the smaller the lambda, the
greater the predictive power. The lambda value actually shows the proportion of variance that the model cannot explain (so, with the combined Functions 1 and 2, a lambda of .359 means that around 64% of the variance is explained). Note that the first lambda test is of both
functions combined. The second test is for Function 2 alone, after the variance
attributable to Function 1 has been factored out. Here we see that Function 2
on its own does not significantly distinguish between the three groups. However,
it does add some additional discrimination (most likely, it discriminates between
two groups but not all three).
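The quantities in these two tables are linked by standard identities: each function's canonical correlation is the square root of eig / (1 + eig), and the overall Wilks' lambda is the product of 1 / (1 + eig) across functions. Plugging in the eigenvalues from the output above reproduces the reported values (the table's .352 simply reflects the unrounded eigenvalue):

```python
import math

# Eigenvalues for the two functions, from the output above
eigs = [1.441, 0.142]

# Canonical correlation of each function: r = sqrt(eig / (1 + eig))
canon = [math.sqrt(e / (1 + e)) for e in eigs]

# Overall Wilks' lambda for both functions: product of 1 / (1 + eig)
wilks = 1.0
for e in eigs:
    wilks *= 1 / (1 + e)

print([round(r, 3) for r in canon])  # [0.768, 0.353]
print(round(wilks, 3))               # 0.359
print(round(1 - wilks, 2))           # 0.64 -> variance accounted for
```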
The next several tables provide indications of the extent to which each pre-
dictor variable is related to each of the discriminant functions. Standardized
canonical function coefficients are just standardized values (like z-scores) for each
measure that are used to calculate the overall discriminant function. These are not
very interpretable at face value, but we can already start to see which measure
is contributing the most to each function. So, the larger the magnitude of the
coefficient, the more influence or weight that measure has on the function. The
structure matrix table (Table 13.4) is somewhat more interpretable. It shows the correlation between each measure and the two functions that have been created by the analysis (similar to 'loadings' in factor analysis). The * next to the correlations indicates the function that the particular measure is most highly correlated with.

TABLE 13.4 Relationship output for individual predictor variables and functions

Structure Matrix

            Function
              1        2
ActCondA    .883*    –.333
ColA        .697*     .361
ComA        .626*     .072
LeadA       .592*    –.040
InfraA      .499*    –.393
ProgSupA    .350*     .137
InstSupA    .299*     .152
CulEthoA    .527      .603*
InstGovA    .369     –.402*

*. Largest absolute correlation between each variable and any discriminant function

In Table 13.4, it is clear that most of the measures in our example data set
correlate with Function 1, but only about half are highly correlated (i.e., above
.50). For Function 2, only two measures correlate more highly with it than with Function 1. Also, if we
needed to pick a single measure to represent each function, it would be the mea-
sure that correlates most highly with it—we can refer to these as marker variables.
ActCondA would be a very good marker variable for the first function (very
highly correlated), while CulEthoA is a moderately strong marker variable for
Function 2.
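A structure matrix of this kind is just the set of correlations between each predictor and the case-level discriminant scores. A minimal sketch (using scikit-learn; the data here are simulated, not the chapter's survey data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def structure_matrix(X, y):
    """Correlations between each predictor and each discriminant function's
    scores, i.e., the kind of structure matrix shown in Table 13.4."""
    scores = LinearDiscriminantAnalysis().fit(X, y).transform(X)
    mat = np.empty((X.shape[1], scores.shape[1]))
    for i in range(X.shape[1]):
        for j in range(scores.shape[1]):
            mat[i, j] = np.corrcoef(X[:, i], scores[:, j])[0, 1]
    return mat  # row = predictor, column = function
```

The row with the largest absolute correlation in a given column identifies that function's marker variable.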
The final information that we receive about the functions themselves is a
table showing the mean values calculated for each group (Groups 1, 2, 3 from our
Processuse grouping variable) on each of the functions created by the analysis.
The values do not have any particularly interpretable meaning on their own;
however, if we compare the groups with each other, we can see how far apart they
are in terms of the functions. For Function 1 in our example data, the difference
between Groups 1 and 3 is 2.851 points, whereas for Function 2, the difference
between 1 and 2 (because they are the most different of the three groups) is only
.974. So, clearly Function 1 is much better at distinguishing between the groups.
The values from this table are represented graphically in Figure 13.6.
The third part of the Discrim output shows classification statistics in several
different tabular and graphic formats, arguably the most useful aspect of the analy-
sis. A first table (“Prior probabilities,” not shown here) just reminds us of the prob-
ability level that was used to estimate group membership. Recall for our example
data that we asked for probabilities based on sample size, so the table shows us that
sample size was used for estimating the size of each group. The next table shows
an overall average value calculated for each group on each measure, based on the
new “scale” of the discriminant functions created by the analysis (see Table 13.5).
[Figure 13.6 image: individual cases (circles, one color per Processuse level 1, 2, 3) and group centroids (squares), plotted with Function 1 on the horizontal axis (−4 to 4) and Function 2 on the vertical axis (−4 to 4)]
FIGURE 13.6 Two-dimensional output for three group average values on two discrimi-
nant functions
We can compare between the three groups on each measure to see which are the
farthest apart—the measures that have widely differing group values are the ones
that will be the best predictors of group membership. Here, again, we see that
ActCondA has very different values for each group, so it will definitely be the best
predictor. Others are clearly discriminating between two groups but not all three
(e.g., ColA between Groups 2 and 3, but not between 1 and 2), and still others
discriminate very little across groups (e.g., InstSupA).
A variety of figures are also provided in the output, depending on what we
have requested in the setup of the analysis. For our example data, we requested a
plot of the group centroids and individual case values for each function, as shown
in Figure 13.6. This curious figure provides a two-dimensional representation of
the ways in which the analysis is able to separate each group and each case. Note
that cases are individual circles (color-coded for each level of Processuse). The
squares (centroids) are essentially an average value for each group on the nine
predictor measures combined. We can read the graph in two ways: (a) look for
distance between each centroid from left to right (here we see a lot of distance
between Group 3 and the others, based on information from Function 1); and
(b) look for distance between each group from top to bottom (here, we do not
see much distance, although 1 and 2 are separated more from each other, but not
much from Group 3, based on Function 2). In essence, then, this figure is showing
that the analysis is pretty highly capable of separating groups by Function 1 (i.e.,
according to a certain set of predictor variables), and then marginally capable of
additional separation by Function 2 (i.e., according to another set of predictor
variables).
Finally, at the very end of the classification output is the information we are
probably most interested in. As shown in Table 13.6 for our example data, Dis-
crim estimates the numbers and percentages of individual cases whose group
membership has been correctly predicted by the combined functions. In other
words, using the combined information from the two functions (i.e., nine mea-
sures of assessment capacity in these data), the analysis was able to correctly pre-
dict the group membership of 73% of the cases (not bad!). We can also see that
TABLE 13.6 Classification Resultsa

                         Predicted Group Membership
            Processuse   1       2       3       Total
Original    Count   1    15       5       0       20
                    2     6      15       7       28
                    3     1       5      36       42
            %       1    75.0    25.0     .0     100.0
                    2    21.4    53.6    25.0    100.0
                    3     2.4    11.9    85.7    100.0

a. 73.3% of original grouped cases correctly classified.
the predictions were quite a bit higher for group 3 (i.e., the high assessment
use group), not bad for Group 1 (the low assessment use group), and quite a
bit weaker for Group 2 (the mid-assessment use group). Given that we were
interested in predicting three levels of the grouping variable, chance would suggest approximately 33% correct classification for each group; thus, while 53% is not very
accurate (half correct, half incorrect), it is actually quite a bit higher than chance
in this analysis.
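The classification table and its accuracy figures are straightforward to reproduce from case-level predictions. The sketch below rebuilds the reported cell counts and computes per-group and overall accuracy:

```python
import numpy as np

def classification_summary(actual, predicted, labels=(1, 2, 3)):
    """Cross-tabulate actual vs. predicted group membership, as in the
    classification results table above, plus per-group and overall accuracy."""
    table = np.array([[np.sum((actual == a) & (predicted == b))
                       for b in labels] for a in labels])
    per_group = table.diagonal() / table.sum(axis=1)
    overall = table.diagonal().sum() / table.sum()
    return table, per_group, overall

# The chapter's counts, rebuilt as case-level labels
actual    = np.repeat([1, 2, 3], [20, 28, 42])
predicted = np.concatenate([
    np.repeat([1, 2, 3], [15, 5, 0]),   # Group 1: 15 of 20 correct
    np.repeat([1, 2, 3], [6, 15, 7]),   # Group 2: 15 of 28 correct
    np.repeat([1, 2, 3], [1, 5, 36]),   # Group 3: 36 of 42 correct
])
table, per_group, overall = classification_summary(actual, predicted)
print(np.round(per_group, 3))  # per-group accuracy: .75, .536, .857
print(round(overall, 3))       # 0.733
```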
Reporting Findings
When reporting the findings of a Discrim, it is important to include sufficient
details regarding the nature of the grouping and predictor variables, how statistical
assumptions were checked (and any adjustments made), the setup of the analysis,
and the essential descriptive and inferential statistical details that will allow readers
to both understand the approach adopted and judge the findings on their own.
Following is a basic example of a report based on our example data set.
Results
A discriminant function analysis was conducted to predict the level of
Process use (i.e., learning from and acting on assessment information)
reported by foreign language programs based on several measures of assess-
ment capacity. Low, mid, and high Process use groups were determined by
self-reported scores on a separate survey. Nine predictor variables (mea-
sures of assessment capacity) were included in the analysis: institutional
support, institutional governance, infrastructure, program support, leader-
ship, culture/ethos of assessment, collaboration, communication, and activi-
ties/conditions for assessment. Multivariate assumptions for data quality
were met, and the relatively large sample size (N = 90) as well as sufficient
within-group sample sizes (n = 20, 28, 42) suggested that the analysis would
be robust to some variations in data quality between groups and predic-
tor variables, and despite inequality in group sample sizes. A test of equality of group means indicated statistically significant (p < .05) differences
between the three Process use groups on each of the nine variables, and
bivariate correlations between all pairs of variables ranged from .14 < r <
.61. Given that each variable on its own predicted group differences and
no evidence of multicollinearity was found, all predictor variables were
retained for further analysis. Box’s M ( p = .248) indicated no heterogene-
ity of variance-covariance, hence the subsequent multivariate discriminant
function analysis was inspected.
The analysis identified two discriminant functions, the first accounting
for the large majority (91%) of observable variance across the three Process
use levels. An overall statistically significant effect was found for the com-
bined functions (1 and 2), Wilks’ lambda = .359, χ2(18, N = 90) = 85.059,
p < .001, indicating that the combined predictor variables were able to
account for around 64% of the actual variance in Process use between
the three groups. On its own, the second function did not provide addi-
tional statistically significant predictions, Wilks’ lambda = .876, χ2(9,
N = 90) = 10.994, p = .202. As shown in Table 1, Function 1 was best
represented by the measure of activities/conditions for assessment, which
correlated at .883 with the function; additional strongly correlating mea-
sures included collaboration and communication. Function 2, by contrast,
was best represented by the measure of culture/ethos of assessment, which
correlated moderately (r = .603) with the function; note that institutional
governance correlated negatively and moderately with Function 2, but
positively with Function 1, suggesting a somewhat ambiguous relationship
between this variable and predictions of Process use.
TABLE 1 Structure matrix: correlations between predictor variables and the two discriminant functions

            Function
              1        2
ActCondA    .883*    –.333
ColA        .697*     .361
ComA        .626*     .072
LeadA       .592*    –.040
InfraA      .499*    –.393
ProgSupA    .350*     .137
InstSupA    .299*     .152
CulEthoA    .527      .603*
InstGovA    .369     –.402*

*. Largest absolute correlation between each variable and any discriminant function

Figure 1 shows the individual cases and group centroids (average values for each group) displayed in two dimensions: (a) from left to right, Function 1 clearly distinguishes between all three groups, and much more so between Group 3 (high Process use) and the other two; (b) from top to bottom, Function 2 additionally distinguishes between Groups 1 and 2, but less so between Group 3 and the other two groups.

FIGURE 1 Predicting process use: Cases and group centroids for two discriminant functions
Finally, Table 2 shows the classification results for the discriminant analysis. Overall, the combined Functions 1 and 2 were able to classify 73% of
the cases correctly into the three levels of Process use. Classification accu-
racy was much higher for Group 3 (the highest Process use level), with 86%
of cases predicted correctly, and substantially lower for Group 2, with only
54% of cases predicted correctly.
Note that the emphasis in reporting Discrim findings is on two types of effect
sizes: First, the strength of relationship between individual predictor variables and
each function is represented by correlations. Second, the overall quality of the
model is represented by the percentages of correctly classified cases; correlations
and percentages are both easily interpreted types of effect size. In this sense, Dis-
crim may automatically help researchers to move beyond relatively meaningless
yes/no statistical significance testing that characterizes typical interpretations of
MANOVA by offering a type of follow-up procedure that centers on magnitude
of relationships and patterns in the data.
SAMPLE STUDY
McNamara, D., Crossley, S., & McCarthy, P. (2010). Linguistic features of writing
quality. Written Communication, 27, 57–86.
Background
McNamara, Crossley, and McCarthy (2010) set out to determine whether
automated measures of cohesion and coherence, as well as syntactic com-
plexity, diversity of words, and other characteristics of words, all provided by
the Coh-Metrix tool, would be predictive of L2 English essays rated to be of
high versus low proficiency on a standardized rubric.
Methods
General: N = 120 English L2 argumentative essays written by college
students
Grouping variable: Essays rated on a six-point holistic rating scale for writing
proficiency; grouped into high (ratings of 4, 5, 6) or low (ratings of 1, 2, 3)
proficiency levels.
Predictor variables: Significance tests were used to determine which indi-
vidual variables, among an initial pool of 53 indices extracted automatically
by Coh-Metrix, showed differences between the two groups. Interestingly,
none of the measures of coherence/cohesion predicted group differences;
the single measure with the largest effect size from each category of predictors was then selected for Discrim.
Discriminant Analysis: Direct Discrim was run on a “training” set of n = 80
cases, and subsequently cross-validated on a “test” set of the remaining
n = 40 cases.
Results
On the training set, Discrim classified essays with 65% accuracy (68% for low-rated essays, 62% for high-rated essays). On the test set, Discrim classified essays with 70% accuracy (73% for low-rated essays, 67% for high-rated essays).
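The training/test design used in this study can be sketched as follows (simulated data standing in for the Coh-Metrix indices; the split sizes mirror the study's n = 80 and n = 40):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Simulated stand-in for the 120 essays (the real Coh-Metrix data are not
# reproduced here): a few index scores shifted by rated proficiency group.
rng = np.random.default_rng(7)
y = np.repeat([0, 1], 60)                   # 0 = low-rated, 1 = high-rated
X = rng.normal(size=(120, 5)) + y[:, None]  # higher scores for high-rated

# Train on n = 80, then cross-validate on the held-out n = 40
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=80, stratify=y, random_state=7)
lda = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
print(round(lda.score(X_tr, y_tr), 2))  # training accuracy
print(round(lda.score(X_te, y_te), 2))  # held-out (cross-validated) accuracy
```

Cross-validating on held-out cases, as McNamara et al. did, guards against the overfitting concern raised earlier in the chapter.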
Further Reading
Examples of L2 Studies Employing Discriminant Analysis
Brown, Robson, & Rosenkjar (2001): Inquired into the relationships of motivation, personality, anxiety, and learning strategies with the L2 proficiency of Japanese learners
of English. Direct Discrim was used to predict learner membership in high, medium,
and low proficiency groups (as determined by a cloze test), and findings indicated that
one set of variables distinguished between low proficiency and the other two groups,
while a second set of variables distinguished between middle and high proficiency
groups, though less accurately.
Collentine (2004): Investigated gain scores on Spanish L2 oral interviews by students in
two distinct learning contexts, study abroad versus regular instruction at home, over
the period of one semester of instruction. Direct and stepwise Discrim were utilized
to indicate which grammatical and lexical features in the oral production best classify
learners in the two learning contexts.
Crossley & McNamara (2009): Compared L1 and L2 written texts on the basis of
numerous measures from the Coh-Metrix computational tool (including, e.g.,
measures of cohesion, text difficulty, and lexical frequency). Multiple Discrim analy-
ses identified the optimal number of variables for achieving maximum distinction
between the two text types; note the use of cross-validation with half of the texts.
Nassaji (2003): Investigated the role of syntactic, semantic, word recognition, and grapho-
phonic processes in determining reading comprehension by adult ESL learners.
Direct Discrim resulted in high levels of classification accuracy between low-skilled
and high-skilled reading groups, most effectively due to lexical-semantic processes but
also attributable to other measured skills. Note the use of cross-validation in this study,
based on two halves of the learner data.
Zheng, Cheng, & Klinger (2007): Examined possible test-method effects of three dif-
ferent item formats (multiple choice, constructed response, and constructed response
with explanations) on the reading comprehension scores of ESL versus non-ESL
examinees. Direct Discrim suggested that while the multiple choice format was able
to distinguish between the two groups significantly, it did so only to a small degree
and it correlated highly with scores on the other two formats, indicating that item
format did not have a substantial effect on overall comprehension score differences
between the two groups.
Discussion Questions
1. First, try to replicate the analyses above with the DISCRIM data set provided
along with this chapter on the book’s companion website (http://oak.ucc.nau.
edu/ldp3/AQMSLR.html). Was your analysis successful? Did you find the same
patterns in the data? Next, access a different data set that has been analyzed
already with MANOVA, and conduct a Discrim analysis using the measured
variables to predict membership in one grouping variable. Compare the two sets
of output from the two multivariate analyses: What research questions can you
answer on the basis of Discrim that could not be answered with MANOVA?
2. Using the same data, conduct several additional Discrim analyses, each one
investigating a different combination of measures (e.g., remove the most pre-
dictive single measure from the first analysis and run Discrim again). In what
ways do the findings differ from one analysis to the next? How do you
Acknowledgments
I would like to thank two anonymous reviewers and Luke Plonsky for construc-
tive feedback on this chapter. I am also indebted to John Davis for making a
portion of his Ph.D. dissertation data available for the purpose of demonstrating
Discrim here. Lastly, I thank my Advanced Statistics students at Georgetown Uni-
versity for their questions and insights related to multivariate statistics, and J. D.
Brown for introducing me to Discrim and other analytic techniques.
References
Biber, D. (2003). Variation among university spoken and written registers: A new multidimensional analysis. In C. Meyer & P. Leistyna (Eds.), Corpus analysis: Language structure
and language use (pp. 47–70). Amsterdam: Rodopi.
Brown, J., Robson, G., & Rosenkjar, P. (2001). Personality, motivation, anxiety, strategies,
and language proficiency of Japanese students. In Z. Dörnyei & R. Schmidt (Eds.),
Motivation and second language acquisition (pp. 361–398). Honolulu: University of Hawai‘i,
Second Language Teaching and Curriculum Center.
Collentine, J. (2004). The effects of learning contexts on morphosyntactic and lexical
development. Studies in Second Language Acquisition, 26, 227–248.
Collentine, J., & Collentine, K. (2013). A corpus approach to studying structural conver-
gence in task-based Spanish L2 interactions. In K. McDonough & A. Mackey (Eds.),
Second language interaction in diverse educational contexts (pp. 167–188). Amsterdam: John
Benjamins.
Hypothesis Testing
In classical hypothesis testing, the null hypothesis that all means are equal is tested
against an alternative that specifies the means are not equal. Using the one-way
ANOVA, researchers first seek to reject the null hypothesis using a preset level of
statistical significance. In many applied research questions, the researcher enter-
tains a hypothesis of mean inequalities, often with a specific hypothesis based on
a one-tailed probability distribution. That is, the researcher hypothesizes not only
that the means differ, but also that they will differ in favor of one of the groups.
This approach is common in the classical experimental group versus control group
contrast, where the null hypothesis predicated on random variation is the bench-
mark against which significant or nonrandom differences are inferred. An alterna-
tive to the null hypothesis–based analysis of variance approach is one grounded
on an informed or theory-driven hypothesis about the ordering of mean scores.
Recent trends in Bayesian data analysis described by Ntzoufras (2009), Kruschke (2011), and Lunn, Jackson, Best, Thomas, and Spiegelhalter (2012), for example,
afford alternatives to null hypothesis testing and are optimal when researchers are
testing hypotheses predicated on grounded theoretical arguments, and in the pres-
ent illustration, for testing framework-driven operationalizations of proficiency.
The conceptual difference between null hypothesis testing and the Bayesian
alternative is that predictions about mean differences are stated a priori in a hier-
archy of differences as motivated by theory-driven claims. Prior research thereby
informs the hypotheses and in Bayesian terms we test an informative hypoth-
esis. In this approach, the null hypothesis is typically superfluous, as the researchers aim to confirm that the predicted order of mean differences is instantiated
in the data. Support for the hierarchically ordered means hypothesis is evident
330 Beth Mackey and Steven J. Ross
only if the predicted order of mean differences is observed. The predicted and
plausible alternative hypotheses must therefore be expressed in advance of the data
analysis, making the subsequent ANOVA confirmatory. In addition to the
advantage of avoiding superfluous null hypothesis testing, the Bayesian approach
avoids the pitfalls of post hoc comparisons of means and the awkward specifi-
cations of planned comparisons across means. The Bayesian approach outlined
in this chapter provides a straightforward system for ordering hypotheses about
mean differences prior to data analysis.
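The logic of weighing an order-constrained (informative) hypothesis against an unconstrained alternative can be illustrated with a small Monte Carlo sketch of the encompassing-prior idea associated with this approach: the Bayes factor of the order hypothesis is the proportion of posterior draws satisfying the constraint (fit) divided by the proportion expected when no constraint is imposed (complexity). This is only a sketch; the posterior means and standard errors below are invented for illustration and do not come from the chapter's data.

```python
import math
import random

# Sketch: Bayes factor for an informative order hypothesis
# H1: mu1 < mu2 < mu3 < mu4 < mu5 versus the unconstrained Ha,
# via the encompassing-prior ratio fit / complexity.

def order_holds(ms):
    """True if the means are in strictly increasing order."""
    return all(a < b for a, b in zip(ms, ms[1:]))

post_mu = [-0.9, -0.4, 0.1, 0.6, 1.1]   # hypothetical posterior means (logits)
post_se = [0.15] * 5                    # hypothetical posterior SDs

random.seed(1)
n_draws = 50_000
hits = sum(
    order_holds([random.gauss(m, s) for m, s in zip(post_mu, post_se)])
    for _ in range(n_draws)
)
fit = hits / n_draws                    # posterior proportion satisfying H1
complexity = 1 / math.factorial(5)      # 1 of 120 equally likely orderings a priori
bf_1a = fit / complexity                # Bayes factor of H1 against Ha
```

With well-separated means, nearly all posterior draws respect the predicted order, so the Bayes factor strongly favors H1 over the unconstrained alternative.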
this level. How well test developers can consistently operationalize the proficiency
framework is fundamental to the construct validity of any framework-based test. In
a five-level framework, the expected hierarchy stipulates that the ordering of means
for any given language test would be: μ1 < μ2 < μ3 < μ4 < μ5. This hierarchy
specifically predicts that the mean of the lowest-level items will be distinctly lower
than that of the next higher level, such that the mean difficulties of the items will
separate into five minimally overlapping distributions along a continuum of dif-
ficulty. As noted earlier, the hierarchical prediction of the ordering of item-difficulty
means provides an explicit test of the validity of the framework for item construction.
To date, a number of frameworks have been proposed in applied linguistics
to describe how assessments can be designed to measure gradations of language
proficiency, and to provide criteria to interpret validity claims. Currently used
frameworks include the Common European Framework of Reference (http://
www.coe.int/t/dg4/linguistic/cadre1_en.asp), the American Council on the
Teaching of Foreign Languages (ACTFL) Proficiency Guidelines (http://www.actfl.org/pub-
lications/guidelines-and-manuals/actfl-proficiency-guidelines-2012), and the
Interagency Language Roundtable Skill Level Descriptions (http://govtilr.org/
Skills/ILRscale1.htm). Most language assessment frameworks are predicated on
functional descriptions of how language is used in a range of contexts represent-
ing varied social and employment-related domains. Test developers strive to sam-
ple specimens of language from those contexts, and to construct items and tasks
that reflect comprehension of propositional content appearing within them, and
in the case of spoken or written language assessments, learners’ ability to speak or
write coherently with fluency and accuracy on tasks representing the functional
domains of interest. A fundamental assumption is that the specimens of language
used for test construction can be accurately arrayed along a continuum of diffi-
culty, and that items written to assess comprehension are matched to the passages
and texts along that ordered continuum of difficulty. Crucial for a validity argu-
ment for frameworks based on subjective classification of language specimens is
the accuracy of the classification system itself.
Test developers select texts and passages and write items at each of the levels
covered by the test. Using this framework, test developers operationalize the scale
by producing test passages targeted at each level. The present study used a sample
of 1,889 test takers drawn from a reading test.
The set of items on the test was initially subjected to a Rasch analysis in order
to estimate the difficulty of each item (see Knoch & McNamara, Chapter 12 in
this volume). As each item has been preclassified by test designers according to
an intended level of difficulty, and extensively checked by moderation panels, the
expectation is that the hierarchy of item difficulty will be corroborated with an
empirical confirmation of the actual item difficulties.
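As a rough illustration of what the logit difficulty scale expresses, an item's difficulty can be approximated by the negative log-odds of its facility (proportion correct). This is not the estimation routine used operationally, which iterates jointly over person abilities and item difficulties; the function name and facility values below are ours, for illustration only.

```python
import math

# Crude first approximation to Rasch item difficulty:
# the negative log-odds of the item's facility (proportion correct).
def approx_difficulty(p_correct):
    return -math.log(p_correct / (1 - p_correct))

# Hypothetical facilities for items written to Levels 1 through 5
facilities = [0.85, 0.70, 0.55, 0.40, 0.25]
logits = [round(approx_difficulty(p), 2) for p in facilities]
# lower facility (a harder item) maps to a higher logit difficulty
```

An item answered correctly by half of the sample sits at 0 logits; harder items rise above 0 and easier items fall below it, which is the ordering the person-item map displays.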
The Rasch item analysis performed on the test generates the observed dif-
ficulty of each item on a logit scale. All of the items on each test are placed on a
relative continuum of facility from easy to difficult. For each group of test takers,
the proficiency of individuals is arrayed in a person-item map showing the dif-
ficulty of test items relative to the ability of test takers. Figure 14.1 shows a sche-
matic array of items from another test form (right) with persons (hashed marks
on left) on the same scale of reference.

[FIGURE 14.1 Person-item map: test takers (hash marks, left column) and items
10001–10070 (right column) arrayed on a common logit scale, with items banded
from Level 0 (easiest) through Level 5 (most difficult).]
Although different methods of setting standards exist, a widely used method
for making proficiency level decisions was based on an analysis of the item pool,
with the cut score for each level set at the person ability estimate whose probability
of a correct response corresponds to a raw score of roughly 70% correct on
the items preidentified as indicators of proficiency at each level. Figure 14.1
illustrates this item-based approach. Test takers able to correctly answer 70% of
the within-category items, as well as 90% of easier items, were deemed to be
proficient at the threshold cut score level. Accordingly, a test taker with an ability
to answer 70% of the Level 3 items in Figure 14.1 would be deemed proficient
at Level 3, but not at Level 4. As noted earlier, the validity of the 70% cut score
decision point is strongly predicated on the homogeneity of items written to the
intended level of difficulty. As the cut score decision point in principle applies to
all languages tested, it presents both a convenient and homogeneous method for
defining common benchmarks for proficiency as well as a formidable validation
challenge. The analyses to follow test a key assumption of the item-based method
of setting standards empirically.
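Under the Rasch model, the 70% criterion has a direct interpretation on the logit scale: a test taker at the cut score sits ln(.7/.3) ≈ 0.85 logits above the mean difficulty of the items at that level. A minimal sketch (the function name is ours, not the test program's):

```python
import math

# Rasch model: p(correct) = 1 / (1 + exp(-(theta - b))).
# Solving for theta at a target probability p gives the ability
# needed to answer items of difficulty b with probability p.
def theta_at_prob(b, p):
    return b + math.log(p / (1 - p))

threshold = theta_at_prob(0.0, 0.70)   # offset for the 70% cut, about 0.85 logits
mastery = theta_at_prob(0.0, 0.90)     # offset for 90% on easier items, about 2.2 logits
```

The same arithmetic underlies the person-item map: a candidate placed 0.85 logits above a band of items is expected to answer about 70% of them correctly.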
Bayesian Informative Hypothesis Testing 333
The difficulty hierarchy predicts that the average logit difficulty of Level 1 items
(μ1) will be systematically less than the mean of items written to measure pro-
ficiency at Level 2 (μ2). The hierarchy expressed in H1 is thus predicted to hold
across languages, as the language assessment framework examined here is written
to be independent of which particular language is described by the scale. The
prediction is that the mean difficulties of test items for the language sampled will
concur with the hierarchy of difficulty. In this sense the intended analyses are
confirmatory: Either H1 is corroborated by the observed data or it is not. Support
for the measurement framework in general will be evidenced through consistent
confirmation of the hierarchy of difficulty.
For a confirmatory model to function, plausible alternatives to the theory-
driven order of hypotheses need to be articulated explicitly. A plausible alter-
native hypothesis, H2, would predict that adjacent categories collapse into the
lower-level categories, suggesting that there is no systematic difference between
particular levels. For instance, foreign language passages chosen to represent Level
2 reading proficiency share characteristics with Level 1 passages, but are selected
such that they contain additional linguistic complexity to make comprehension
slightly more challenging. Similarly, Level 2 items are constructed to entail slightly
more complexity than Level 1 items. If Level 2 items are not in fact any more dif-
ficult than Level 1 items, hypothesis H2 would be supported by the empirical fact
that the logits of difficulty of Levels 1 and 2 will conflate into an indistinguishable
range. Test developers and item writers are well-acquainted with the difficulty of
fine-tuning the linguistic content of items differentiating Levels 1 versus 2 and
Levels 3 versus 4. Thus, on a five-level scale, not only might Level 2 items collapse
down to Level 1, but Level 4 may also conflate into Level 3: μ1 = μ2 < μ3 = μ4 < μ5 (H2).
This is deemed the second most likely outcome, given the item devel-
opment and writing process. Even after considerable moderation by item review-
ers, items at Levels 2 and 4 may be indistinguishable from the next-lower category
of difficulty.
A second plausible alternative hypothesis, H3, predicts that in a five-level hier-
archy adjacent levels collapse into ranges of proficiency at the threshold of the
next higher level. Correspondingly, items written for a Level 2 reading passage
require proficiency nearer to the next higher base level (Level 3) than the level
below it, and Level 4 items are indistinguishable empirically from Level 5 items.
This possibility is less likely than H2, as items at the extremes of the hierarchy are
expected to be relatively easy to write and moderate.
Group codes: 1 = Level 1, 2 = Level 2, 3 = Level 3, 4 = Level 4, 5 = Level 5

Null (H0): μ1 = μ2 = μ3 = μ4 = μ5
Agnostic (Ha): μ1, μ2, μ3, μ4, μ5 (unconstrained)
Hypothesis 1 (H1, predicted): μ1 < μ2 < μ3 < μ4 < μ5
Hypothesis 2 (H2, "collapse down"): μ1 = μ2 < μ3 = μ4 < μ5
Hypothesis 3 (H3, "fold up"): μ1 < μ2 = μ3 < μ4 = μ5
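The competing hypotheses can be made concrete as simple predicate checks on a vector of five estimated mean difficulties. This is only a descriptive illustration, with an arbitrary tolerance standing in for "equal"; it is not the Bayesian model selection the chapter performs, and the tolerance value is ours.

```python
# Predicate versions of the competing hypotheses on a vector of
# five mean difficulties m = [mu1, mu2, mu3, mu4, mu5].
TOL = 0.1  # arbitrary tolerance standing in for "equal"

def h1(m):
    """Predicted hierarchy: strictly increasing means."""
    return all(a < b for a, b in zip(m, m[1:]))

def h2(m):
    """'Collapse down': mu1 = mu2 < mu3 = mu4 < mu5."""
    return (abs(m[0] - m[1]) < TOL and abs(m[2] - m[3]) < TOL
            and m[1] < m[2] and m[3] < m[4])

def h3(m):
    """'Fold up': mu1 < mu2 = mu3 < mu4 = mu5."""
    return (m[0] < m[1] and abs(m[1] - m[2]) < TOL
            and m[2] < m[3] and abs(m[3] - m[4]) < TOL)
```

Stating the hypotheses this explicitly, before looking at the data, is what makes the subsequent analysis confirmatory rather than exploratory.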
Step 5. Open the Comparison of Means software and click on the Data button
to import the text file (see Figure 14.3). A Data Input screen will open. Click
on Browse Data File and navigate to the text file to import your data set into the
Comparison of Means software. Once the data are imported, the program will
automatically validate that your data set is in the expected format. Click OK.
The upper left corner of the Data Input screen will be updated to reflect the
number of groups and the number of observations (items) in each group, as seen
in Figure 14.4.
Step 6. Comparison of Means offers six methods for testing and comparing
means following Kuiper and Hoijtink (2010). These fall into two overall categories,
exploratory and confirmatory, with the methods in each involving hypothesis
testing, model selection, and Bayesian model selection (BMS). These methods allow us to examine each
language test and build an understanding of how a framework-driven hierarchy
of item difficulty is supported by the data. For the present case, we will use only
a confirmatory approach using BMS to investigate the relationship between the
intended levels and the actual outcomes.
The criterion for the confirmatory approach will be a posterior model prob-
ability (PMP) that favors H1, the ordered mean difficulty hypothesis in Table 14.3. The
data are expected to corroborate H1 as the most probable outcome.
TABLE 14.3 Comparison of Means software (exploratory and confirmatory tests) (Kuiper
and Hoijtink, 2010)
Step 8. Add the null hypothesis under “Specify models for confirmatory
methods.” As noted earlier, most Bayesian analyses do not entertain random out-
comes as a viable alternative to a specific hierarchy of means. The null hypoth-
esis is included here for didactic purposes. The first order-restricted hypothesis is
added with the Add button; a field will appear in the large text box on
the right. Our first hypothesis of interest is the preferred hypothesis, where mean
item difficulties are predicted to increase as the level increases. To represent these
relationships in the tool, we need to enter “1 < 2 < 3 < 4 < 5”. To enter the first
hypothesis of interest, type “1” in the first field and use the pull-down menu to
select <; enter “2” into the next field and continue entering the restrictions until
all are captured. See Figure 14.6 for an illustration of this process.
Once the first hypothesis is entered, clicking on the Add button allows each
subsequent hypothesis of interest to be entered in sequence. Each order-restricted
hypothesis can then be viewed by clicking on the H1, H2, and H3 tabs (see Fig-
ure 14.7). We are now ready to run the analysis.
Step 9. Once all hypotheses have been specified and ordered, Comparison of
Means is ready for execution with a press of the Run button (see Figure 14.8).
The Bayesian estimation approach utilizes a Markov-Chain Monte-Carlo
(MCMC) estimation of posterior probabilities based on the observed data. The
MCMC approach starts each variable from a random starting point drawn
from the sample distribution and generates each new value from the value of its
immediate predecessor, until the values of the variable eventually form a marginal
distribution independent of the original starting point (Lunn, Jackson, Best,
Thomas, & Spiegelhalter, 2012). In
the present analysis, the Markov Chain is instantiated through the Gibbs sampler,
which iteratively resamples from the conditional distribution of values derived
from prior estimations until, after many thousands of iterations, it arrives at a final
posterior distribution. The preferred hypotheses in the present study predict that
the data means will follow the ordering of mean item difficulties as specified in
H1. To the extent that any of the alternative models better fit the observed test
5.25 times more likely (42.8/8.14) than the "collapse down" alternative hypoth-
esis, H2. The PMP (see Figure 14.9) estimates H1 to be unambiguously the most
probable with a PMP of .82, given the data. The posterior model probability for
each hypothesis is the Bayes factor for that particular hypothesis divided by the
sum of the Bayes factors for all of the hypotheses tested. The Bayes factor and the
PMP estimates suggest that the ordering of mean difficulties predicted by the
item design specification and construction system is corroborated by the empiri-
cal data based on large samples of test takers.
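The arithmetic linking Bayes factors to posterior model probabilities is easy to verify. The 42.8 and 8.14 values are those reported above; the Bayes factors for the remaining hypotheses are hypothetical fillers added for illustration.

```python
# PMP_i = BF_i / (sum of Bayes factors over all hypotheses tested).
# H1 and H2 values are reported in the chapter; H3 and H0 are
# hypothetical fillers for illustration.
bayes_factors = {"H1": 42.8, "H2": 8.14, "H3": 1.0, "H0": 0.3}

total = sum(bayes_factors.values())
pmp = {h: bf / total for h, bf in bayes_factors.items()}
ratio = bayes_factors["H1"] / bayes_factors["H2"]  # H1 about 5.3 times H2
```

Because the PMPs are normalized Bayes factors, they sum to one across the hypothesis set, which is why each can be read directly as the probability of that hypothesis given the data and the candidate models.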
The Bayesian confirmatory approach provides a
useful diagnostic for identifying languages and modalities in the test development
framework that might be in need of further development and moderation.
Discussion
Confirmatory analyses were conducted to both illustrate the potential for an
ANOVA software program, Comparison of Means (Kuiper and Hoijtink, 2010),
Further Reading
Hoijtink, H., Klugkist, I., & Boelen, P. (2008). Bayesian evaluation of informative hypotheses.
New York: Springer.
Kruschke, J. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. New York:
Academic Press.
Discussion Questions
1. Discuss when a null hypothesis test for an ANOVA design is superfluous and
when a Bayesian approach might be more appropriate.
2. Discuss what makes a hypothesis “informative.”
3. Identify areas of applied linguistics that are amenable to ordered hypotheses.
4. Discuss how meta-analysis results can inform informative hypothesis testing.
5. Sketch out a study on corrective feedback on L2 writing (or another area of
L2 research) that predicts an ordering of different conditions or treatments.
How and why might the approach illustrated in this chapter be used to
examine such a set of predictions?
References
Hoijtink, H. (2012). Informative hypotheses. New York: Chapman & Hall/CRC.
Hoijtink, H., Klugkist, I., & Boelen, P. (2008). Bayesian evaluation of informative hypotheses.
New York: Springer.
Kruschke, J. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. New York:
Academic Press.
Kuiper, R. M., & Hoijtink, H. (2010). Comparisons of means using exploratory and confir-
matory approaches. Psychological Methods, 15(1), 69–86. doi:10.1037/a0018720
Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book:
A practical introduction to Bayesian analysis. London: Taylor & Francis.
Ntzoufras, I. (2009). Bayesian modeling using WinBUGS. New York: John Wiley & Sons.