Typeset in Bembo
by Apex CoVantage, LLC
1 Quantification 1
2 Introduction to SPSS 14
3 Descriptive Statistics 28
5 Correlational Analysis 60
7 T-Tests 92
Epilogue 246
References 250
Key Research Terms in Quantitative Methods 255
Index 263
ILLUSTRATIONS
Figures
2.1 New SPSS spreadsheet 16
2.2 SPSS Variable View 17
2.3 Type Column 18
2.4 Variable Type dialog 18
2.5 Label Column 18
2.6 Creating student and score variables for the Data View 19
2.7 Adding variables named ‘placement’ and ‘campus’ 19
2.8 The SPSS spreadsheet in Data View mode 19
2.9 Accessing Case Summaries in the SPSS menus 20
2.10 Summarize Cases dialog 21
2.11 SPSS output based on the variables set in the Summarize
Cases dialog 21
2.12 SPSS menu to open and import data 23
2.13 SPSS dialog to open a data file in SPSS 23
2.14 Illustrated example of an Excel data file to be imported into SPSS 24
2.15 SPSS dialog when opening an Excel data source 24
2.16 The personal factor questionnaire on demographic information 25
2.17 SPSS spreadsheet that shows the demographic data of
Phakiti et al. (2013) 25
2.18 The questionnaires and types of scales and descriptors in
Phakiti et al. (2013) 26
2.19 SPSS spreadsheet that shows questionnaire items of
Phakiti et al. (2013) 26
3.1 A pie chart based on gender 34
15.12 Reliability Analysis dialog for raters’ totals as selected variables 240
15.13 Reliability Analysis: Statistics dialog for intraclass correlation analysis 241
Tables
1.1 Examples of learners and their scores 4
1.2 An example of learners’ scores converted into percentages 4
1.3 How learners are rated and ranked 5
1.4 How learners are scored on the basis of performance descriptors 6
1.5 How learners are scored on a different set of performance descriptors 6
1.6 Nominal data and their numerical codes 8
1.7 Essay types chosen by students 8
1.8 The three placement levels taught at three different locations 9
1.9 The students’ test scores, placement levels, and campuses 9
1.10 The students’ placement levels and campuses 10
1.11 The students’ campuses 11
1.12 Downward transformation of scales 11
3.1 IDs, gender, self-rated proficiency, and test score of the first 50
participants 29
3.2 Frequency counts based on gender 31
3.3 Frequency counts based on test takers’ self-assessment of
their English proficiency 31
3.4 Frequency counts based on test takers’ test scores 32
3.5 Frequency counts based on test takers’ test score ranges 32
3.6 Test score ranges based on quartiles 33
3.7 Imaginary test taker sample with an outlier 36
4.1 SPSS output on the descriptive statistics 51
4.2 SPSS frequency table for gender 52
4.3 SPSS frequency table for the selfrate variable
(self-rating of proficiency) 52
4.4 Taxonomy of the questionnaire and Cronbach’s alpha (N = 51) 59
4.5 Example of item-level descriptive statistics (N = 51) 59
5.1 Descriptive statistics of the listening, grammar, vocabulary, and
reading scores (N = 50) 73
5.2 Pearson product moment correlation between the listening
scores and grammar scores 78
5.3 Spearman correlation between the listening scores and
grammar scores 78
6.1 Correlation between verb tenses and prepositions in a
grammar test 84
6.2 Explanations of the relationship between the sample size and the
effect 88
6.3 The null hypothesis versus alternative hypothesis 89
those scores across the sample, the results of which would be subject to one or
more statistical tests for subsequent interpretation. In each of these procedures, we
have made abstractions, tiny steps away from learner knowledge.
I realize these comments might make me appear skeptical of quantitative research.
Of course I am! Likewise, we should all approach the task of conducting, report-
ing, and understanding empirical research with a critical eye. And thankfully, that
is precisely what this very timely and well-crafted book will enable you to do,
thereby advancing our collective ability both to conduct and evaluate research.
The text, in my view, manages to balance on the one hand a conceptual grounding
that enlightens without overwhelming and, on the other, the need for a hands-
on tutorial—in other words, precisely the knowledge and skills needed to make
and justify your own decisions throughout the process of producing rigorous and
meaningful studies. I look forward to reading them!
Luke Plonsky
Georgetown University
PREFACE
Companion Website
A Companion Website hosted by the publisher houses online and up-to-date
materials such as exercises and activities: www.routledge.com/cw/roever
Comments/suggestions
The authors would be grateful to hear comments and suggestions regarding this
book. Please contact Carsten Roever at carsten@unimelb.edu.au or Aek Phakiti
at aek.phakiti@sydney.edu.au.
ACKNOWLEDGMENTS
In preparing and writing this book, we have benefitted greatly from the support of
many friends, colleagues, and students. First and foremost, we wish to acknowledge
Tim McNamara, whose brilliant pedagogical design of the course Quantitative
Methods in Language Studies at the University of Melbourne inspired us to write an
introductory statistical methods book that focuses on conceptual understanding
rather than mathematical intricacies. In addition, several colleagues, mentors, and
friends have helped us shape the book structure and content through invaluable
feedback and engaging discussion: Mike Baynham, Janette Bobis, Andrew Cohen,
Talia Isaacs, Antony Kunnan, Susy Macqueen, Lourdes Ortega, Brian Paltridge,
Luke Plonsky, Jim Purpura, and Jack Richards. We would like to thank Guy
Middleton for his exceptional work on editing the book chapter drafts. We also
greatly appreciate the feedback from Master of Arts (Applied Linguistics) students
at the University of Melbourne and Master of Education (TESOL) students at
the University of Sydney on an early draft. We would like to thank the staff at
Routledge for their assistance during this book project: Kathrene Binag, Rebecca
Novack, and the copy editors.
The support of our institutions and departments has allowed us time to con-
centrate on completing this book. The School of Languages and Linguistics at the
University of Melbourne supported Carsten with a sabbatical semester, which he
spent in the stimulating environment of the Teachers College, Columbia Uni-
versity. The Sydney School of Education and Social Work (formerly the Faculty
of Education and Social Work) supported Aek with a sabbatical semester at the
University of Bristol to complete this book project. Finally, Kevin Yang and Damir
Jambrek deserve our gratitude for their unflagging support while we worked on
this project over several years.
1
QUANTIFICATION
Introduction
Quantification is the use of numbers to represent facts about the world. It is used to
inform the decision-making process in countless situations. For example, a doctor
might prescribe some form of treatment if a patient’s blood pressure is too high.
Similarly, a university may accept the application of a student who has attained the
minimum required grades. In both these cases, numbers are used to inform decisions. Quantification is similarly used in L2 research, for example, when test scores, ratings, and questionnaire responses are turned into numbers for analysis.
Quantitative Research
Quantitative researchers aim to draw conclusions from their research that can be
generalized beyond the sample participants used in their research. To do this, they
must generate theories that describe and explain their research results. When a
theory is tested, specific, testable aspects of it are stated as hypotheses. This testing
process involves analyzing data collected from, for example, research participants
or databases. In language assessment research, researchers
may be interested in the interrelationships among test performances across various
language skills (e.g., reading, listening, speaking, and writing). Researchers may
hypothesize that there are positive relationships among these skills because there
are common linguistic aspects underlying each skill (e.g., vocabulary and syntac-
tic knowledge). To test this hypothesis, researchers may ask participants to take a
test for each of the skills. They may then perform statistical analysis to investigate
whether their hypothesis is supported by the collected data.
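Although this book carries out all analyses in SPSS, the logic of testing such a hypothesis can be sketched in a few lines of Python. The scores below are invented for illustration, and `pearson_r` is a hand-rolled helper, not data or code from any study discussed here.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented listening and grammar scores for five participants.
listening = [19, 16, 16, 11, 8]
grammar = [18, 17, 14, 12, 9]

# A coefficient near +1 is consistent with the hypothesized positive relationship.
print(round(pearson_r(listening, grammar), 2))  # 0.96
```

A real analysis would also report a significance test and an effect size interpretation; correlational analysis is taken up in Chapter 5.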
Issues in Quantification
For the results of a piece of quantitative research to be believable, a minimum number
of research participants is required, which will depend on the research question under
analysis, and, in particular, the expected effect size (to be discussed in Chapter 6).
In most cases, researchers need to use some type of instrument (e.g., a lan-
guage test, a rating scale, or a Likert-type scale questionnaire) to help them
quantify a construct that cannot be directly seen or observed (e.g., writing abil-
ity, reading skills, motivation, and anxiety). When researchers try to quantify
how well a student can write, it is not a matter of simply counting. Rather, it
involves the conversion of observations into numbers, for example, by applying a
scoring rubric that contains criteria which allow researchers to assign an overall
score to a piece of writing. That score then becomes the data used for further
analyses.
Measurement Scales
Different types of data contain different levels of information. These differences
are reflected in the concept of measurement scales. What is measured and how it is
measured determines the kind of data that results. Raw data may be interpreted
differently on different measurement scales. For example, suppose Heather and
Tom took the same language test. The results of the test may be interpreted in
different ways according to the measurement scale adopted. It may be said that
Heather got three more items correct than Tom, or that Heather performed better
than Tom. Alternatively, it may simply be said that their performances were not
identical. The amount of information in these statements about the relative abili-
ties of Heather and Tom is quite different and affects what kinds of conclusion can
be drawn about their abilities. The three statements about Heather and Tom relate
directly to the three types of quantitative data that are introduced in this chapter:
interval, ordinal, and nominal/categorical data.
TABLE 1.1 Examples of learners and their scores

Student    Score
Heather    19
Tom        16
Phil       16
Jack       11
Mary       8

TABLE 1.2 An example of learners' scores converted into percentages

Student    Score    Percentage
Heather    19       95%
Tom        16       80%
Phil       16       80%
Jack       11       55%
Mary       8        40%

Because the scores in Table 1.1 are interval data, it can be said that:
• Heather got more questions right than Tom, and also that she got three more
right than Tom did;
• Tom got twice as many questions right as the lowest scorer, Mary; and,
• the difference between Heather and Jack’s scores was the same as the differ-
ence between Tom and Mary’s scores, namely eight points in each case.
Interval data contain a large amount of detailed information and they tell us exactly
how large the interval is between individual learners’ scores. They therefore lend them-
selves to conversion to percentages. Table 1.2 shows the learners’ scores in percentages.
Percentages allow researchers to compare results from tests with different maxi-
mum scores (via a transformation to a common scale). For example, if the next
test consists of only 15 items, and Tom gets 11 of them right, his percentage score
will have declined (as 11 out of 15 is 73%), even though in both cases he got
four questions wrong. In addition to allowing conversion to percentages, interval
data can also be used for a wide range of statistical computations (e.g., calculating
means) and analyses.
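Because interval scores support arithmetic, the conversion to a common percentage scale is a single division. A minimal Python sketch, using the scores from Table 1.1 (whose maximum of 20 items is inferred from the percentages in Table 1.2) and Tom's hypothetical 15-item second test:

```python
def to_percentage(raw, maximum):
    """Convert a raw interval score to a percentage of the test maximum."""
    return 100 * raw / maximum

# Scores from Table 1.1, out of 20 items.
scores = {"Heather": 19, "Tom": 16, "Phil": 16, "Jack": 11, "Mary": 8}
percentages = {name: to_percentage(s, 20) for name, s in scores.items()}
print(percentages["Tom"])  # 80.0

# Tom's second test: 11 out of 15 items correct.
print(round(to_percentage(11, 15)))  # 73 -- four items wrong both times, yet a lower percentage

# Interval data also support further computations, such as the mean raw score.
print(sum(scores.values()) / len(scores))  # 14.0
```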
Typical real-world examples of interval data include age, annual income, weekly
expenditure, and the time it takes to run a marathon. In L2 research, interval data
include age, number of years learning the target language, and raw scores on lan-
guage tests. Scaled test scores on a language proficiency test, such as the Test of
English as a Foreign Language (TOEFL), International English Language Testing
System (IELTS), and Test of English for International Communication (TOEIC)
are also normally considered interval data.
Ordinal Data
For statistical purposes, ratio and interval data are normally considered desirable
because they are rich in information. Nonetheless, not all data can be classified as
interval data, and some data contain less precise information. Ordinal data contain
information about relative ranking but not about the precise size of a difference.
If the data in Tables 1.1 and 1.2 regarding students’ test scores were expressed as
ordinal data (i.e., they were on an ordinal scale of measurement), they would tell
the researchers that Heather performed better than Tom, but they would not indi-
cate by how much Heather outperformed Tom. Ordinal data are obtained when
participants are rated or ranked according to their test performances or levels of
some trait. For example, when language testers score learners’ written production
holistically using a scoring rubric that describes characteristics of performance,
they are assigning ratings to texts such as ‘excellent’, ‘good’, ‘adequate’, ‘support
needed’, or ‘major support needed’. Table 1.3 is an example of how the learners
discussed earlier are rated and ranked.
According to Table 1.3, it can be said that Heather was rated 'excellent' and ranked first, Tom and Phil were both rated 'good' and shared the second rank, Jack was rated 'adequate' and ranked third, and Mary, rated 'support needed', was ranked fourth.
While ordinal data contain useful information about the relative standings of
test takers, they do not show precisely how large the differences between test tak-
ers are. Phil and Tom performed better than Mary did, but it is unknown how
much better than her they performed. Consequently, with the data in Table 1.3,
it is impossible to see that Phil and Tom scored twice as high as Mary. Although
it could be said that Phil and Tom are two score levels above Mary, that is rather
vague.
Ordinal data can be used to put learners in order of ability, but they do little
beyond establishing that order. In other words, they do not give researchers as
much information about the extent of the differences between individual learn-
ers as interval data do. Ratings of students’ writing or speaking performance are
TABLE 1.3 How learners are rated and ranked

Student    Rating            Rank
Heather    Excellent         1
Tom        Good              2
Phil       Good              2
Jack       Adequate          3
Mary       Support Needed    4
often expressed numerically; however, that does not mean that they are interval
data. For example, numerical values can be assigned to descriptors as follows:
Excellent (5), Good (4), Adequate (3), Support Needed (2), and Major Support
Needed (1). Table 1.4 presents how the learners are rated on the basis of performance descriptors.
The numerical scores in Table 1.4 may look like interval data, but they are not.
They are only numbers that represent the descriptor, so it would not make sense
to say that Tom scored twice as high as Mary did. It makes sense to say only that
his score is two levels higher than Mary’s. This becomes even clearer if the rating
scales are changed as follows: Excellent (8), Good (6), Adequate (4), Support Needed
(2), and Major Support Needed (0). That would give the information in Table 1.5.
As can be seen in Tables 1.4 and 1.5, the descriptors do not change, but
the numerical scores do. Tom and Phil’s scores are still two levels higher than
Mary’s, but now their numerical scores are three times as high as Mary’s score.
This illustration makes it clear that numerical representations of descriptors are
only symbols that say nothing about the size of the intervals between adjacent
levels. They indicate that Heather is a better writer than Tom, but since they are
not based on counts, they cannot indicate precisely how much of a better writer
Heather is than Tom.
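The arbitrariness of such numeric codes can be checked directly: recoding the same descriptors, as in Tables 1.4 and 1.5, changes the apparent ratio between two learners' scores while leaving their ordering untouched. A short Python sketch:

```python
# Two numeric codings of the same ordinal descriptors (Tables 1.4 and 1.5).
coding_a = {"Excellent": 5, "Good": 4, "Adequate": 3, "Support Needed": 2}
coding_b = {"Excellent": 8, "Good": 6, "Adequate": 4, "Support Needed": 2}

tom, mary = "Good", "Support Needed"

# The ratio between Tom's and Mary's scores depends on the arbitrary coding...
print(coding_a[tom] / coding_a[mary])  # 2.0
print(coding_b[tom] / coding_b[mary])  # 3.0

# ...but the ordering does not: Tom is above Mary under either coding.
print(coding_a[tom] > coding_a[mary], coding_b[tom] > coding_b[mary])  # True True
```

Only the order survives recoding, which is exactly why ratio and interval interpretations of ordinal codes are unwarranted.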
In L2 research, rating scale data are an example of ordinal data. These are
commonly collected in relation to productive tasks (e.g., writing and speaking).
Whenever there are band levels, such as A1, A2, and B1, as in the Common European Framework of Reference for Languages (see Council of Europe, 2001), or bands
TABLE 1.4 How learners are scored on the basis of performance descriptors
Student    Descriptor        Score
Heather    Excellent         5
Tom        Good              4
Phil       Good              4
Jack       Adequate          3
Mary       Support Needed    2
TABLE 1.5 How learners are scored on a different set of performance descriptors
Student    Descriptor        Score
Heather    Excellent         8
Tom        Good              6
Phil       Good              6
Jack       Adequate          4
Mary       Support Needed    2
1–9, as in the IELTS, researchers are dealing with ordinal data, rather than interval
data. Data collected by putting learners into ordered categories, such as ‘beginner’,
‘intermediate’, or ‘advanced’ are another case of ordinal data. Finally, ordinal data
occur when researchers rank learners relative to each other. For example, researchers
may say that in reference to a particular feature, Heather is the best, Tom and Phil
share second place, Jack is behind them, and Mary is the weakest. This ranking indi-
cates only that the first learner is better (e.g., stronger, faster, more capable) than the
second learner, but not by how much. Ordinal data can only provide information
about the relative strengths of the test takers in regard to the feature in question. The
final data type often used in L2 research (i.e., nominal or categorical data) does not
contain information about the strengths of learners, but rather about their attributes.
B versus Form C). When a variable can only have two possible values (pass/fail,
international student/domestic student, correct/incorrect), this type of data
is sometimes called dichotomous data. For example, students may be asked to
complete a free writing task in which they are limited to three types of essays: personal
experience (coded 1), argumentative essay (coded 2), and description of a process
(coded 3). Table 1.7 shows which student chose which type.
The data in the Type column do not provide any information about one learner
being more capable than another. They show only which learners chose which essay
type, from which frequency counts can be made. That is, the process description
and personal experience types were chosen two times each, and the argumenta-
tive essay was chosen once. How nominal data are used in statistical analysis for
research purposes will be addressed in the next few chapters.
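Frequency counting, the one computation nominal data support, is easy to sketch in Python. The codes follow the text above; which particular student chose which essay type is an invented assignment chosen only to match the stated counts (the original table is not reproduced here):

```python
from collections import Counter

# Essay-type codes: 1 = personal experience, 2 = argumentative, 3 = process description.
# The student-to-choice assignment is illustrative; only the counts follow the text.
choices = {"Heather": 1, "Tom": 3, "Phil": 2, "Jack": 1, "Mary": 3}

counts = Counter(choices.values())
print(counts[1], counts[2], counts[3])  # 2 1 2

# Averaging nominal codes is meaningless: the 2.0 below says nothing about essays.
print(sum(choices.values()) / len(choices))  # 2.0
```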
TABLE 1.8 The three placement levels taught at three different locations
TABLE 1.9 The students’ test scores, placement levels, and campuses
Suppose a group of students take a placement test consisting of, say, 60 multiple-choice questions assessing their
listening, reading, and grammar skills. Based on the test scores, the students are
placed in one of three levels: beginner, intermediate, or advanced. In addition, the
three levels are taught at three different locations, as presented in Table 1.8.
Table 1.9 presents the scores and placements of the five students introduced earlier.
The test scores are measured on an interval measurement scale that is based on
the count of correct answers in the placement test and provides detailed informa-
tion. It can be said that:
• Heather’s score is in the advanced range since her score is 11 points above the
cut-off, and her score is much higher than Tom’s, whose score was 23 points
lower than hers;
• Tom’s score is in the intermediate range, but it is close to the cut-off for the
advanced range, missing it by just three points;
• Tom’s score is far higher than Phil’s, with a difference of 17 points, yet both
scores are in the intermediate range;
• Phil’s score is just one point above the cut-off for the intermediate level, and
is only four points higher than Jack’s score. Despite the small difference in
their scores, Jack was placed in the beginner level and Phil was placed in the
intermediate level; and,
• Mary’s score is in the middle of the beginner level.
Because the information is detailed, the placement test can be evaluated criti-
cally. For example, Phil and Tom’s scores are 17 points apart whereas Phil and
Jack’s are only four points apart. Phil’s proficiency level is arguably closer to Jack’s
than to Tom’s. Yet, Phil and Tom are both classified as intermediate, but Jack is
classified in the beginner level. This is known as the contiguity problem, and it is
common whenever cut-off points are set arbitrarily: students close to each other
but on opposite sides of a cut-off point can be more similar to each other than
they are to students further away but on the same side of it.
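The contiguity problem can be made concrete with a small classifier. The cut-off points and scores below are invented, chosen only so that Phil sits four points from Jack but seventeen from Tom, as the text describes:

```python
INTERMEDIATE_CUTOFF = 25  # invented cut-offs on a 60-item test
ADVANCED_CUTOFF = 46

def placement_level(score):
    """Classify an interval test score into an ordinal placement level."""
    if score >= ADVANCED_CUTOFF:
        return "advanced"
    if score >= INTERMEDIATE_CUTOFF:
        return "intermediate"
    return "beginner"

scores = {"Tom": 43, "Phil": 26, "Jack": 22}  # invented, matching the stated gaps
levels = {name: placement_level(s) for name, s in scores.items()}

print(levels["Phil"], levels["Tom"], levels["Jack"])  # intermediate intermediate beginner
print(scores["Tom"] - scores["Phil"], scores["Phil"] - scores["Jack"])  # 17 4
```

Phil lands on Tom's side of the cut-off despite being far closer to Jack, which is exactly the arbitrariness the text describes.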
Now imagine that there are no interval-level test-score data, but instead just the
ordinal-level placement levels data and the campus data, as in Table 1.10.
As can be seen in Table 1.10, the differences between Tom and Phil and the
problematic nature of the classification that were so apparent before are no longer
visible. The information about the size of the differences between learners has
been lost and all that can be deduced now is that some students are more profi-
cient than others. Tom and Phil have the same level of proficiency and Jack is
clearly different from both of them. This demonstrates why ordinal data are not as
precise as interval data. Information is lost, and the differences between the learn-
ers seen earlier are no longer as clear.
Highly informative interval data are often transformed into less informative
ordinal data to reduce the number of categories the data must be split into. No
language program can run with classes at 60 different proficiency levels; moreover,
some small differences are not meaningful, so it does not make sense to group
learners into such a large number of levels. However, setting the cut-off points is
often a problematic issue in practice.
While the ordinal proficiency level data are less informative than the interval
test-score data, they can be scaled down even further, namely to the nominal cam-
pus data (see Table 1.11).
If this is all that can be seen, it is impossible to know how campus assignment
is related to proficiency level. However, it can be said that:

• Tom and Phil study at the same campus;
• Mary and Jack study at the same campus; and,
• Heather studies at a different campus from all of the others.

This information does not indicate who is more proficient since nominal data
do not contain information about the size or direction of differences. They indicate
only whether differences exist or not.
Transformation of types of data can happen downwards only, rather than
upwards, in the sense that interval data can be transformed into ordinal data and
TABLE 1.11 The students' campuses
Student Campus
Tom Eastern
Mary City
Heather Ocean
Jack City
Phil Eastern
ordinal data can be transformed into nominal data (e.g., by using test scores to
place learners in classes based on proficiency levels and then by assigning classes to
campus locations). Table 1.12 illustrates the downward transformation of scales.
Transformation does not work the other way around. That is, if it is known
which campus a learner studies at, it is impossible to predict that learner’s profi-
ciency level. Similarly, if a learner’s proficiency level is known, it is impossible to
predict that learner’s exact test score.
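The downward transformation summarized in Table 1.12 can be written as a chain of mappings in Python. The scores and cut-offs are again invented; the level-to-campus assignment mirrors what Table 1.11 implies (City for beginners, Eastern for intermediates, Ocean for advanced learners):

```python
def to_level(score):
    """Interval -> ordinal: classify a score using invented cut-offs (25 and 46)."""
    if score >= 46:
        return "advanced"
    if score >= 25:
        return "intermediate"
    return "beginner"

# Ordinal -> nominal: the level-to-campus assignment implied by Table 1.11.
campus_of_level = {"beginner": "City", "intermediate": "Eastern", "advanced": "Ocean"}

scores = {"Heather": 57, "Tom": 43, "Phil": 26, "Jack": 22, "Mary": 12}  # invented
levels = {name: to_level(s) for name, s in scores.items()}
campuses = {name: campus_of_level[level] for name, level in levels.items()}

print(campuses["Tom"], campuses["Phil"])  # Eastern Eastern -- the 17-point gap is gone

# The score-to-level step is many-to-one, so exact scores cannot be recovered;
# and a reader who sees only campus labels cannot tell which level each one hides.
```

Seen from the campus end, the chain gives no way back: a reader told only 'Eastern' cannot recover the cut-offs crossed, let alone the exact score.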
Topics in L2 Research
It is useful to introduce some of the key topics in L2 research that can be examined
using a quantitative research methodology. Here, areas of research interests in SLA,
and language testing and assessment (LTA) research are presented.
SLA Research
There is a wide range of topics in SLA research that can be investigated using
quantitative methods, although the nature of SLA itself is qualitative. SLA research
aims to examine the nature of language learning and interlanguage processes (e.g.,
sequences of language acquisition; the order of morpheme acquisition; charac-
teristics of language errors and their sources; language use avoidance; cognitive
processes; and language accuracy, fluency, and complexity). SLA research also
aims to understand the factors that affect language learning and success. Such
factors may be internal or individual factors (e.g., age, first language or cross-
linguistic influences, language aptitude, motivation, anxiety, and self-regulation), or
external or social factors (e.g., language exposure and interactions, language and
A Sample Study
Khang (2014) will be used to further illustrate how L2 researchers apply the prin-
ciples of scales of measurement in their research. Khang (2014) investigated the
fluency of spoken English of 31 Korean English as a Foreign Language (EFL)
learners compared to that of 15 native English (L1) speakers. The research partici-
pants included high and low proficiency learners. Khang conducted a stimulated
recall study with a subset of this population (eight high proficiency learners and
nine low proficiency learners). This study exemplifies all three measurement scales.
The status of a learner as native or nonnative speaker of English was used as a
nominal variable. ‘Native’ was not in any way better or worse than ‘nonnative’; it
was just different. The only statistic applied to this variable was a frequency count
(15 native speakers and 31 nonnative speakers). Khang used this variable to estab-
lish groups for comparison. Proficiency level was used as an ordinal variable in
this study. High proficiency learners were assumed to have greater target language
competence than low proficiency learners had, but the degree of the difference
was not relevant. The researcher was interested only in comparing the issues that
high and low proficiency learners struggled with. Khang’s other measures were
interval variables (e.g., averaged syllable duration, number of corrections per min-
ute, and number of silent pauses per minute, which can all be precisely quantified).
Summary
It is essential that quantitative researchers consider the types of data and levels of
measurement that they use (i.e., the nature of the numbers used to measure the
variables). In this chapter, issues of quantification and measurement in L2 research,
particularly the types of data and scales associated with them, have been discussed.
The next chapter will turn to a practical concern: how to manage quantitative data
with the help of a statistical analysis program, namely the IBM Statistical Package
for the Social Sciences (SPSS). The concept of measurement scales will be revisited
through SPSS in the next chapter.
Review Exercises
To download review questions and SPSS exercises for this chapter, visit the Com-
panion Website: www.routledge.com/cw/roever.
References
Alderson, J. C. (2000). Assessing reading. Cambridge: Cambridge University Press.
American Psychological Association (APA). (2010). Publication manual of the American
Psychological Association (6th ed.). Washington, DC: American Psychological Association.
Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge: Cambridge
University Press.
Bachman, L. F., & Kunnan, A. J. (2005). Statistical analyses for language assessment
workbook and CD ROM. Cambridge: Cambridge University Press.
Bachman, L. F. , & Palmer, A. S. (2010). Language assessment in practice. Oxford: Oxford
University Press.
Bell, N. (2012). Comparing playful and nonplayful incidental attention to form. Language
Learning, 62(1), 236–265.
Blair, E., & Blair, J. (2015). Applied survey sampling. Thousand Oaks: Sage.
Brown, J. D. (2005). Testing in language programs. New York: McGraw Hill.
Brown, J. D. (2011). Likert items and scales of measurement. SHIKEN: JALT Testing &
Evaluation SIG Newsletter, 15(1), 10–14.
Brown, J. D. (2014). Classical theory reliability. In A. J. Kunnan (Ed.), Companion to
language assessment (pp. 1165–1181). Oxford, UK: John Wiley & Sons.
Carifio, J., & Perla, R. J. (2007). Ten common misunderstandings, misconceptions,
persistent myths and urban legends about Likert scales and Likert response formats and their
antidotes. Journal of Social Sciences, 3(3), 106–116.
Carr, N. (2011). Designing and analysing language tests. Oxford: Oxford University Press.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (Eds.). (2008). Building a validity argument
for the test of English as a foreign language. London: Routledge.
Cho, Y., & Bridgeman, B. (2012). Relationship of TOEFL iBT scores to academic
performance: Some evidence from American universities. Language Testing, 29(3), 421–442.
Clark, L. A., & Watson, D. B. (1995). Constructing validity: Basic issues in objective scale
development. Psychological Assessment, 7, 309–319.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Newbury Park, CA:
Sage.
Cook, R. D., & Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression.
Biometrika, 70(1), 1–10.
Coombe, C. A., Davidson, P., O'Sullivan, B., & Stoynoff, S. (Eds.). (2012). Cambridge
guide to second language assessment. Cambridge: Cambridge University Press.
Corder, G. W., & Foreman, D. I. (2009). Non-parametric statistics for non-statisticians.
Hoboken, NJ: John Wiley.
Council of Europe (2001). Common European framework of reference for languages:
Learning, teaching, assessment. Cambridge: Cambridge University Press.
Crossley, S. A., Cobb, T., & McNamara, D. S. (2013). Comparing count-based and band-
based indices of word frequency: Implications for active vocabulary research and
pedagogical applications. System, 41(4), 965–982.
Derwing, T. M., & Munro, M. J. (2013). The development of L2 oral language skills in two L1
groups: A 7-year study. Language Learning, 63(2), 163–185.
Di Silvio, F., Donovan, A., & Malone, M. E. (2014). The effect of study abroad homestay
placements: Participant perspectives and oral proficiency gains. Foreign Language Annals,
47(1), 168–188.
Doolan, S. M., & Miller, D. (2012). Generation 1.5 written error patterns: A comparative
study. Journal of Second Language Writing, 21(1), 1–22.
Dörnyei, Z., & Taguchi, T. (2010). Questionnaires in second language research. London:
Routledge.
Douglas, D. (2010). Understanding language testing. London: Hodder Education.
Eisenhauer, J. G. (2008). Degrees of freedom. Teaching Statistics, 30(3), 75–78.
Elder, C., Knoch, U., & Zhang, R. (2009). Diagnosing the support needs of second language
writers: Does the time allowance matter? TESOL Quarterly, 43(2), 351–360.
Ellis, R. (2015). Understanding second language acquisition. Oxford: Oxford University
Press.
Field, A. (2013). Discovering statistics using IBM SPSS statistics (3rd ed.). Los Angeles:
Sage.
Fulcher, G. (2010). Practical language testing. London: Hodder Education.
Furr, R. M. (2010). Yates correction. In N. J. Salkind (Ed.), Encyclopedia of research design
(Vol. 3, pp. 1645–1648). Los Angeles: Sage.
Fushino, K. (2010). Causal relationships between communication confidence, beliefs about
group work, and willingness to communicate in foreign language group work. TESOL
Quarterly, 44(4), 700–724.
Gass, S. M. with Behney, J., & Plonsky, L. (2013). Second language acquisition: An
introductory course (4th ed.). New York and London: Routledge.
Gass, S., Svetics, I., & Lemelin, S. (2003). Differential effects of attention. Language
Learning, 53(3), 497–545.
Green, A. (2014). Exploring language assessment and testing: Language in action. New
York: Routledge.
Greenhouse, S. (1990). Yates's correction for continuity and the analysis of 2 × 2 contingency
tables: Comment. Statistics in Medicine, 9(4), 371–372.
Guo, Y., & Roehrig, A. D. (2011). Roles of general versus second language (L2) knowledge
in L2 reading comprehension. Reading in a Foreign Language, 23(1), 42–64.
Haviland, M. G. (1990). Yates's correction for continuity and the analysis of 2 × 2 contingency
tables. Statistics in Medicine, 9(4), 363–367.
House, J. (1996). Developing pragmatic fluency in English as a foreign language: Routines
and metapragmatic awareness. Studies in Second Language Acquisition, 18(2), 225–252.
Hudson, T., & Llosa, L. (2015). Design issues and inference in experimental L2 research.
Language Learning, 65(S1), 76–96.
Huff, D. (1954). How to lie with statistics. New York: Norton.
252 Isaacs, T. , Trofimovich, P. , Yu, G. , & Munoz, B. M. (2015). Examining the linguistic
aspects of speech that most efficiently discriminate between upper levels of the revised
IELTS Pronunciation scale. IELTS Research Report, 4, 148.
Jamieson, S. (2004). Likert scales: How to (ab)use them. Medical Education, 38(12),
12121218.
Jia, F., Gottardo, A., Koh, P. W., Chen, X., & Pasquarella, A. (2014). The role of
acculturation in reading a second language: Its relation to English literacy skills in immigrant
Chinese adolescents. Reading Research Quarterly, 49(2), 251–261.
Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp.
17–64). Westport, CT: Greenwood Publishing.
Keith, Z. K. (2003). Validity of automated essay scoring systems. In M. D. Shermis & J.
Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 147–167).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Khang, J. (2014). Exploring utterance and cognitive fluency of L1 and L2 English speakers:
Temporal measures and stimulated recall. Language Learning, 64(4), 809–854.
Ko, M. H. (2012). Glossing and second language vocabulary learning. TESOL Quarterly,
46(1), 56–79.
Kormos, J., & Trebits, A. (2012). The role of task complexity, modality and aptitude in
narrative task performance. Language Learning, 62(2), 439–472.
Kunnan, A. J. (Ed.). (2014). The companion to language assessment. Oxford, UK: John
Wiley & Sons.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A
practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
Larson-Hall, J. (2010). A guide to doing statistics in second language research using SPSS.
New York: Routledge.
Larson-Hall, J. (2016). A guide to doing research in second language acquisition with SPSS
and R (2nd ed.). New York: Routledge.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A
corpus analysis of learners' English. Language Learning, 61(2), 647–672.
Laufer, L., & Rozovski-Roitblat, L. (2011). Incidental vocabulary acquisition: The effects of
task type, word occurrence and their combination. Language Teaching Research, 15(4),
391–411.
Lee, C. H., & Kalyuga, S. (2011). Effectiveness of different pinyin presentation formats in
learning Chinese characters: A cognitive load perspective. Language Learning, 61(4),
1099–1118.
Lightbown, P. M. , & Spada, N. (2013). How languages are learned (4th ed.). Oxford: Oxford
University Press.
Liu, D. (2011). The most frequently used English phrasal verbs in American and British
English: A multicorpus examination. TESOL Quarterly, 45(4), 661–688.
Macaro, E. (2010). Continuum companion to second language acquisition. London:
Continuum.
Mackey, A., & Gass, S. M. (2015). Second language research: Methodology and design (2nd
ed.). London: Routledge.
Mantel, N. (1990). Yates's correction for continuity and the analysis of 2 × 2 contingency
tables: Comment. Statistics in Medicine, 9(4), 369–370.
Matsumoto, M. (2011). Second language learners' motivation and their perception of their
teachers as an affecting factor. New Zealand Studies in Applied Linguistics, 17(2), 37–52.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp.
13–103). New York: Macmillan.
Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance.
Journal of Abnormal Psychology, 110(1), 40–48.
Mora, J. C., & Valls-Ferrer, M. (2012). Oral fluency, accuracy, and complexity in formal
instruction and study abroad learning contexts. TESOL Quarterly, 46(4), 610–641.
Norris, J. M., Ross, S. J., & Schoonen, R. (Eds.). (2015). Improving and extending
quantitative reasoning in second language research. Language Learning, 65(S1), v–vi, 1–264.
Ockey, G. J., Koyama, D., Setoguchi, E., & Sun, A. (2015). The extent to which TOEFL iBT
speaking scores are associated with performance on oral ability components for Japanese
university students. Language Testing, 32(1), 39–62.
Ortega, L. (2009). Understanding second language acquisition. London: Hodder.
Paltridge, B., & Phakiti, A. (Eds.). (2015). Research methods in applied linguistics: A
practical resource. London: Bloomsbury.
Pawlak, M., & Aronin, L. (2014). Essential topics in applied linguistics and multilingualism:
Studies in honor of David Singleton. New York, NY: Springer.
Phakiti, A. (2006). Modeling cognitive and metacognitive strategies and their relationships to
EFL reading test performance. Melbourne Papers in Language Testing, 1(1), 53–96.
Phakiti, A. (2014). Experimental research methods in language learning. London:
Bloomsbury.
Phakiti, A., Hirsh, D., & Woodrow, L. (2013). It's not only English: Effects of other individual
factors on English language learning and academic learning of ESL international students in
Australia. Journal of Research in International Education, 12(3), 239–258.
doi:10.1177/1475240913513520
Phakiti, A., & Li, L. (2011). General academic difficulties and reading and writing difficulties
among Asian ESL postgraduate students in TESOL at an Australian university. RELC
Journal, 42(3), 227–264.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting
practices in quantitative L2 research. Studies in Second Language Acquisition, 35(4),
655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodological
synthesis and call for reform. The Modern Language Journal, 98(1), 450–470.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes:
The case of interaction research. Language Learning, 61(2), 325–366.
Plonsky, L., & Oswald, F. L. (2014). How big is big? Interpreting effect sizes in L2 research.
Language Learning, 64(4), 878–912.
Purpura, J. E. (2011). Quantitative research methods in assessment and testing. In E. Hinkel
(Ed.), Handbook of research in second language teaching and learning (Vol. 2, pp. 731–751).
London: Routledge.
Purpura, J. E. (2016). Second and foreign language assessment. The Modern Language
Journal, 100(S), 190–208.
Qian, D. (2002). Investigating the relationship between vocabulary knowledge and academic
reading performance: An assessment perspective. Language Learning, 52(3), 513–536.
Read, J. (2000). Assessing vocabulary. Cambridge: Cambridge University Press.
Read, J. (2015). Researching language testing and assessment. In B. Paltridge & A. Phakiti
(Eds.), Research methods in applied linguistics: A practical resource (pp. 471–486). London:
Bloomsbury.
Roever, C. (1995). Routine formulae in acquiring English as a foreign language. Unpublished
raw data.
Roever, C. (2005). Testing ESL pragmatics. Frankfurt: Peter Lang.
Roever, C. (2006). Validation of a web-based test of ESL pragmalinguistics. Language
Testing, 23(2), 229–256.
Roever, C. (2012). What learners get for free: Learning of routine formulae in ESL and EFL
environments. ELT Journal, 66(1), 10–21.
Rutherford, A. (2011). ANOVA and ANCOVA: A GLM approach. Oxford: John Wiley & Sons.
Scheaffer, R. L., Mendenhall, W., Ott, R. L., & Gerow, K. G. (2012). Elementary survey
sampling. Boston: Brooks/Cole.
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental
designs for generalized causal inference. Boston: Houghton Mifflin.
Shintani, N., Ellis, R., & Suzuki, W. (2014). Effects of written feedback and revision on
learners' accuracy in using two English grammatical structures. Language Learning, 64(1),
103–131.
Stevens, J. P. (2012). Applied multivariate statistics for the social sciences (5th ed.). New
York: Routledge.
Tabachnick, B., & Fidell, L. (2012). Using multivariate statistics. Boston: Pearson.
Taguchi, N. , & Roever, C. (2017). Second language pragmatics. Oxford: Oxford University
Press.
Weir, C. J. (2003). Language testing and validation: An evidence-based approach. New York,
NY: Macmillan.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of
automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.
Yang, Y., Buckendahl, C. W., Juszkewicz, P. J., & Bhola, D. S. (2002). A review of
strategies for validating computer-automated scoring. Applied Measurement in Education,
15(4), 391–412.