Original Manuscript

Young learners' response processes when taking computerized tasks for speaking assessment

Shinhye Lee and Paula Winke
Michigan State University, USA

Language Testing, 1–31
© The Author(s) 2017
DOI: 10.1177/0265532217704009
journals.sagepub.com/home/ltj

Corresponding author:
Paula Winke, Second Language Studies Program, Michigan State University, B252 Wells Hall, 619 Red Cedar Road, East Lansing, MI, USA.
Email: winke@msu.edu

Abstract
We investigated how young language learners process their responses on and perceive a
computer-mediated, timed speaking test. Twenty 8-, 9-, and 10-year-old non-native English-
speaking children (NNSs) and eight same-aged, native English-speaking children (NSs) completed
seven computerized sample TOEFL® Primary™ speaking test tasks. We investigated the children’s
attentional foci on different test components (e.g., prompts, pictures, and a countdown timer)
by means of their eye movements. We associated the children’s eye-movement indices (visit
counts and fixation durations) with spoken performance. The children provided qualitative data
(interviews; picture-drawings) on their test experiences as well. Results indicated a clear contrast
between NNSs and NSs in terms of speech production (large score differences) as expected.
More interestingly, the groups’ eye-movement patterns differed. NNSs tended to fixate longer
on, and look more frequently at, the countdown timer than their NS peers, who were more
likely to look at content features, that is, onscreen pictures meant to help with building up speech.
Specifically, the NNSs’ fixations on timers were likely to co-occur with hesitation phenomena
(e.g., hemming; pausing; silence). We discuss (a) the potential effects of test-specific features on
children’s performance and (b) child-appropriate test accommodations and practices.

Keywords
Computer-based assessment, eye-tracking, methodology, response processes, speaking
assessment, speech production, validity evidence, young language learners

Alongside the global spread of English and the need to equip adults to be successful
international users of English, it has become important also to equip young English
language learners (ELLs)1 for success in international contexts (McKay, 2006).

Accordingly, there is a new demand for proficiency tests specifically for children as a
means of benchmarking their English proficiency at international levels (Chik & Besser,
2011; McKay, 2006). A number of private companies have correspondingly developed
such large-scale measures (e.g., the Cambridge Young Learners English Test [YLE], the
Young Learners Test of English [YLTE], and the TOEFL Primary and Junior). As
expected, such assessments have quickly gained popularity in non-English-speaking
countries (Carless & Lam, 2014).
However, there is a lack of research on how these tests affect the children who take
them. Are the test scores appropriate and useful? Is learning helped by test preparation or
the test scores? How do the children themselves feel about the tests? Methods to research
the impact of language testing on children have been developed (Carless & Lam, 2014;
Chik & Besser, 2011). These methods must be applied for a better understanding of how,
precisely, young ELLs view and interact with recently developed English language tests.
Such work is needed because, as Butler and Zeng (2014) have stressed, there is a relative
dearth of research on children’s reactions to language assessments. Children are complex
test takers (McKay, 2006), so researchers need to do more than just look at the children’s
test scores. In-depth documentations of testing effects are needed; for example, research
that explores child test takers’ affective responses and cognitive processing (Winke, Lee,
Ahn, Choi, Cui, & Yoon, 2015). In this paper, we investigate children’s test-taking pro-
cesses and test-related perspectives on a computerized speaking test. We draw on multi-
ple data sources, including child test takers’ opinions (through interviews), feelings
(through drawings), and attention (through eye-tracking metrics). By doing so, we hope
to contribute to the larger discussion of child language assessment practices.

ELLs at risk
One reason why researchers need to understand how children behave on and interact
with large-scale English language assessments is so that the scores from them can be
fully trusted and seen as valid. Researchers must do this because the targeted population,
young ELLs, is particularly vulnerable in testing situations (Flanery, 1990). Even when
tests are psychometrically sound, researchers need to be sure they are psychologically
safe. As opposed to their non-ELL peers, ELL students may additionally experience an
enhanced sense of anxiety during test taking (Rotenberg, 2002). Research studies on
state- or nation-wide assessments for children have demonstrated that the tests may
affect some young children psychologically and in negative ways. That is, when children
do poorly, the tests may affect their overall levels of academic effort, efficacy, and anxi-
ety (Hall, Collins, Benjamin, Nind, & Sheehy, 2004; Harlen, 2006; Hodge, McCormick,
& Elliott, 1997; Reay & Wiliam, 1999; Rotenberg, 2002). In some cases, children’s neg-
ative experiences during test-taking may cause them to stop attempting to do well during
the test (Smith & Ellsworth, 1987; Winke, 2011). Such response processes compromise
test reliability and score interpretation, and hence validity.
With these warnings, researchers have recently begun to investigate children’s reflec-
tions on evaluative measures (Carless & Lam, 2014) and on commercially available
language assessments (e.g., Chik & Besser, 2011; Winke et al., 2015). One question is
whether the ELLs experience too much test-taking anxiety. In general, ELL students are
likely to experience varying degrees of foreign language anxiety (Horwitz, 2010), which
arises from multiple sources, but which is understood as a type of “communicative stress
[that] negatively impacts language performance” (Rotenberg, 2002, p. 523; see also
Hodge et al., 1997; McNeil & Valenzuela, 2000). Foreign language anxiety can come in
part from a shortfall in the target language (i.e., not having enough of the language to do
well on the task at hand). But test-taking anxiety is different from foreign language anxi-
ety (see Hewitt & Stephenson, 2012; Rotenberg, 2002); test-taking anxiety can come
from multiple sources, including the test itself. And it can be debilitating during the
exam, especially when students are of a low-ability relative to the exam (as found by
Hewitt and Stephenson, who investigated adult learners of Spanish in a face-to-face oral
exam).
Some research in this area, with child ELLs, has been conducted. Using a survey
taken by second-grade ELLs (nine Hispanics and one Vietnamese) and native English-
speaking children (n = 12), Rotenberg (2002) investigated the relationship between lan-
guage proficiency and different types of anxiety (e.g., general test anxiety and foreign
language anxiety). She found that the majority of the less proficient ELL children rated
themselves higher on test and foreign language anxiety relative to their native English-
speaking peers. In a similar vein, Hodge et al. (1997) found that the language back-
ground (English speaking versus non-English speaking) of students is a statistically
significant factor in predicting the level of emotional distress manifested among students
taking the Higher School Certificate examination in Australia. This prior research under-
scores the vulnerable nature of ELLs in testing contexts. They can be anxious and
stressed for various reasons, which may exacerbate any difficulties they experience dur-
ing the testing process.
The stress and anxiety for ELLs during testing can come from multiple sources, as
mentioned above, so it is important for researchers to try to identify the sources and tease
them apart. Unrelated to the test format, ELLs may experience anxiety during testing that
comes from outside pressure (from their parents; from society at large). Research has
shown that parents and society often affect how ELLs, especially in English-as-a-foreign-
language (EFL) contexts, take tests and what they do during tests (Carless & Lam, 2014;
Chik & Besser, 2011). Chik and Besser (2011) documented the extremely high-stakes
nature of the Young Learners English Test (YLE) (www.cambridgeenglish.org/exams/young-learners-english) in Hong Kong. Young learners participated in interviews
with the researchers, and the children implied that they had engaged in intensive test
preparation because the test is important. It serves as an essential pathway to enter pres-
tigious secondary English-medium schools. Accordingly, the children stated that they
were under peer and parental pressure to achieve high scores. Similarly, Carless and Lam
(2014) investigated the perceptions of 115 elementary school children in Hong Kong on
school-based exams through focus group interviews. They asked 76 children to draw
pictures demonstrating their feelings toward the tests they recently took. Overall, the
children expressed more negative than positive responses to assessment practices. Most
significantly, out of a total of 207 incidences where parents were mentioned or drawn by
the children, 51% of the responses revealed that the children were anxious because they
feared their parents’ disapproval. On the other hand, research has shown that a supportive
environment with abundant target language input reduces the anxiety that ELLs feel
during tests (Dewaele, Petrides, & Furnham, 2008). The question is, do specific ELL
tests have comprehensible test directions (test directions that are easily understood by
ELLs, meaning that they are simple in terms of language and not culturally bound) and
linguistically supportive test features such that the child test takers feel comfortable and
confident enough to do their best during testing despite outside pressures?

Assessment for young ELLs: An early stage


Part of the reason that researchers and pedagogues do not know much about young ELLs’
test-taking processes or their feelings about ELL tests is that the tests themselves are still
relatively new, and as such, the contexts surrounding them are under-researched.
Previously, most child measures were modified versions of adult instruments (Edelbrock,
1984). Test developers are moving away from that model (see Flanery, 1990 for more
discussions on the topic). Still, newer ELL tests should be checked to be sure that the
procedures are age-appropriate and do not contain features (i.e., complex or vague
instructions, long test lengths, timed conditions) that may be difficult for young learners
to process or handle (McKay, 2006; Menken, 2008). For instance, most child language
tests have restrictions on seeking assistance during testing, and this can be stressful and
striking to young children, as they are typically given explicit instructions in classrooms
to seek help and ask questions (McKay, 2006). Additionally, unfamiliar test mechanics
(e.g., prompts, instructions, picture cues) may arouse confusion in children, to the extent
that the child’s test performance may be affected (Hill & Wigfield, 1984). The rule of
thumb should be that the test should mimic normal learning and daily task procedures
undertaken in class. In other words, it is important that the test “assesses candidates on
the same range and types of cognitive operations as those required of students in the
target programme” (Bax, 2013, p. 442). The test should require little to no specific test-
taking strategies or format-knowledge; and if it does, then short but adequate test prepa-
ration should be included as part of the entire testing plan (with little negative washback
on the curriculum) so that the test-taking knowledge needed is intrinsically embedded. It
should be automatically and effortlessly accessible during testing.
Investigations into tests of young ELLs have been conducted before. For example,
McNeil and Valenzuela (2001) conducted longitudinal ethnographic research to document
the impact of the Texas Assessment of Academic Skills (TAAS) on high school Mexican
ELL immigrants and Mexican Americans. From a series of student interviews, the authors
reported that ELL immigrants were unable to perform their best because they had difficul-
ties in reading and understanding the written test instructions. Winke et al. (2015) con-
ducted a small-scale, comparative investigation into ELL and non-ELL children’s
performances and perspectives on the Young Learners Test of English (YLTE) (www.cambridgemichigan.org/institutions/products-services/tests/proficiency-certification/ylte). The authors wanted to understand whether test-taking difficulties were truly associ-
ated with the children’s language proficiency, as one would expect. They found that some
directions caused both young ELLs and non-ELLs confusion, aligning with previous
assertions that unfamiliar test mechanics (e.g., prompts, instructions) may negatively
impact test-taking experiences for all children (Hill & Wigfield, 1984; McKay, 2006). For
instance, several children (native English speakers and ELLs) stated they were unfamiliar
with using word banks, like ones that appeared on the test, and they also suggested that the
directions did not clearly tell them what to do. On the test, when the directions were not
clear, some children used words not in the word bank that were either approximations of, or in some cases arguably better than, the accepted answers. This demonstrated potential issues
with score interpretation, especially if ELLs (who are the targeted population for the test)
behaved like the native speakers in giving correct but un-scorable (technically incorrect)
answers. Because Winke et al.'s (2015) study included native English-speaking children, the
authors were able to identify when wrong responses stemmed from cognitive processing
unrelated to a lack of proficiency in the target language (English); thus we believe that
further inspections of young ELL tests should include native English-speaking children.
Native-speaking children of the language being assessed can help researchers investigate
validity evidence based on response processes.

ELL tests online


Further potential issues in child L2 assessments pertain to the assessments’ rapid (and
inevitable) move to computerized platforms. In general, computerized testing demands
computer skills as well as content knowledge (Pitkin & Vispoel, 2001). This could be
problematic with young children, as they often perform more inaccurately and slower on
computerized tasks in comparison to the same tasks on paper owing to juvenile motor
skills and short visual-attention spans for onscreen features (Bosse & Valdois, 2009;
Donker & Reitsma, 2007). To accommodate these issues, visual and aural support is often
built into computerized assessments for children (McKay, 2006). However, even adult test
takers express difficulties in taking computer-mediated tests (Joo, 2007); thus the support
must be carefully inspected to be sure it works for young children, especially when the
children are in particularly novel testing environments (McKay, 2006). Although the dis-
cussion on child-specific L2 tests is still emerging, there is no comprehensive account, as
far as we know, of whether online test features positively contribute to test takers’ cogni-
tive processing during test-taking, especially when young ELLs are concerned.

Cognitive processing during computerized speaking tasks


To investigate how online test features influence test takers’ processes during test taking,
it is necessary to theorize the type of cognitive processes required of young language
learners when they perform speaking-test tasks. We address this issue because the chil-
dren’s skill of focus in this study is speaking, and in particular speaking within a stand-
ardized assessment context. We are interested in the children’s processing in that
situation. To define the cognitive operation, we turn to psycholinguistic perspectives on
speech processing during speaking tasks.
As described by Segalowitz and Trofimovich (2012, p. 182), language, and speaking
in particular, involves volitional and social dimensions (as, the authors noted, was out-
lined in Wundt’s 1912 psychological theories), and a large amount of contextual knowl-
edge. First, speakers must have volition; that is, “when speakers use a language, they
behave as active agents” (p. 182). This, as Segalowitz and Trofimovich wrote, has con-
sequences for processing. The speaker wants to persuade the listener of something, and
“this means that the processing underlying speech output includes the processing under-
lying the formation of communicative intentions” (p. 182). In a computerized testing
situation, this means that the test taker needs to be able to imagine the interlocutor and
read into his or her expectations so that the test taker can take the correct agent-position
and formulate the appropriate speech.
In a computerized test, the social dimension is normally an imagined or artificially
set-up realm, and test takers must understand that context. Segalowitz and Trofimovich
noted that “the social dimension of communication has processing implications for
speakers, especially in the L2” (p. 184). They suggested (p. 184), by referring to Wray’s
(2002) work, that normally “speakers try to help each other by minimizing the process-
ing loads they place on each other. They can do this, for example, by using formulaic
expressions and partially fixed strings.” They can also “provide clues as to their place of
origin, background, etc.” (p. 184). In a computerized test, this type of processing load
reduction is not offered by face-to-face interlocutors, as the test’s interlocutors are imag-
ined and do not respond. Rather, the test directions and prompt provide the volition (why
one needs to speak) and the social dimension (to whom one must speak), and may try to
control (reduce or, perhaps as often done in speaking tests, push) the processing load by
requiring certain expressions or words or by requiring certain types of output: short utter-
ances (to elicit lower-level speech) or long paragraphs of speech (to elicit higher-level
speech). Thus, understanding the directions (and the testing context) is crucial in a com-
puterized speaking test, as understanding the directions sets up and supports the speech
act. Without understanding the directions, the processing procedures that necessarily
underlie speech production may not be able to take place (may fail or be misconstrued),
even if the capability for the right type of speech is there.
Segalowitz and Trofimovich also stressed that “speakers must understand the contexts
in which the L2 is used" (p. 184). They identified two general environmental contexts that shape speech processing: closed skill contexts and open skill contexts. In
a closed skill context, “variability in the conditions under which performance takes place
has negligible impact on performance, and … the goal of performance is to repeat some
action (physical or cognitive) as precisely as possible to meet some standard” (p. 185).
In a computerized speaking test, examples of closed contexts are listing one’s hobbies,
stating what one does on a typical day, or describing what happened in a series of pic-
tures. These are also called presentational speaking tasks by ACTFL (2015). Segalowitz
and Trofimovich wrote that an open skill context (what ACTFL would call an interper-
sonal communication task) is “where there is a great deal of variability in the conditions
under which performance takes place and where dealing with this variability is fully part
and parcel of skilled performance” (p. 185). In an open task, there are many “unantici-
pated interruptions and distractions from the environment,” but in a closed task, there are
none. “Open skills, in contrast to closed skills, carry processing demands that draw on
attention, given how important it is to notice and then respond quickly to unexpected
changes in the environment” (p. 185), Segalowitz and Trofimovich wrote. Thus, in a
closed computerized test of speaking (and note that not all computerized speaking tests
are closed: some, for example, have human interlocutors or avatars that mediate the
online conversation), the cognitive operation is threefold and involves the following
expectations, which are outlined in Table 1.

Table 1. Cognitive operations involved in closed online (computerized) speaking tests.

1. Understanding of the volition of the task and the (imagined) social dimension in which the task is to be performed.
Test taker's process: Test takers must understand the test directions and prompt to understand the volition of the speaking task (why one needs to speak) and the social dimension (to whom one must speak).
Rater's assumption: Raters may assume test takers understand the test directions and prompt, and that they know why they need to speak and to whom they must speak.

2. Understanding of the speaking context (closed skill task).
Test taker's process: Test takers need to understand that the speaking tasks are closed; that they will not receive feedback during the speaking performance. This entails knowing when to speak, how much to speak, and not to stop until the test taker's intentions have been conveyed.
Rater's assumption: Raters understand that the task is closed and that there is no variability in the conditions under which performance takes place. Raters believe test takers understand the task parameters (time constraints, etc.). Thus, each performance can be directly compared against the standards or criteria.

3. Performance of the speech act.
Test taker's process: Test takers must have linguistic knowledge and the ability to access it to actively speak after (or while) performing the first two cognitive operations. Speakers incorporate their intentions in speaking.
Rater's assumption: Raters may assume that if a test taker does not speak or does not speak correctly (or about the correct things), he or she does not have the linguistic knowledge or ability to do so, resulting in a low score.

Thus, scores from a closed speaking test will represent how much or to what
extent the test takers were able to undertake the processing required by the individual
items. When a computerized speaking test is closed, the test takers need to under-
stand the speaking context fully. In other words, they must undertake the processing
necessary in order to formulate volition and operate within the social dimension that
was set up for them (and in which they are to respond). They need to know the
parameters of the task (when to start speaking, when to stop speaking, and how much
to speak in between) and understand that they will not be interrupted, nor will they
receive feedback. Test takers should, at the end, actively perform the speech act (i.e.,
talk), and scores assigned by raters will represent how well the test takers were able
to perform the intended cognitive operations, which were predetermined by the test
developers.

ELLs can take tests while their eyes are tracked


Segalowitz and Trofimovich wrote in their 2012 paper that it is a challenge to test the
theoretical frameworks of L2 processing, but that it is even more challenging for “L2
researchers to identify appropriate methodologies for studying complex interactions
among these dimensions” (p. 187). In this paper we try to do this, but within the context
of child speaking performance on a closed computerized speaking test. This is a substan-
tially unexplored area, but some micro-level examinations into children’s test-taking
behaviors and response processes have been conducted (Carless & Lam, 2014). More
research in this area would help applied linguists to gain a better understanding of how
test features, children’s processing, their proficiency, and their test scores are related.
This line of work would involve introspection (involving children in interviews, stimu-
lated recalls, focus group sessions, or think-alouds). Yet other online measures can also
be used, such as eye tracking (as suggested by the AERA, APA, NCME 2014 Standards
for Educational and Psychological Testing, p. 15).
As opposed to conventional introspective methods, eye tracking can tap into lower-
level, unconscious processing in a non-intrusive way (e.g., Brunfaut & McCray,
2015). Eye movements are good indicators of one’s ongoing cognitive processes,
especially during cognitively complex language tasks (Spivey, Richardson, & Dale,
2009) such as reading (Rayner, 2009; Reichle, Warren, & McConnell, 2009) and
speaking (Griffin & Oppenheimer, 2006). This is because there is an eye–mind link
(Rayner, 2009), meaning that there is a strong association between what one is think-
ing and at what one looks (which can be quantified as for how long one looks and also
how many times one looks). For instance, when speakers gaze at a specific object for a prolonged time, they activate its semantic and conceptual representation, which eventually helps them name it accurately (referred to as the content hypothesis; Griffin, 2001; Meyer, Sleiderink, & Levelt, 1998). Interference with such continued processing (e.g., viewing other objects), however, triggers gaze aversion from the object; this is likely to lead to speakers experiencing difficulty in word retrieval (Van der Meulen, Meyer, & Levelt, 2001).
Recently, researchers have used eye trackers to investigate issues in language assess-
ment (Bax, 2013; McCray & Brunfaut, 2016; Suvorov, 2015; Winke & Lim, 2014, 2015).
Bax used eye tracking to explore how adults process reading test items; he found dif-
ferentiated reading strategies between proficient and less proficient adult ELLs. Also
with adult ELLs, Suvorov investigated how test takers process two types of visuals dur-
ing listening tests (one with just the speaker shown, and one with the speaker plus con-
tent imagery of, for example, graphs or pictures related to the talk shown behind the
speaker). Similarly, Winke and Lim (2014) applied eye-tracking methodologies to dem-
onstrate a correlation between adult ELLs’ anxiety and time spent reading test directions.
In another study, the same authors (Winke & Lim, 2015) used eye-tracking to investigate
how raters of English essays used various sections of an analytic rubric. And McCray
and Brunfaut used an eye tracker to investigate how adult ELLs process gap-filled items
on a reading proficiency test.
As far as we know, eye trackers have not been used to investigate how young
ELLs process their speech (and behave) during speaking tests. This method of data
collection may be fruitful, especially because young ELLs are often shy in inter-
views or reluctant to speak when talking to adults they do not know. Tracking their
eye movements is a non-intrusive way of investigating their test-taking response
processes.

Research questions
In this study, we use two introspective methods (interviews and picture-drawings) and
eye-tracking methods to investigate how children process language during (and behave
on) an ELL speaking test. In particular, we are investigating the children’s response pro-
cesses. This is a type of validation study as outlined in the AERA, APA, NCME 2014
Standards for Educational and Psychological Testing: We conduct this study to accumu-
late one type of “evidence to provide a sound scientific basis for the proposed score
interpretations” (p. 11). As described in the 2014 Standards, validation begins with an
explicit statement of the proposed interpretation of test scores. In the test that we inves-
tigate, test scores are supposed to represent the child's speaking skills alone (see www.ets.org/s/toefl_primary/pdf/understand_speaking_score_reports.pdf). Thus, there should
be evidence that the scores represent speaking skills alone. We assume that reading
needed to perform any underlying cognitive processing (to understand the volition,
social context, or test context) should not be represented in the scores. We conduct this
research because, as outlined in the Standards: “As validation proceeds, and new evi-
dence about the meaning of a test’s scores becomes available, revisions may be needed
to the test, in the conceptual framework that shapes it, and even in the construct underly-
ing the test” (p. 12, emphasis added). Evidence based on response processes comes from
the analysis of individual items (which we investigated in this study). Methods recom-
mended by the Standards to do this type of validation work include the following: (a)
questioning test takers about their performance strategies or responses to particular
items; (b) documenting performance through eye movements; and (c) documenting
wide, individual differences in test-taking processes. Such information can, according to
the Standards, lead to a reconsideration of certain test formats (see pp. 14 and 15 of the
Standards). As written in the Standards, “Process studies involving examinees from dif-
ferent subgroups can assist in determining the extent to which capabilities irrelevant or
ancillary to the construct may be differentially influencing test takers’ test performance”
(p. 15). Our research design will allow us to make such a determination.
We aim to reveal how young ELLs perform under a certain closed speaking test con-
text; that is, online and time-pressured, and with content visuals that are to help guide test
takers’ production (and help reduce the test takers’ processing load). In such a test, the
test directions and prompts provide the communication goal and social conditions. The
test takers are to understand those and use them to form their speech intentionality. We
follow methods undertaken by Winke et al. (2015) and include both child English lan-
guage learners (ELLs) and their same-aged, native English-speaking peers to determine
whether differing behaviors and processing relate to (a) variations in proficiency, or (b)
variations in general child-cognitive development. The assumption is that native English-
speaking children will do well overall (because they are proficient in English), but their
variation in age-related cognitive development may contribute to their test-score varia-
tion. By having native English-speaking children identified (by their parents, in this case)
as average or above average (academically prepared) readers and their same-aged,
English language learner peers participate in this study, we hope to see which parts in the
test directions and prompts may pose processing difficulties, and then investigate whether
the difficulties stem from a lack of age-appropriate reading skills, or if they stem from
issues in the directions or prompts. Furthermore, we triangulate the test data and the eye-
movement data with additional offline measures: retrospective interviews and picture
drawings. We expect the multiple perspectives to better our understanding of online oral-
skills tests for young ELLs. The following research questions and hypotheses guided our
study:

1. To what extent are young non-native English-speaking children (NNS) aged 8, 9,
and 10 differentiated from same-aged native English-speaking children (NS) in
terms of the quality of their oral performance while taking a computerized oral
test? The hypothesis is that the NS children will outperform the NNS, but we do
not know exactly how (in what ways) the quality of their oral performance will
differ on the test.
2. To what extent are young NNSs differentiated from NSs in terms of their response
processes, as indicated through their eye movements while taking a computerized
oral test? We hypothesize no difference.
3. What are these children’s emotional reactions toward the computerized oral test?
Based on prior research, we hypothesize that the NNSs will be less positive about
the computerized testing, but we do not know how much less positive they will
feel.

Method
Participants
Initially, a total of 31 children participated in this study. Three children (two NS children
and one NNS child) were excluded from all final analyses owing to low accuracy in
terms of eye movement data (NS participants 4 and 9) and discontinuance of testing
(NNS participant 21). In the end, a total of 28 children aged 8, 9, and 10 years remained
in the final analyses (16 boys and 12 girls). They were all in second through fifth grade
in public schools in the United States. This specific age and grade range corresponded to
the test’s targeted population. Twenty NNS children (eight 8-year-olds, six 9-year-olds,
and six 10-year-olds) had diverse L1 backgrounds (seven Koreans, seven Chinese, four
Egyptians, and two Vietnamese). With the exception of one child (Participant 5; 5 years
of residence), the NNS children were mostly recent arrivals in the United States, with the
average length of residence being 11.4 months (ranging from 3 months to 3 years). The
children had been learning English for an average of 30 months (ranging from 6 months
to 5 years) with 13 children already having early exposure to English prior to coming to
the United States. For comparative purposes, eight same-aged NS children participated
(four 8-year-olds, one 9-year-old, and three 10-year-olds).

Materials
Background questionnaire. We asked parents to complete a background questionnaire on
their child’s experiences in language learning, standardized testing, and computer usage.

Figure 1. Screenshot of the test screen.


Copyright © 2013 Educational Testing Service. Used with permission.

Speaking test. For this study we used seven sample oral tasks from the TOEFL® Primary™ (available at https://toeflprimary.caltesting.org/sampleQuestions/TOEFLPrimary/index.html). Because we wanted the test to appear real (and not have the word "sample"
in it), we moved the seven sample test questions into a new computerized test format that
mirrored the screen layout and flow of the original sample test. The first author of this
paper used Camtasia (www.techsmith.com) to video-record each task from the sample
test website. She then uploaded those seven video-recordings to Tobii Studio (www.tobiipro.com), the eye-tracking experiment design software we used. In the recreated
test, as in the sample test, each oral-response question appeared after the child listened to
a part of a story. The test tasks involved describing pictures (Tasks 1, 3, and 4), explain-
ing a series of events (Tasks 2 and 5), and asking questions (Tasks 6 and 7). The response
time varied from 10 to 20 seconds for the simpler tasks (Tasks 1, 3, 4, and 6), and 30
seconds for the others (Tasks 2, 5, and 7). As in the sample test, the children had to press
a play button to start the test, and then they could work through the test at their own pace.
As seen in Figure 1, the test screen in this study was recreated to be identical to the origi-
nal sample test with the same test prompts, visual stimuli, and response time indicator (a
countdown timer). As in the sample test, this screen format was consistent across all
tasks (see Figures 1 and 2).

Drawings about attitudes. To tap into the children’s attitudes toward the speaking test, we
used a draw-a-picture task, as described by Carless and Lam (2014) and also used by
Winke et al. (2015). The additional purpose of implementing this task was to complement quantified results (Barraza, 1999). As this is reportedly not an easy task for some children (Winke et al., 2015), we told the children that they could provide written descriptions (in the L1) if they wished or talk while they drew.

Figure 2. Screenshot of the test screen designated with areas of interest.
Copyright © 2013 Educational Testing Service. Used with permission.

Post-hoc interviews. Children provided their thoughts (in either English or in their L1) on the
test through one-on-one interviews conducted by the first author. When the first author did
not speak the native language of the child, a secondary, trained researcher (a graduate stu-
dent at the same university) who spoke the native language of the child assisted. We pro-
vided each child with a paper test booklet (with the same material they had seen onscreen),
which served as a stimulus for prompting their responses during the interview.

Computer-skills assessment. Each child took a 5-minute, computer-skills assessment
(adapted from Hyun, 2005), which assessed the child’s ability to complete 11 tasks, such
as typing his or her name and navigating to a website. Through an item analysis we
learned that the majority of the children had basic computer abilities (item facility at
93%; overall test discrimination index = .04). None of the children were unfamiliar with
basic computer navigation.

The apparatus. We used a Tobii TX300 (23-inch-wide screen), an eye-tracking platform with a sampling rate of 300 Hz (further technical information is
available at www.tobii.com). It records binocular eye movements through a hidden camera
in the monitor. For this to work with children, we recorded eye movements without head
stabilization. One could stabilize the head with a Tobii TX300 through the use of an exter-
nal head and forehead rest, and that would reduce the amount of noise and error in the data.
But for this research with large areas of interest (to be described below) and with our focus
on creating a safe environment for the children, we felt a non-head-stabilized environment
was essential; it would be more natural, as described by Bax (2013), even though accuracy
in recording would be slightly reduced over a head-stabilized design.

Procedure
We met with each child and his or her parent in the eye-tracking lab at a large Midwestern
university. Upon signing the consent forms, the parent answered the questionnaire while
the child completed the test. To obtain optimal tracking conditions, we instructed the
child on his or her posture and adjusted the monitor’s height so the eye-tracking cameras
were at the child’s eye level. The child sat on a non-adjustable, non-swivel chair (essen-
tial) to reduce movement2. Then, the child went through a calibration process and
watched an instructional video clip covering test procedures and onscreen features (e.g.,
prompt, timer), followed by another clip of three practice questions. After the testing
session, the child drew a picture about his or her feelings about the test, participated in
the interview session, and performed the computer skills test. The entire session lasted
approximately 1 hour. At the very end, as a thank you, the child picked out an educational
toy (valued at 10 USD) from a treasure chest, and the parent received a 50-USD gift
certificate to a large nearby store.

Analysis
First, after a one-hour rubric training and calibration session with the first author, we had
two trained, experienced raters3 score the children’s tests. The two raters, who were
graduate students at our university, used the TOEFL® Primary™ holistic rubric provided
by Educational Testing Service. Using the rubric, they scored five 3-point-scale items
and two 5-point-scale items. The rubric descriptors specifically addressed three elements: language use (grammar and word choice), content (complete response), and delivery (rate and fluidity of speech). There was a moderate to high agreement
between the raters across the tasks,4 excluding Task 1 on which the two raters had rela-
tively low agreement (Cohen’s Kappa κ = .40; range: .40–.75). When there were discrep-
ancies, the first author (as a third rater) came in to resolve them. We used SPSS (version
22) to calculate descriptive statistics on the results.
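To make the agreement statistic above concrete, the minimal sketch below computes Cohen's kappa for two raters with scikit-learn's cohen_kappa_score. The ratings are invented for illustration only; they are not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores assigned by the two raters to ten children on one
# 3-point task; illustrative values only, not the study's actual ratings.
rater_1 = [3, 2, 3, 1, 2, 3, 3, 2, 1, 3]
rater_2 = [3, 2, 3, 2, 2, 3, 3, 2, 1, 2]

kappa = cohen_kappa_score(rater_1, rater_2)
print(round(kappa, 2))  # chance-corrected agreement between the two raters
```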
Additionally, we analyzed each child’s responses in regard to fluency. We measured
fluency in two ways. First, upon transcribing all responses, we counted the number and
length of silent pauses (i.e., non-verbal pauses) and filled pauses (i.e., uhs or ums) by inspecting
the audio files’ waveforms and spectrograms in Audacity (version 2.1.0). Silent pauses
were silences equivalent to or longer than 250 milliseconds (De Jong, Groenhout,
Schoonen, & Hulstijn, 2013). We marked pauses in parentheses in the transcript; we
coded self-correction or repetitions in brackets. A double colon (::) marked clause-level
boundaries (Kahng, 2014). An example of a final transcript with coded temporal informa-
tion is as follows:

Um (476) chairs are on the bus (580) :: and (720) um there’s giraffe not a driver (1033) :: and
(279) uh (290) there’s (464) an apple (987) <apples> instead of wheels ::

Second, we analyzed speed fluency by calculating the articulation rate of responses.
First, we obtained the total number of words and syllables by using the online software,
Syllable Counter (www.syllablecount.com). Then, we divided the number of syllables in
each utterance by the duration of the utterance, removing all pausing instances (De Jong
et al., 2013).
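To illustrate how these fluency measures can be derived, the sketch below parses a transcript coded in the notation described above (pause lengths in parentheses, repairs in angle brackets, "::" for clause boundaries) and computes an articulation rate. The syllable count and response duration used here are hypothetical placeholders, not figures from the study, which relied on syllablecount.com and the audio files.

```python
import re

# A transcript coded as in the example above.
transcript = ("Um (476) chairs are on the bus (580) :: and (720) um there's giraffe "
              "not a driver (1033) :: and (279) uh (290) there's (464) an apple (987) "
              "<apples> instead of wheels ::")

# Pauses of 250 ms or longer count as silent pauses (De Jong et al., 2013).
pause_ms = [int(p) for p in re.findall(r"\((\d+)\)", transcript)]
silent_pauses = [p for p in pause_ms if p >= 250]

# Filled pauses are hesitation markers such as "uh" and "um".
filled_pauses = re.findall(r"\b(?:uh|um)\b", transcript, flags=re.IGNORECASE)

# Articulation rate = syllables spoken / speaking time with pauses removed.
total_syllables = 24       # hypothetical syllable count for this response
response_time_s = 14.0     # hypothetical total response duration (seconds)
speaking_time_s = response_time_s - sum(silent_pauses) / 1000.0
articulation_rate = total_syllables / speaking_time_s

print(len(silent_pauses), len(filled_pauses), round(articulation_rate, 2))
```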
To analyze the eye-tracking metrics, we first defined separate locations on the screen in
which particular text or images appeared as areas of interest (AOIs). That is, areas (such as
those in which directions are given) on the screen can be defined. For these areas, research-
ers can obtain eye-gazing metrics (e.g., how long each person looked at the directions for a certain question) using Tobii Studio (Ver. 3.2), the eye-tracking analysis software. In this
study, we designated the separate AOIs as the (a) test prompts, (b) pictures, (c) lexical cues,
and (d) timer (see Figure 2). To segment the analysis, we time-stamped the AOIs so that
they were differentially activated during the children’s response times (when the system
microphone was on to record speech). Subsequently, we computed two different metrics of
eye-movement data within each AOI: (a) the total fixation duration, which indicates the
total amount of time (in seconds) a child looked at an AOI and (b) the number of eye visits
(i.e., visit counts), which reveals how many separate glances the child gave to each AOI. A
visit refers to a child’s gaze within an AOI that ends when the same child looked at some-
thing outside of the AOI (Holmqvist et al., 2011). We adopted these two measures to quantify how long and how often the children looked at a specific test feature. We calculated descriptive statistics across groups for comparison.
To supplement the quantitative results, we qualitatively explored two types of visual
summaries of the children’s eye-gazing patterns: heatmaps, which represent the density
of overall fixations, and scanpaths (i.e., eye movement patterns) on the test screen that
enabled us to track the exact sequences and positions of fixations during performance.
We also analyzed the children’s interview responses and their drawings. We first tran-
scribed all interview response data. Following Winke et al. (2015), we categorized the inter-
view data according to the children’s overall view (“positive” or “negative”) and labeled
those codes (when possible) in relation to (a) the Task (1–7), (b) the test features (delineated
by AOI), and the test mode (recording, not-recording). Finally, following Carless and Lam
(2014), we coded each child’s drawing as “positive,” “negative,” or “neutral.”

Results
The quantitative analysis reported in this section is based on a total of 28 children’s test
performances.

Oral performance differences across NS and NNS groups


As can be seen in Table 2, the NS children obtained higher scores across all test tasks,
on average, than the NNS children did, as one would expect.

Table 2. Descriptive statistics for native speakers (NS) and non-native speakers (NNS) task
and total performance on the speaking test.
Group Task 1 Task 2 Task 3 Task 4 Task 5 Task 6 Task 7 Total

3 points 5 points 3 points 3 points 5 points 3 points 3 points 28 points

NS (N = 8) 2.94 (0.17) 4.88 (0.33) 3 (0) 3 (0) 5 (0) 3 (0) 2.56 (1.00) 24.31 (1.57)
NNS (N = 20) 2.43 (0.82) 3.00 (1.71) 2.30 (0.70) 2.48 (0.98) 3.10 (1.30) 2.18 (1.28) 2.18 (0.85) 17.25 (7.09)

Note: Standard deviations (SD) are in parentheses. The second row gives the maximum point value that can be obtained by
a test taker on the respective task and the whole test (total).

The score differences between the two groups were not substantial on the test tasks scored on a 3-point scale
(Tasks 1, 3, 4, 6, and 7). NNS children’s scores on these tasks fell along the 2-point
range, which was particularly similar to NS children’s scores on Tasks 1 (M = 2.94,
SD = 0.17) and 7 (M = 2.56, SD = 1.00). These tasks similarly elicited simple lan-
guage structures (i.e., clause or sentence-level responses) that conveyed basic com-
municative functions (e.g., asking questions). In a sense, one could view these tasks
as too easy for the NSs in particular, but for the NNS children as well. There were
score differences on Tasks 2 and 5, which were rated on a 5-point scale. Although the
NS children consistently demonstrated (as one would expect) a ceiling effect on Tasks
2 and 5, the NNS children’s scores fell along a 3-point range, and fell approximately
2 points below the NS children’s scores. Among the 20 NNS children, three received
the maximum score on Task 2 (Participants 22, 23, and 31), whereas only one NNS
child did so on Task 5 (Participant 23). We believe these two tasks were difficult for many participants to some extent, arguably because they prompted more attention to
detail than the other tasks did.
Because the test takers’ results were more varied (larger ranges of scores and larger
standard deviations) on Tasks 2 and 5, which had the larger range of possible scores (i.e.,
1 to 5 points), we focused on these tasks for further analyses. Only when there is vari-
ation in the dependent variable is the variation in the independent variable interesting:
The question then is to see whether the two variances are in any way related. Table 3 presents the mean utterance measures for the two groups. The NS children took more time to respond while speaking more and at a
higher articulation rate, with an opposite pattern for the NNS group. Additionally, the
NNS children’s speech contained longer silent pauses and had more of what we coded
as repetitions and repairs.
We further compared the test takers’ responses on Tasks 2 and 5 in terms of content
quality. The key requirement of Task 2 (sequencing actions based on four pictures) and
Task 5 (describing what happened in a video) was to provide a complete narration of a
series of actions in a coherent manner (see Figure 2). Task 5 required the children to
describe the events shown previously in the video while incorporating three lexical cues
in their responses.
As in Excerpt 1, NS children (seven out of eight) commonly mentioned all major
events in the visual stimuli, with detailed description.

Table 3. Descriptive statistics of the children’s speech samples.

Speech characteristics NS children (N = 8) NNS children (N = 20)

Task 2 Task 5 Task 2 Task 5


Number of syllables 40.88 (27.99) 55 (13.48) 21.33 (12.32) 29.05 (14.61)
Number of filled pauses 0.38 (0.52) 0.25 (0.46) 0.86 (1.62) 0.90 (1.18)
Number of silent pauses 7 (5.10) 8.38 (3.20) 7.05 (5.44) 8.14 (5.70)
Number of repairs 0.25 (0.71) 0.50 (0.76) 1.35 (1.46) 1.85 (1.08)
Mean length of silent pauses (sec) 4.81 (4.04) 4.71 (2.08) 5.44 (3.67) 5.57 (3.88)
Mean length of response time (sec) 17.14 (7.49) 24.14 (3.13) 12.95 (7.85) 19.87 (9.92)
Articulation rate (syllables/sec) 2.42 (1.16) 2.68 (0.29) 1.76 (0.70) 2.01 (1.59)

Note: Standard deviations are in parentheses.

Excerpt 1 (Task 5, Participant 4, Native English speaker, Male, Age 9)

Billy first took it :: took the key up the hook :: and then he pulled the branch on the tree (372)
:: The branch which was fake :: and (267) led him to a ladder coming out of a tree :: and he
climbed up the ladder (313) :: put the (232) <hid a key> in there :: and climbed back down (255)
:: He then put (546) <pulled> down a yell <a white and yellow flower> near the tree which
caused (267) the branch and the ladder go back in its original position.

On the other hand, as in Excerpt 2, a lower-scoring NNS child gave a more generic
and incomplete description of the four scenes in Task 2, accompanied by irrelevant infor-
mation. While the majority of the NNS children did not have difficulty mentioning the
action depicted in the very first picture in Task 2, six children did not consistently include
in their responses the events shown in the later pictures. An apparent lack of rapid word
retrieval resulted in filled pauses or vague pronouns. Grammatical errors also limited
task achievement.

Excerpt 2 (Task 2, Participant 16, Native Chinese speaker, Female, Age 9)

First you need take food :: go your hand :: and open the door (755) :: take bird no <food> in the
(1068) um (4900) in the (314) in the (906) um (1591) there :: and (313) close the door :: bird
can go eat food.

The overall results regarding the fluency quality of the two groups’ responses for
Tasks 2 and 5 appeared to align with the score descriptors presented in the rubric. As one
would expect, the NNS children failed to achieve the same degree of fluidity in speech
as their NS peers, with the NNS children’s responses showcasing, on average, more
instances of hesitations and incomplete descriptions.

Differences in response processes across NS and NNS groups


To answer research question 2, we investigated how the observed differences in
the responses between the two groups could be accounted for by the children's eye-movement patterns (which represented their response processes). As before, we focused on data collected from Tasks 2 and 5.

Table 4. Eye-movement indices within the areas of interest on Task 2.

AOI Total fixation time (sec) Visit count

NS (N = 8) NNS (N = 20) NS (N = 8) NNS (N = 20)

M Med. M Med. M Med. M Med.


1. Pictures 8.16 (0.63) 7 6.64 (0.50) 6 13.16 (1.32) 12 10.79 (0.77) 10
2. Prompt 0.67 (0.89) 0.30 0.50 (0.87) 0.02 0.13 (0.36) 0 0.20 (0.64) 0
3. Timer 2.29 (3.38) 0.72 5.05 (3.16) 4.97 1.95 (1.28) 2 6.30 (4.04) 6

Note: M = means; Med. = medians. Results for the AOI pictures are averaged values of the added fixation
times and visit counts of all four pictures. Standard deviations are in parentheses.

In Table 4 are the total fixation times and visit counts on the three AOIs (pictures,
prompts, timer) that appeared on the test screen for Task 2. Both groups of children
fixated the longest on the pictures and the shortest on the prompt during the 30-second
response time. This seems reasonable, as the pictures were necessary for constructing
(processing the message of) responses, while prompts were visible before the timer
began. Attention to the timer was noticeably different across the two groups; the NNS
children fixated more than twice as long on the timer (M = 5.05, SD = 3.16) as the NS children did (M = 2.29, SD = 3.38). Furthermore, unlike the NS children, the NNS children
made more frequent visits to the timer (they looked at it more often) when they were
providing responses.
Similar patterns were observed in Task 5 (see Table 5), with both groups fixating
longest on the pictures (to process responses) and shortest on the prompt (to under-
stand volition and the social context). However, the NNS children fixated longer on
the three lexical cues (M = 4.93; SD = 1.85) and on the timer (M = 6.00; SD = 4.93)
than the NS children did. In other words, NS children primarily focused on the visual
image stimuli while responding; NNS children tended to pay attention to different test
features, especially the timing device (which sets the test developers' proposed speaking parameters and aligns with their intended task processing load). The heatmaps in Figure 3 present visual evidence of these group-level
differences, with the NNSs showing great intensity of focus on the timer (indicated
with darker shading).
Because the main eye-tracking difference between the two groups across tasks lay in fixations on the timer, we wanted to investigate this further. We subsequently examined qualitative scanpaths to track the specific speech
sections that aligned with timer fixations. Particularly, we inspected the red (dark-shaded)
fixation points in the scanpath recordings while simultaneously playing the synchronous
audio-recorded responses. The fixations tended to occur in the middle of NNS children’s
utterances. Thus, the long fixation durations on the timer were particularly associated
with features pertaining to speech disfluency (see Table 6). For instance, the NNS chil-
dren fixated on the timer while repairing their speech, or while producing filled and
short silent pauses before word retrieval (e.g., Participants 11 and 16).

Table 5. Eye-movement indices within the areas of interest on Task 5.

AOI Total fixation time (sec) Visit count

NS (N = 8) NNS (N = 20) NS (N = 8) NNS (N = 20)

M Med. M Med. M Med. M Med.


1. Video 9.75 (0.69) 9.20 8.26 (1.15) 8.14 12.27 (1.42) 11.05 11.09 (1.84) 9.5
2. Cues 0.53 (0.22) 0 4.93 (1.85) 3.76 1.35 (0.43) 0.41 5.35 (2.57) 4
3. Prompt 0.67 (0.89) 0.32 0.56 (1.12) 0.12 1.13 (1.36) 1 1.15 (1.69) 1
4. Timer 2.05 (2.56) 1.17 6.00 (4.93) 4.69 2.25 (2.38) 2 7.10 (3.96) 6.5

Note: M = means; Med. = medians. Results for the AOI cues are averaged values of the added fixation times
and visit counts of all three lexical cues. Standard deviations are in parentheses.

NS children demonstrated different gaze patterns when hesitating or repairing their responses: They
tended to focus their eyes on the objects in the pictures (tools meant to help with the
processing load) when retrieving words. This is depicted in Figure 4 and Excerpts 3 and
4. Features in bold in both excerpts were produced at the specific moment of fixation, as
displayed in Figure 4.

Excerpt 3 (Task 5, Participant 18, Native Chinese speaker, Male, age 9)

…uh (279) he (861) he’s got the (674) bach (932) barnch (583) in the [sees the timer] (1328)
in the in the the the laddes (1328) …

Excerpt 4 (Task 5, Participant 3, Native English speaker, Male, age 10)

…he took the key out (836) :: and then (.) he pulled the [sees the branch in the video] the
lowest (.) no (.) longest branch on the right side of the tree (697) …

On other occasions, the NNS children’s fixations accompanied longer silent pauses,
which may indicate being lost for words; in such cases, the children silently gazed at the
timer for a long period of time (e.g., Participant 20).
An additional pattern for the NNS children was the general occurrence of fixations on
the timer at local or broad sentential boundaries in speech; for instance, when uttering
connecting devices (e.g., Participant 15) or when producing silent pauses in between
clause-level units (e.g., Participant 22). This tendency within the NNS group was differ-
ent from the performance of the NS children on Task 2. In Task 2, four consecutive pic-
tures were displayed (see Figure 4). Interestingly, a majority of the NS children (N = 7)
tended to fixate on the following (next) picture at the end of their description of the
preceding picture, a pattern known to reflect a type of speech planning (Griffin, 2004).
Yet while the NNSs seemed to follow this trend as well, some made excessive fixations
on the timer before looking at or describing the subsequent pictures. This is displayed in
Figure 5 and Excerpts 5 and 6. Bolded and underlined features were produced when the
children saw the timer and the pictures, respectively.

Figure 3. Heatmaps based on means within each group: NS (top) and NNS (bottom) children
for Task 5.
Copyright © 2013 Educational Testing Service. Used with permission.

Excerpt 5 (Task 2, Participant 21, Native Arabic speaker, Female, age 8)

[sees the first picture] First get the bird (528) food (978) :: [sees the second picture] open the
door to the bird cage (760) :: [sees the third picture] put it on (351) <put> the bird food the
ground :: [sees the fourth picture] and let (316) <leave> so the birds can eat.

Table 6. Parts in NNS children’s speech aligning with fixation points on the timer.

Categories Examples
1. With filled pauses Participant 6:
First you need take food go your hand and open the door (755)
take food in the (1068) um (4900)
2. Between clauses Participant 5:
First you have to bring (615) the food then open the cage door
(790)
3. Long pauses Participant 12:
Put (1161) some food (732) at the ground (1068)
Participant 20:
Key (2000) in the tree [no response]
4. Repetitions and repairs Participant 11:
put it on (351) put the bird food the ground
5. Lexical retrieval (with, Participant 1:
after NP) I put some (348) food in the (394) bird cage
6. End of the utterance Participant 4:
um (1893) put down food bird eat it.

Note: Bolded features indicate the presence of fixation on the timer. Pauses are marked by numbers in
parentheses. The numbers give the lengths (in milliseconds) of the pauses.

Excerpt 6 (Task 2, Participant 5, Native English speaker, Female, age 8)

First of all (3639) [sees first picture] you take a small bag (302) :: and (438) put some bird food
in it [sees the second picture] then open the door [sees the third picture] to the cage (314) ::
[sees the fourth picture] put some (319) bird food on the bottom of the cage (903).

Overall, the results indicate that the NNS children devoted their attention not only to
the pictures designed to aid their responses (and to reduce their cognitive processing
load), but also to the onscreen test features (that were not meant to invoke speech varia-
tion). NNS children fixated more frequently and longer on the timing device, while NS
children focused more on helpful test-content features.
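
For readers unfamiliar with the eye-movement indices reported in Table 5 (and purely as an illustration, not a description of the authors' actual data processing), the two measures can be thought of as simple aggregations over an exported fixation log: total fixation time sums fixation durations within an area of interest (AOI), and visit count tallies runs of consecutive fixations falling within the same AOI. A minimal Python sketch, assuming a hypothetical chronological export of (AOI label, duration) pairs:

# Minimal sketch (assumptions, not the authors' code): aggregate a per-fixation
# export into the two Table 5 indices, total fixation time and visit count per AOI.
# A "visit" is counted here as a run of consecutive fixations within the same AOI.
from itertools import groupby

# Hypothetical export: one (aoi, duration_in_seconds) pair per fixation, in order.
fixation_log = [("video", 0.31), ("video", 0.42), ("timer", 0.65),
                ("video", 0.28), ("timer", 0.40), ("timer", 0.22)]

total_time = {}
visit_count = {}
for aoi, run in groupby(fixation_log, key=lambda record: record[0]):
    durations = [duration for _, duration in run]
    total_time[aoi] = total_time.get(aoi, 0.0) + sum(durations)
    visit_count[aoi] = visit_count.get(aoi, 0) + 1

print({aoi: round(seconds, 2) for aoi, seconds in total_time.items()})  # {'video': 1.01, 'timer': 1.27}
print(visit_count)                                                      # {'video': 2, 'timer': 2}

Definitions of a visit differ somewhat across eye-tracking packages, so the run-based count above is only one plausible operationalization.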

The children’s reactions to the computerized speaking test


The children expressed their reactions to the test through picture-drawing and interviews.
As detailed in Figure 6, we classified the children’s general impressions (drawings) of
the test into three categories: positive, negative, and neutral.
Out of a total of 28 drawings and written descriptions, we coded five drawings (18%)
as neutral, nine drawings (32%) as positive, and 14 drawings (50%) as negative.
Specifically, while the majority of the NS children’s drawings indicated general enjoy-
ment (Participants 3, 4, 5, 6, 7, and 10), the NNS children’s responses were more nega-
tive. Among the 12 that demonstrated negative feelings, two expressed nervousness and
embarrassment (Participants 12 and 26) and four indicated the pressure of producing

Figure 4. Screenshot of the test screens of Task 2 (top) and Task 5 (bottom).
Copyright © 2013 Educational Testing Service. Used with permission.

accurate or well-formulated spoken responses (Participants 15, 22, 24, and 30). Six NNS
children specifically reflected on the time restrictions by drawing the timer or by having
a clock in the pictures. Although the NNS children generally gave short answers when

Figure 5. Different eye-gazing patterns between NNS and NS children when retrieving words
(top: Participant 18, NS; bottom: Participant 7, NNS).
Copyright © 2013 Educational Testing Service. Used with permission.

they described their picture or when they described the test during the picture-drawing
time (see Table 4), four participants said they felt rushed to respond during the test
(Participants 15, 23, 28, and 29) while two NNS children’s drawings indicated they had
too much time to respond (Participants 19 and 25).
Given the NNS’ mixed but mostly negative reactions to timed testing, we further explored
the interview responses relevant to the timing device. Among the 13 responses5 (from 3 NS

Figure 6. Examples of children’s drawings in response to questioning on how they perceived


the test to be.

and 10 NNS children) that mentioned the timing of the test or the timer, the NSs (Participants 6, 7, and 10) stated that they paid little attention to the timer, while the NNSs had mixed feelings. Five NNS children (Participants 14, 19, 22, 23, and 25) thought that the timer was helpful because it served as a major aid for time management (note that two of these, Participants 22 and 23, were high NNS scorers; these five children’s average score was 20.9 out of a total score of 25). Yet the other five NNS children (Participants 18, 26, 28, 29, and 30, averaging 17.3 on the test) stated that they felt distracted by the timer.

Discussion
Because tests for young language learners are moving into computer-mediated formats,
developers for children’s language tests need to understand fully how test mode affects
child language learners’ test scores. Validation studies are needed to assess the appropriate-
ness of the field’s current testing practices for young children, heeding observation-based
warnings about problems with child-language assessment (Carless & Lam, 2014; Colwell,
2013; McKay, 2006; Menken, 2008; Pomplun & Custer, 2005). In the present study, we
used a combination of qualitative feedback, quantitative test-score information, and eye-
movement records to investigate children’s response processes. We did this because “theo-
retical and empirical analyses of the response processes of test takers can provide evidence
concerning the fit between the construct and the detailed nature of performance or response
actually engaged in by test takers” (AERA, APA, NCME, 2014, p. 15).
In terms of test performance, the NS and NNS children scored substantially differ-
ently on two test tasks that were more complex (Tasks 2 and 5). The variation in scores
on these items provided a window for us to understand how cognitive processes underly-
ing speaking tasks can be difficult for young ELLs, and why. We were also able to see,
with these items and by looking additionally at the NS children’s data, whether any item
features could be implicated in adverse test-taking response processes. As one would
expect, lower-performing NNS children struggled with more complex tasks requiring a
paragraph-level description of visual stimuli (they were supposed to talk about the pic-
tures), apparently owing to difficulty in instant word retrieval, with subsequent effects on
overall fluidity, but also owing to a lack of visual attention to the visuals. The score dif-
ferences are related to the learners’ varying ability to use the contextual support, a feature
that may actually add to task complexity, not serve as a way to help with the cognitive
load, as prior researchers have found with adult data (Ishikawa, 2007; Iwashita, Elder, &
McNamara, 2001). Task 5 had less visual guidance in task production; thus the test takers
may have had to generate more of their own content and plan more of their language,
which may have slowed or hindered the NNS children from completing the underlying
cognitive tasks, and thus the primary task at hand (i.e., producing complete and fluent
speech; Robinson, 2001, 2005). This could particularly affect younger ELLs as a result
of their still-developing memory span as well as oral English abilities (Dempster, 1985).
At the same time, despite Task 2 being a more positive contextual support condition (i.e.,
production was guided by more visual prompting), the NNS children still did poorly on
it, at least in comparison to their NS peers. By looking at the eye-movement data, we see,
however, that in Task 2, the NNS children did not view the positive contextual support as
much as their NS peers did: instead, the NNS children were looking at the timer. We see
in the eye-movement metrics an alignment between the NNS children’s visual-atten-
tional foci and their speech production. The NNS children focused excessively on the
timer during production, especially when they were having difficulty in word retrieval
(see Figure 5), and also immediately before describing a subsequent picture in Task 2.
This differs from the NS children, who primarily focused on content cues while
responding.
These results can be interpreted in different ways. First, the eye-movement patterns of
the NS children seemingly corroborate previous accounts on the close association
between eye movements and language production (Griffin, 2004; Griffin & Oppenheimer,
2006; Meyer et al., 1998). A robust finding in relevant literature is the synchronization
between gaze on an object and articulation of that object’s name (Griffin, 2001, 2004;
Van der Meulen et al., 2001). The advocates of the content hypothesis (Humphreys &
Forde, 2001; Van der Meulen et al., 2001) claimed that an object’s visual features acti-
vate stored semantic and conceptual representations that in turn activate potential names
and structures (Griffin & Oppenheimer, 2006). Additionally, any prolonged gaze at a
referent after identifying it prevents attentional dispersion to other irrelevant objects,
thereby lessening interference with word-production processes (Harley, 1990; Van der
Meulen et al., 2001). If this is true, the NS children’s fixation patterns on the contextual
cues in the present study may have helped them to precisely retrieve and process relevant
information and maintain their focus on the primary information necessary for successful
task completion (Griffin & Oppenheimer, 2006). The content hypothesis would also
seem to suggest that the reduced attention to relevant visual references (as in the case of
the NNS children) may cause hesitations and retrieval errors in speech (Humphreys,
Riddoch, & Price, 1997), as seen in this study’s data. According to the hypothesis, speak-
ers are susceptible to environmental intrusions (and perhaps especially so by children
who rely heavily on the visual world for meaning making), even in closed speaking
tasks. Speakers may make errors when they fixate too briefly at relevant objects or too
long at irrelevant objects while preparing or producing speech (Griffin & Oppenheimer,
2006). If this was the case in the current study, the NNS children’s divided attention on
test features (the timer) rather than on content features (pictures of objects that provide
vocabulary or structure for the speech task) may have reduced their use of key informa-
tion relevant to task completion, while potentially causing more speech errors. As we
explained in the literature review, we believe that raters, because the task is closed,
assume that there is no variability in the conditions under which performance takes place.
But our eye-tracking metrics seem to point out that there is much variability in the condi-
tions under which performance takes place (attentional variability), and with children,
that attentional variability may be intertwined with speech-production variation.
It remains unclear why only the NNS children diverted their attention to task-irrelevant test features such as the timer during speech production. We can only
speculate. An association between gaze aversion and task difficulty (Glenberg,
Schroeder, & Robertson, 1998) provides a possible explanation. Speakers tend to avert
their eyes to a neutralized field (e.g., the floor or sky) in tasks that pose dual demands
(e.g., monitoring/remembering the content and retrieving/producing words; Glenberg
et al., 1998) or during disfluencies (Exline & Fehr, 1978). For instance, in a picture-
description task such as Tasks 2 and 5 (i.e., dual-task situations; Glenberg et al., 1998),
information-rich components of the environment (e.g., visual stimuli) may distract
attention from the concurrent task of word retrieval (Griffin & Oppenheimer, 2006). In
such cases, averting one’s gaze to a more uniform location is a way of disengaging
from the visual environment, directing attention toward efficient word retrieval
(Glenberg et al., 1998). If the countdown timer served this role (and this is not explic-
itly tested in this study), this claim may align with the current study’s findings, espe-
cially in the contexts of Tasks 2 and 5 that were more difficult. Within these task
conditions, the NNS children’s fixations occurred at specific points in speech such as
immediately before the description of subsequent pictures and during hesitations. It
could be that the NNS children exploited the timer as a buffer in an effortful way to
constrain additional cognitive demands and facilitate speech production. But we think
this explanation may be unsatisfactory. We turn now to the interview data to shed more
light on why the NNSs tended to attend to the timer.
As differential attention to the timer seemed to be a potential source of score variation
across groups, we looked at the interview data in which children mentioned the timer as
a distraction. This view came almost equally across both groups (NS and NNS) of chil-
dren, even though our metrics show the NNS actually looked at the timer more often and
for longer periods. Previous studies have noted that countdown timers arouse test anxiety
in adult test takers in computer-based oral tests (Joo, 2007; Shi, 2012). The same type of
device could create even more tension for young ELLs, especially since research has
shown their vulnerability while taking timed tests (Hill & Wigfield, 1984; Plass & Hill,
1986). As they are still developing the notion of time, the children may generally find it
unnatural to be timed (Friedman, 1992), and especially so because in real classrooms,
they normally are not timed in such a way. In this sense, the countdown timer may be a
distracting feature in the test, one that interrupts the test taker’s necessary cognitive oper-
ations. The young ELL test takers may engage in maladaptive test-taking strategies
because of the presence of the timer. For instance, researchers have documented that both moderately and highly anxious children tended to perform quickly but less accurately in
timed tests (Plass & Hill, 1986). The poorly performing NNS children in the current
study precisely showcased short, incomplete responses, multiple and frequent looks at
the timer, and ultimately failed to incorporate the key requirements for achieving higher
scores. Raters may score thinking there is no variability in the conditions under which
performance takes place, but the timer may be introducing (or causing) variability in test-
taking response behaviors, and thus construct-irrelevant variability in processing and
performance.
There is, however, another way to look at the data. The timer interference in the low-
proficiency NNS children’s speech may indicate that their proficiency was not high
enough for the tasks at hand. That is, if NS and high-scoring NNS children were not
adversely impacted by the timer (as their interview data suggested), then that would sug-
gest that language proficiency levels contribute to students’ perceptions or anxiety levels
when performing specific types of computerized test tasks. It could also be the case that
such timer-focused attentional patterns are attributable to the age differences among the
ELL children. This is an interesting area of research, particularly concerning younger
ELL children, as they are reported to experience a higher level of distress in timed testing
contexts (McKay, 2006; Menken, 2008) relative to their native English-speaking peers
(see also Hodge et al., 1997). Butler and Zeng (2014) addressed the possible interplay of proficiency level and age in paired-testing contexts involving ELL children. The researchers suggested that paired testing may not be advantageous for assessing the interactional competence of younger ELL children, especially when the communicative tasks are more difficult. This was based on Butler and Zeng’s finding that, as opposed to sixth-grade ELL pairs, fourth-grade ELL dyads produced less diverse interactional patterns and relied more on fixed
expressions. One could assume that when the gap between a child’s ability and test-task
difficulty is too great, the child’s performance cannot be reliably scored and the score on
that task cannot be reliably interpreted. This is a quandary for computerized oral testing
because most of it is not computer adaptive, which would prevent test takers from receiv-
ing test items that are too difficult or too easy. Based on the outcome of this study, it
appears that computer-adaptive, oral-language testing is something that oral-test devel-
opers should work vigorously toward.
On a related note, we think the findings of this study also point out the need for child-
language assessment developers to understand fully the contribution of task types and
task features to task difficulty (McKay, 2005). We believe the timer in the test was not
innocuous. It most likely affected, at least to some extent, the child test takers’ spoken
performance as well as their use of attentional resources during testing, and it appeared
to have affected the less proficient and correspondingly more anxious test takers the
most. If a validation study can lead to recommended revisions in a test (as outlined on
p. 12 of the AERA, APA, NCME 2014 Standards), we believe this study would suggest
removing timers from child speaking tests, or devising an alternate method for framing
the proposed processing load (or expected outcome) of the task. A worthy follow-up
study would have a within-group (repeated-measures) design where NNSs take various
speaking tests of equal difficulty, but with the test features (such as a countdown timer)
manipulated (i.e., timer, no timer, alternative to the timer). It would be worthwhile to investigate the effects of an alternative to the timer for children: for example, the presence of an animal or test narrator on screen who gestures or holds up a microphone to indicate that he or she is listening when (and while) it is time to talk. We think our study shows that the second cognitive operation that
we outlined in Table 1, that is, a child’s understanding of the speaking context (the closed
skill task), is one of the hardest cognitive operations for young children. This could be
owing to the novel aspect of this type of speaking task, and the unfamiliarity children
have with timed speaking constructs.
Our study had limitations in that the NNS children involved were recent arrivals, with
the parents in the room during test taking. We know from prior research that parents add
pressure, especially in Asian cultures (Carless & Lam, 2014; Chik & Besser, 2011), and
thus our interpretation that the NNSs were more stressed by test features than the NSs were
should be cautiously received. Quite reasonably, the NNSs’ stress may have additionally
stemmed from the presence of their parents in the room. Future research could remedy
this by investigating ELL tests with various populations of young learners.

Conclusions
In the present study, we attempted to describe the oral-test-taking processes and test-
taking perspectives of a quickly growing group of test takers: young ELLs. The major
finding of the present study is that when young ELLs are not able to use helpful (and
required) visual cues (that are most likely meant to reduce the child’s cognitive load) to
the fullest extent possible, score interpretation suffers: The children may have been dis-
tracted by (or anxious because of) test features borrowed from adult-testing scenarios
(countdown timers). Countdown timers are supposed to serve as an aid to cognitive
processing; they are supposed to help children (or test takers in general) understand the
speaking context, and help them know when to speak, how much to speak, and when to
stop (see Table 1). But in this study, the timer may have introduced construct-irrelevant
variability in response processes to which raters were most likely blind. In our data, we
found a potential contributor to ELLs’ oral test scores in the more difficult test tasks.
These tasks had visual cues to aid processing, but they also had onscreen test features
related to task management and test restrictions (the timer). We observed that more atten-
tion to the timer went hand-in-hand with less attention to the helpful content pictures,
and these behaviors in turn corresponded with low test scores (low speaking proficiency).
The eye-movement data do not suggest causality; however, at this point we do need to
know which causes which. The new issues that have revealed themselves through this
research are these: Is there a compounding effect, such that the low-proficiency students
are doubly set up to do poorly on these types of items? What would happen if the timer
were not present? Would that free up cognitive processing demands, especially for the
lower-level NNSs, such that they could allocate more attention to the helpful visual cues,
and thus score higher on the test?
Acknowledging that the tests should be biased for the best (Swain, 1984), and
because, as summarized by Chapelle, Enright, and Jamieson (2010), researchers should
gather multiple types of evidence to support test-score interpretations, we urge test
developers to consider child test takers’ sensitivity to the test-specific conditions on
young learners’ oral-skills tests. We believe researchers should carefully investigate
any (or all) time indicators that are meant helpfully to frame the speaking context and
task demands. While our study was small, the results call for more research on the
effects of young language-learner tests (as put forth by Butler & Zeng, 2014). Such
research is critical at this time to accommodate the future increase in the number of
young language test takers, especially those who may have limited exposure to stand-
ardized computer-based testing.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/
or publication of this article: This research was part of a larger project funded by Educational
Testing Service (ETS) under a Committee of Examiners and TOEFL research grant, specifically a
TOEFL® Young Students Series Research Program Grant For Graduate Students awarded to
Shinhye Lee and Laura Ballard at Michigan State University. ETS does not discount or endorse the
methodology, results, implications or opinions presented by the researchers.

Notes
1 Typically referred to as English language learners (ELLs) (see Espinosa, 2010 for discus-
sions on diverse terms used for describing this group of children), these children often begin
their English studies during or even before the early years in formal elementary schooling
(McKay, 2006).
2 An earlier pilot on a swivel chair with wheels that also allowed for a slight recline or rocking
motion showed problems as the children enjoyed swishing back and forth and moving the
chair around, which rendered unusable eye-movement data.
3 Both raters had extensive experience rating test takers’ oral performance measured through
the TOEFL iBT speaking test section at the time of the study. They had experience rating
school-age test takers’ oral responses, with one of them being a rater of the TOEFL Junior test.
As such, they both were familiar with rating children’s oral performance and were familiar
with the current test’s rubric descriptors.
4 The agreement between the two raters across Tasks 2 through 7 was as follows: .71 (Task 2), .63 (Task 3), .55 (Task 4), .58 (Task 5), .75 (Task 6), and .66 (Task 7).
5 Because the issue of the timer emerged through the course of a series of experiments, we did
not ask all children this question.

References
ACTFL. (2015). NCSSFL-ACTFL Can-do statements: Performance indicators for language learn-
ers. Alexandria, VA: American Council on the Teaching of Foreign Languages. Available at
www.actfl.org/publications/guidelines-and-manuals/ncssfl-actfl-can-do-statements.
AERA, APA, NCME. (2014). Standards for educational and psychological testing. Washington,
DC: American Education Research Association.
Barraza, L. (1999). Children’s drawings about the environment. Environmental Educational
Research, 5(1), 49–66.
Bax, S. (2013). The cognitive processing of candidates during reading tests: Evidence from eye-
tracking. Language Testing, 30(4), 441–465.
Bosse, M., & Valdois, S. (2009). Influence of the visual attention span on child reading perfor-
mance: A cross-sectional study. Journal of Research in Reading, 32(2), 230–253.
Brunfaut, T., & McCray, G. (2015). Looking into test-takers’ cognitive processing whilst com-
pleting reading tasks: A mixed-method eye-tracking and stimulated recall study. (ARAGs
Research Reports – Online, vol. 1, no. 1). London: British Council.
Butler, Y. G., & Zeng, W. (2014). Young foreign language learner’s interactions during task-based
paired assessment. Language Assessment Quarterly, 11(1), 45–75.
Carless, D., & Lam, R. (2014). The examined life: Perspectives of lower primary school students
in Hong Kong. Education 3–13: International Journal of Primary, Elementary and Early
Years Education, 42(3), 313–329.
Chapelle, C. A., Enright, M. K., & Jamieson, J. (2010). Does an argument-based approach to valid-
ity make a difference? Educational Measurement: Issues and Practice, 29(1), 3–13.
Chik, A., & Besser, S. (2011). International language test taking among young learners: A Hong
Kong case study. Language Assessment Quarterly, 8(1), 73–91.
Colwell, N. M. (2013). Test anxiety, computer-adaptive testing and the Common Core. Journal of
Education and Training Studies, 1(2), 50–60.
De Jong, N. H., Groenhout, R., Schoonen, R., & Hulstijn, J. H. (2013). Second language fluency:
Speaking style or proficiency? Correcting measures of second language fluency for first lan-
guage behavior. Applied Psycholinguistics, 36(2), 223–243.
Dempster, F. N. (1985). Short-term memory development in childhood and adolescence. In C. J.
Brainerd & M. Pressley (Eds.), Basic processes in memory development (pp. 209–248). New
York: Springer-Verlag.
Dewaele, J.-M., Petrides, K. V., & Furnham, A. (2008). The effects of trait emotional intelligence
and sociobiographical variables on communicative anxiety and foreign language anxiety
among adult multilinguals: A review and empirical investigation. London: Birkbeck ePrints.
Donker, A., & Reitsma, P. (2007). Young children’s ability to use a computer mouse. Computers
& Education, 48(4), 602–617.
Edelbrock, C. (1984). Developmental considerations. In T. H. Ollendick & M. Hersen (Eds.),
Child behavioral assessment: Principles and procedures (pp. 20–37). New York: Pergamon.
Espinosa, L. M. (2010). Assessment of young English-language learners. In E.E. Garcia &
E. Frede (Eds.), Young English-language learners: Current research and emerging directions
for practice and policy (pp. 119–142). New York: Teacher’s College Press.
Exline, R. V., & Fehr, B. J. (1978). Applications of semiosis to the study of visual interaction. In A.
W. Seigman & S. Feldstein (Eds.), Nonverbal behavior and communication (pp. 117–157).
Hillsdale, NJ: Lawrence Erlbaum.
Flanery, R. (1990). Methodological and psychometric considerations in child reports. In A. M. La
Greca (Ed.), Through the eyes of the child: Obtaining self-reports from children and adoles-
cents (pp. 57–82). Needham Heights, MA: Allyn & Bacon.
Friedman, W. J. (1992). Children’s time memory: The development of a differentiated past.
Cognitive Development, 7(2), 171–187.
Fritts, B. E., & Marszalek, J. M. (2010). Computerized adaptive testing, anxiety levels, and gender
differences. Social Psychology of Education, 13(3), 441–458.
Glenberg, A. M., Schroeder, J. L., & Robertson, D. A. (1998). Averting the gaze disengages the
environment and facilitates remembering. Memory & Cognition, 26(4), 651–658.
Griffin, Z. M. (2001). Gaze durations during speech reflect word selection and phonological
encoding. Cognition, 82(1), 1–14.
Griffin, Z. M. (2004). The eyes are right when the mouth is wrong. Psychological Science, 15(12),
814–821.
Griffin, Z. M., & Oppenheimer, D. M. (2006). Speakers gaze at objects while preparing intention-
ally inaccurate labels for them. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 32(4), 943–948.
Hall, K., Collins, J., Benjamin, S., Nind, M., & Sheehy, K. (2004). SATurated models of pupildom:
Assessment and inclusion/exclusion. British Educational Research Journal, 30(6), 801–817.
Harlen, W. (2006). The role of assessment in developing motivation for learning. In J. Gardner
(Ed.), Assessment and learning (pp. 61–80). London: SAGE Publications.
Harley, T. A. (1990). Environmental contamination of normal speech. Applied Psycholinguistics, 11(1), 45–72.
Hewitt, E., & Stephenson, J. (2012). Foreign language anxiety and oral exam performance: A rep-
lication of Phillips’s MLJ study. The Modern Language Journal, 96(2), 170–189.
Hill, K. T., & Wigfield, A. (1984). Test anxiety: A major educational problem and what can be
done about it. The Elementary School Journal, 85(1), 105–126.
Hodge, G. M., McCormick, J., & Elliott, R. (1997). Examination-induced distress in a public
examination at the completion of secondary schooling. British Journal of Educational
Psychology, 67(2), 185–197.
Holmqvist, K., Nyström, M., Andersson, R., Dewhurst, R., Jarodzka, H., & van de Weijer, J. (2011).
Eye tracking: A comprehensive guide to methods and measures. Oxford, UK: Oxford
University Press.
Horwitz, E. K. (2010). Foreign and second language anxiety. Language Teaching, 43(2), 154–167.
Humphreys, G. W., & Forde, E. M. E. (2001). Hierarchies, similarity, and interactivity in object
recognition: “Category-specific” neuropsychological deficits. Behavioral and Brain Sciences,
24(3), 453–509.
Humphreys, G. W., Riddoch, M. J., & Price, C. J. (1997). Top-down processes in object identifi-
cation: Evidence from experimental psychology, neuropsychology, and functional anatomy.
Philosophical Transactions: Biological Sciences, 352(1358), 1275–1282.
Hyun, E. (2005). A study of 5- to 6-year-old children’s peer dynamics and dialectical learning in
a computer-based technology-rich classroom environment. Computers & Education, 44(1),
69–91.
Ishikawa, T. (2007). The effect of manipulating task complexity along the (+/− here-and-now)
dimension on L2 written narrative discourse. In M. del Pilar García-Mayo (Ed.), Investigating
tasks in formal language learning (pp. 136–156). Clevedon, UK: Multilingual Matters.
Iwashita, N., Elder, C., & McNamara, T. (2001). Can we predict task difficulty in an oral profi-
ciency test? Exploring the potential of an information-processing approach to task design.
Language Learning, 51(3), 401–436.
Joo, M. (2007). The attitudes of students’ and teachers’ toward a Computerized Oral Test (COT)
and a Face-To-Face Interview (FTFI) in a Korean university setting. Journal of Language
Sciences, 14(2), 171–193.
Kahng, J. (2014). Exploring utterance and cognitive fluency of L1 and L2 English speakers:
Temporal measures and stimulated recall. Language Learning, 64(4), 809–854.
McCray, G., & Brunfaut, T. (2016). Investigating the construct measured by banked gap-fill items:
Evidence from eye-tracking. Language Testing. Advance online publication.
McKay, P. (2005). Research into the assessment of school-age language learners. Annual Review
of Applied Linguistics, 25, 243–263.
McKay, P. (2006). Assessing young language learners. Cambridge, UK: Cambridge University
Press.
McNeil, L., & Valenzuela, A. (2001). The harmful impact of the TAAS system of testing in Texas:
Beneath the accountability rhetoric. In M. Kornhaber & G. Orfield (Eds.), Raising standards
or raising barriers? Inequality and high stakes testing in public education (pp. 127–150).
New York: Century Foundation.
Menken, K. (2008). English language learners left behind: Standardized testing as language pol-
icy. Clevedon, England: Multilingual Matters.
Meyer, A. S., Sleiderink, A., & Levelt, W. J. M. (1998). Viewing and naming objects: Eye move-
ments during noun phrase production. Cognition, 66(2), B25–B33.
Pitkin, A. K., & Vispoel, W. P. (2001). Differences between self-adapted and computerized adap-
tive tests: A meta-analysis. Journal of Educational Measurement, 38(3), 235–247.
Plass, J. A., & Hill, K. T. (1986). Children’s achievement strategies and test performance: The
role of time pressure, evaluation anxiety, and sex. Developmental Psychology, 22(1), 31–36.
Pomplun, M., & Custer, M. (2005). The score comparability of computerized and paper-and-pencil
formats for K-3 reading tests. Journal of Educational Computing Research, 32(2), 153–166.
Rayner, K. (2009). Eye movements in reading: Models and data. Journal of Eye Movement
Research, 2(5), 1–10.
Reay, D., & Wiliam, D. (1999). “I’ll be a nothing”: Structure, agency and the construction of iden-
tity through assessment. British Educational Research Journal, 25(3), 343–354.
Reichle, E. D., Warren, T., & McConnell, K. (2009). Using E-Z Reader to model the effects of
higher level language processing on eye movements during reading. Psychonomic Bulletin &
Review, 16(1), 1–21.
Robinson, P. (2001). Task complexity, cognitive resources, and syllabus design: A triadic frame-
work for investigating task influences on SLA. In P. Robinson (Ed.), Cognition and second
language instruction (pp. 287–318). New York: Cambridge University Press.
Robinson, P. (2005). Cognitive complexity and task sequencing: Studies in a componential frame-
work for second language task design. International Review of Applied Linguistics, 43, 1–32.
Rotenberg, A. M. (2002). A classroom research project: The psychological effects of standardized
testing on young English language learners at different language proficiency levels. Available
from the ERIC Database (ED472651) at http://eric.ed.gov/?id=ED472651
Segalowitz, N., & Trofimovich, P. (2012). Second language processing. In S. M. Gass & A.
Mackey (Eds.), The Routledge handbook of second language acquisition (pp. 179–192). New York: Routledge.
Shi, F. (2012). Exploring students’ anxiety in computer-based oral English test. Journal of
Language Teaching and Research, 3(3), 446–451.
Smith, C. A., & Ellsworth, P. C. (1987). Patterns of appraisal and emotions related to taking
exams. Journal of Personality and Social Psychology, 52(3), 475–488.
Spivey, M., Richardson, D., & Dale, R. (2009). The movement of eye and hand as a window into
language and cognition. In E. Morsella & J. Bargh (Eds.), Oxford handbook of human action
(pp. 225–248). New York: Oxford University Press.
Suvorov, R. (2015). The use of eye tracking in research on video-based second language (L2)
listening assessment: A comparison of context videos and content videos. Language Testing,
32(4), 463–483.
Swain, M. (1984). Large-scale communicative testing: A case study. In S. J. Savignon &
M. Berns (Eds.), Initiatives in communicative language teaching (pp. 185–201). Reading,
MA: Addison-Wesley.
Van der Meulen, F. F., Meyer, A. S., & Levelt, W. J. M. (2001). Eye movements during the pro-
duction of nouns and pronouns. Memory & Cognition, 29(3), 512–521.
Winke, P. (2011). Evaluating the validity of a high-stakes ESL test: Why teachers’ perceptions
matter. TESOL Quarterly, 45(4), 628–660.
Winke, P., Lee, S., Ahn, I., Choi, I., Cui, Y., & Yoon, H.-J. (2015). A validation study of the
reading section of the Young Learners Tests of English (YLTE). CaMLA Working Papers,
2015–03, 1–30. Available at www.cambridgemichigan.org/wpcontent/uploads/2015/12/
CWP-2015–03.pdf.
Winke, P., & Lim, H. (2014). The effects of testwiseness and test-taking anxiety on L2 listen-
ing test performance: A visual (eye-tracking) and attentional investigation. IELTS Research
Reports Series, 3. Available at www.ielts.org/pdf/Winke%20and%20Lim.pdf.
Winke, P., & Lim, H. (2015). ESL essay raters’ cognitive processes in applying the Jacobs et al.
rubric: An eye-movement study. Assessing Writing, 25, 38–54.
