the items (assessment criteria) and the way in which they apply the
scale do differ’ (p. 13).
Studies of raters with diverse linguistic and professional back-
grounds have also been conducted. Comparing native and non-
native Spanish speakers with or without teaching experience,
Galloway (1980) found that non-native teachers tended to focus
on grammatical forms and reacted more negatively to non-
verbal behavior and slow speech, while non-teaching native
speakers seemed to place more emphasis on content and on sup-
porting students’ attempts at self-expression. Conversely, Barnwell
(1989) reported that untrained native Spanish speakers provided
more severe assessments than an ACTFL-trained Spanish rater.
This result conflicts with that of Galloway (1980), who found
that untrained native speakers were more lenient than teachers.
Barnwell suggested that both studies were small in scope, and that
it was therefore premature to draw conclusions about native speak-
ers’ responses to non-native speaking performance. Hill (1997)
further pointed out that the use of two different versions of rating
scales in Barnwell’s study, one of which was presented in English
and the other in Spanish, remains questionable.
One recent study of rater behavior focused on the effect of
country of origin and task on evaluations of students’ oral English
performance. Chalhoub-Deville and Wigglesworth (2005) inves-
tigated whether native English-speaking teachers who live in dif-
ferent English-speaking countries (i.e., Australia, Canada, the UK,
and the USA) exhibited significantly different rating behaviors in
their assessments of students’ performance on three Test of Spoken
English (TSE) tasks – 1) give and support an opinion, 2) picture-
based narration, and 3) presentation – which require different
linguistic, functional, and cognitive strategies. MANOVA results
indicated significant variability among the different groups of native
English-speaking teachers across all three tasks, with teachers resid-
ing in the UK the most severe and those in the USA the most lenient
across the board; however, the very small effect size (η² = 0.01)
suggested that little difference exists among different groups of
native English-speaking teachers.
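As a point of reference, η² under its conventional definition is the ratio of between-groups to total variance, so a value of 0.01 indicates that rater country accounted for only about 1% of the variance in scores (a worked reading assuming the conventional formula rather than any study-specific variant):

$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}} = 0.01 \;\Rightarrow\; \text{roughly 1\% of score variance attributable to rater country}$$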
Although the above studies provide some evidence that raters’
linguistic and professional backgrounds influence their evaluation
behavior, further research is needed for two reasons. First, most
extant studies are not grounded in finely tuned methodologies. In
some early studies (e.g., Fayer & Krasinski, 1987; Galloway, 1980;
II Methodology
1 Research design overview
The underlying research framework of this study is based on both
expansion and complementarity mixed methods designs, which are
most commonly used in empirical mixed methods evaluation studies
(see Greene et al., 1989 for a review of mixed methods evaluation
designs). The expansion design was considered particularly well
suited to this study because it would offer a comprehensive and
diverse illustration of rating behavior, examining both the product
that the teachers generate (i.e., the numeric scores awarded to stu-
dents) and the process that they go through (i.e., evaluative com-
ments) in their assessment of students’ oral English performance
(Greene et al., 1989). The complementarity design was included to
2 Participants
Ten Korean students were selected from a college-level language
institute in Montreal, Canada, and were informed about the research
project and the test before taking part in the study. The students were
drawn from class levels ranging from beginner to advanced, so that
the student sample would include differing levels of English profi-
ciency. The language institute sorted students into one of five class
levels according to their aggregate scores on a placement test meas-
uring four English language skills (listening, reading, speaking, and
writing): Level I for students with the lowest English proficiency,
up to Level V for students with the highest English proficiency.
Table 1 shows the distribution of the student sample across the five
class levels.
For the teacher samples, a concurrent mixed methods sampling
procedure was used in which a single sample produced data for
both the quantitative and qualitative elements of the study (Teddlie
& Yu, 2007). Twelve native English-speaking Canadian teachers
of English and 12 non-native English-speaking Korean teachers of
English constituted the NS and NNS teacher groups, respectively.
In order to ensure that the teachers were sufficiently qualified,
certain participation criteria were outlined: 1) at least one year of
prior experience teaching an English conversation course to non-
native English speakers in a college-level language institution;
2) at least one graduate degree in a field related to linguistics or
language education; and 3) high proficiency in spoken English for
Korean teachers of English. Teachers’ background information
Table 1  Distribution of the student sample across the five class levels

Class level            I    II    III    IV    V
Number of students     1     1     3     3    2
3 Instruments
A semi-direct oral English test was developed for the study. The
purpose of the test was to assess the overall oral communicative
language ability of non-native English speakers within an aca-
demic context. Throughout the test, communicative language abil-
ity would be evidenced by the effective use of language knowledge
and strategic competence (Bachman & Palmer, 1996). Initial test
development began with the identification of target language use
domain, target language tasks, and task characteristics (Bachman
& Palmer, 1996). The test tasks were selected and revised to reflect
potential test-takers’ language proficiency and topical knowledge,
as well as task difficulty and interest. An effort was also made to
select test tasks related to hypothetical situations that could occur
within an academic context. In developing the test, the guiding
principles of the Simulated Oral Proficiency Interview (SOPI)
were referenced.
The test consisted of three different task types in order to
assess the diverse oral language output of test-takers: picture-
based, situation-based, and topic-based. The picture-based task
required test-takers to describe or narrate visual information, such
as describing the layout of a library (Task 1, [T1]), explaining
the library services based on a provided informational note (Task
2, [T2]), narrating a story from six sequential pictures (Task 4,
[T4]), and describing a graph of human life expectancy (Task 7,
[T7]). The situation-based task required test-takers to perform the
appropriate pragmatic function in a hypothetical situation, such
as congratulating a friend on being admitted to school (Task 3,
[T3]). Finally, the topic-based task required test-takers to offer
4 Procedure
The test was administered individually to each of 10 Korean
students, and their speech responses were simultaneously
recorded as digital sound files. The order of the students’ test
response sets was randomized to minimize a potential ordering
effect, and then 12 of the possible test response sets were distrib-
uted to both groups of teachers. A meeting was held with each
teacher in order to explain the research project and to go over the
scoring procedure, which had two phases: 1) rating the students’
test responses according to the four-point rating scale; and 2)
justifying those ratings by providing written comments either in
English or in Korean. While the NS teachers were asked to write
comments in English, the NNS teachers were asked to write
comments in Korean (which were later translated into English).
The rationale for requiring teachers’ comments was that they
would not only reveal the evaluation criteria that the teachers drew
on to infer students’ oral proficiency, but would also help to iden-
tify the construct being measured. The teachers were allowed
to control the playing, stopping, and replaying of test responses
and to listen to them as many times as they wanted. After rat-
ing a single task response by one student according to the rating
scale, they justified their ratings by writing down their reasons or
comments. They then moved on to the next task response of that
student. The teachers rated and commented on 80 test responses
(10 students × 8 tasks).
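Purely as an illustration of how such observations could be organized for the subsequent analyses (the study does not specify a storage format, and all field names below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Rating:
    """One observation: a teacher's judgment of one student's response to one task."""
    teacher_id: str  # e.g., 'NS1'-'NS12' or 'NNS1'-'NNS12' (hypothetical labels)
    student_id: str  # e.g., 'S01'-'S10'
    task_id: str     # e.g., 'T1'-'T8'
    score: int       # 1-4 on the four-point rating scale
    comment: str     # written justification (NNS comments translated into English)

# 24 teachers x 10 students x 8 tasks = 1,920 possible observations;
# the study reports 1,727 valid ratings.
```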
To decrease the subject expectancy effect, the teachers were told
that the purpose of the study was to investigate teachers’ rating
behavior, and the comparison of different teacher groups was not
explicitly mentioned. The two groups of teachers were therefore
unaware of each other. In addition, only minimal information about
the students (e.g., education level and current visa status) was
provided to the teachers. Meetings with the NS teachers
were held in Montreal, Canada, and meetings with the NNS teach-
ers followed in Daegu, Korea. Each meeting lasted approximately
30 minutes.
5 Data analyses
Both quantitative and qualitative data were collected. The quanti-
tative data consisted of 1,727 valid ratings, awarded by 24 teach-
ers to 80 sample responses by 10 students on eight tasks. Each
[Figure: overview of the data analyses. Quantitative strand: teachers’ internal consistency examined through fit statistics and proportions of large standard residuals. Qualitative strand: typology development of evaluation criteria from 3,295 teachers’ written comments, followed by data transformation (quantification of evaluation features) for cross-comparison.]
see Kim, 2005). The 19 evaluative criteria were compared across the
two teacher groups through a frequency analysis.
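Such a frequency analysis amounts to tallying, for each teacher group, how often each coded criterion appears in the written comments. A minimal sketch, assuming the comments have already been coded with criterion labels (the labels and data below are hypothetical):

```python
from collections import Counter

# Each coded comment: (teacher_group, criterion_label); values below are hypothetical.
coded_comments = [
    ("NS", "fluency"), ("NS", "pronunciation"), ("NNS", "overall language use"),
    ("NNS", "fluency"), ("NS", "overall task accomplishment"),
]

# Tally criterion frequencies separately for each teacher group.
freq = {"NS": Counter(), "NNS": Counter()}
for group, criterion in coded_comments:
    freq[group][criterion] += 1

# Side-by-side comparison across groups for every criterion mentioned.
for criterion in sorted(set(freq["NS"]) | set(freq["NNS"])):
    print(f"{criterion:35s} NS: {freq['NS'][criterion]:3d}  NNS: {freq['NNS'][criterion]:3d}")
```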
$$\pi = \frac{N_u}{N_t} \qquad (1)$$

where $N_u$ = the total number of large standard residuals and $N_t$ = the total number of ratings.

$$P_r = \frac{N_{ur}}{N_{tr}} \qquad (2)$$

where $N_{ur}$ = the number of large standard residuals made by rater $r$ and $N_{tr}$ = the number of ratings made by rater $r$.
An inconsistent rating will occur when the observed propor-
tion exceeds the null proportion beyond the acceptable deviation
(Myford & Wolfe, 2000). Thus, Myford and Wolfe propose that the
frequency of unexpected ratings (Zp) can be calculated using equa-
tion (3). According to them, if a Zp value for a rater is below +2, it
indicates that the unexpected ratings that he or she made are random
error; however, if the value is above +2, the rater is considered to be
exercising an inconsistent rating pattern.
$$Z_p = \frac{P_r - \pi}{\sqrt{\dfrac{\pi - \pi^2}{N_{tr}}}} \qquad (3)$$
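Putting equations (1) to (3) together, the computation can be sketched as follows; the flagging criterion of standardized residuals greater than +2 follows the study’s report below, while the function and variable names are illustrative rather than Myford and Wolfe’s own implementation:

```python
import numpy as np

def zp_indices(residuals_by_rater, threshold=2.0):
    """Myford & Wolfe's Zp index of rating consistency, one value per rater.

    residuals_by_rater: dict mapping rater id -> array of standardized residuals
    for that rater's ratings (an assumed input format).
    """
    # Equation (1): null proportion = total large residuals / total ratings.
    all_res = np.concatenate(list(residuals_by_rater.values()))
    pi = np.mean(all_res > threshold)  # 'large' = standardized residual > +2 here

    zp = {}
    for rater, res in residuals_by_rater.items():
        n_tr = len(res)                  # ratings made by rater r
        p_r = np.mean(res > threshold)   # Equation (2): rater's observed proportion
        # Equation (3): standardize the observed proportion against the null.
        zp[rater] = (p_r - pi) / np.sqrt((pi - pi**2) / n_tr)
    return zp

# A rater with Zp above +2 would be flagged as showing an inconsistent rating pattern.
```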
In this study, an unexpected observation was reported if the stand-
ardized residual was greater than +2, which was the case in 89 out
of a total of 1,727 responses. When rating consistency was exam-
ined, one NS teacher and two NNS teachers were found to exhibit
inconsistent rating patterns, a result similar to what was found in
the fit analysis. The two NNS teachers whose observed Zp values
were greater than +2 were NNS6 and NNS7, who had been flagged
as misfitting teachers by their infit indices. Interestingly, the analy-
sis of NS teachers showed that it was NS9, not NS10, who had Zp
values greater than +2. This may be because NS10 produced only
a small number of unexpected ratings which did not produce large
residuals. That small Zp value indicates that while the teacher gave a
few ratings that were somewhat unexpectedly higher (or lower) than
the model would expect, those ratings were not highly unexpected
(C. Myford, personal communication, May 31, 2005).
Myford and Wolfe (2004a, 2004b) introduced the more advanced
Many-faceted Rasch Measurement application to detect raters’
consistency based on the single rater–rest of the raters (SR/ROR)
correlation. When raters exhibit randomness, they are flagged with
significantly large infit and outfit mean square indices; however, sig-
nificantly large infit and outfit mean square indices may also indicate
other rater effects (Myford & Wolfe, 2004a, 2004b). Thus, Myford
and Wolfe suggested that it is important to examine significantly
low SR/ROR correlations as well. More specifically, they suggested
that randomness will be detected when infit and outfit mean square
indices are significantly larger than 1 and SR/ROR correlations are
significantly lower than those of other raters. Four teachers appeared
to be inconsistent: NS9, NNS6, NNS7, and NNS9 showed not only
large fit indices but also low SR/ROR correlations. When compared
relatively, NS9, NNS7, and NNS9 seemed to be on the borderline
in their consistency, whereas NNS6 was obviously signaled as an
inconsistent teacher.
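Outside a dedicated Rasch program, the SR/ROR statistic can be approximated by correlating each rater’s scores with the mean of the remaining raters’ scores on the same responses. The sketch below assumes a complete rater-by-response score matrix and a simple Pearson formulation, which is a simplification of Myford and Wolfe’s procedure:

```python
import numpy as np

def sr_ror_correlations(ratings):
    """ratings: 2-D array, rows = raters, columns = the same scored responses.

    Returns, for each rater, the correlation between that rater's scores and
    the mean score given by the rest of the raters (a simple SR/ROR proxy).
    """
    corrs = []
    for r in range(ratings.shape[0]):
        rest_mean = np.delete(ratings, r, axis=0).mean(axis=0)  # rest-of-raters mean
        corrs.append(np.corrcoef(ratings[r], rest_mean)[0, 1])
    return corrs

# Raters combining large infit/outfit mean squares with a markedly low SR/ROR
# correlation would be the ones flagged as rating inconsistently (randomly).
```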
In summary, the three different types of statistical approaches
showed converging evidence; most of the NS and NNS teachers
were consistent in their ratings, with one or two teachers from each
group showing inconsistent rating patterns. This result implies that
the two groups rarely differed in terms of internal consistency, and
that the NNS teachers were as dependable as the NS teachers in
assessing students’ oral English performance.
[Figure: task difficulty measures (logits) for Tasks T1–T8, plotted by teacher group (1. NS GROUP; 2. NNS GROUP).]
[Figure: bias analysis z-values for Tasks T1–T8.]
more lenient pattern of ratings, while NS9 showed the exact reverse
pattern; that is, NS9 rated Task 6 significantly more severely. It is
very interesting that one teacher from each group showed the same
bias patterns on Tasks 1, 4, and 7, since it implies that the ratings of
these two teachers may be interchangeable on those tasks.
In summary, the NS and NNS teachers seem to have behaved
similarly in terms of severity, and this is confirmed by both the task
difficulty measures and the two bias analyses. The overall results of
the multiple quantitative analyses also show that the NS and NNS
[Figure: bar chart comparing the frequency of comments (0–300) made by the NS and NNS teacher groups on each of the evaluation criteria.]
specific context in which this study was carried out will make the
interpretations of the study more valid. The use of other qualitative
approaches is also recommended. The only qualitative data col-
lected were written comments, which failed to offer a full account
of the teachers’ in-depth rating behavior. Those behaviors could be
further investigated using verbal protocols or in-depth interviews
for a fuller picture of what the teachers consider effective language
performance. As one of the reviewers pointed out, it might also be
interesting to investigate whether the comments made by the NS
and NNS teachers tap different constructs of underlying oral profi-
ciency and thereby result in different rating scales. Lastly, further
research is suggested to examine the extent to which the semi-
direct oral test and the rating scale employed in this study represent
the construct of underlying oral proficiency.
Acknowledgements
I would like to acknowledge that this research project was funded
by the Social Sciences and Humanities Research Council of Canada
through McGill University’s Institutional Grant. My sincere appre-
ciation goes to Carolyn Turner for her patience, insight, and guid-
ance, which inspired me to complete this research project. I am also
very grateful to Eunice Jang, Alister Cumming, and Merrill Swain
for their valuable comments and suggestions on an earlier version
of this article. Thanks are also due to three anonymous reviewers of
Language Testing for their helpful comments.
V References
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford:
Oxford University Press.
Barnwell, D. (1989). ‘Naïve’ native speakers and judgments of oral proficiency
in Spanish. Language Testing, 6, 152–163.
Brown, A. (1995). The effect of rater variables in the development of an occu-
pation-specific language performance test. Language Testing, 12, 1–15.
Caracelli, V. J. & Greene, J. C. (1993). Data analysis strategies for mixed-
method evaluation designs. Educational Evaluation and Policy Analysis,
15, 195–207.
Caracelli, V. J. & Greene, J. C. (1997). Crafting mixed method evaluation
designs. In Greene, J. C. & Caracelli, V. J., editors, Advances in mixed-
method evaluation: The challenges and benefits of integrating diverse
Notes:
1. ‘Communication’ is defined as an examinee’s ability to both address a given task
and get a message across.
2. A score of 4 does not necessarily mean speech is comparable to that of native
English speakers.
3. No response, or a response of ‘I don’t know’ is automatically rated NR (Not
Ratable).
Appendix B: Definitions and examples of the evaluation criteria

1. Understanding the task: the degree to which a speaker understands the given task
   Examples: ‘Didn’t seem to understand the task.’; ‘Didn’t understand everything about the task.’

2. Overall task accomplishment: the degree to which a speaker accomplishes the general demands of the task
   Examples: ‘Generally accomplished the task.’; ‘Task not really well accomplished.’; ‘Successfully accomplished task.’

3. Strength of argument: the degree to which the argument of the response is robust
   Examples: ‘Good range of points raised.’; ‘Good statement of main reason presented.’; ‘Arguments quite strong’

4. Accuracy of transferred information: the degree to which a speaker transfers the given information accurately
   Examples: ‘Misinterpretation of information (e.g., graduate renewals for undergrads, $50 a day for book overdue?)’; ‘Incorrect information (e.g., ‘9pm’ instead of ‘6pm’)’

5. Topic relevance: the degree to which the content of the response is relevant to the topic
   Examples: ‘Not all points relevant’; ‘Suddenly addressing irrelevant topic (i.e., focusing on physically harmful effects of laptops rather than on harmful effects of the internet)’

6. Overall language use: the degree to which the language component of the response is of good and appropriate quality
   Examples: ‘Generally good use of language’; ‘Native-like language’; ‘Very limited language’

7. Vocabulary: the degree to which vocabulary used in the response is of good and appropriate quality
   Examples: ‘Good choice of vocabulary’; ‘Some unusual vocabulary choices (e.g., he crossed a girl.)’

8. Pronunciation: the degree to which pronunciation of the response is of good quality and clarity
   Examples: ‘Native-like pronunciation’; ‘Pronunciation difficulty (e.g., l/r, d/t, vowels, i/e)’; ‘Mispronunciation of some words (e.g., ‘circulation’)’

9. Fluency: the degree to which the response is fluent without too much hesitation
   Examples: ‘Choppy, halted’; ‘Pausing, halting, stalling – periods of silence’