the items (assessment criteria) and the way in which they apply the
scale do differ’ (p. 13).
Studies of raters with diverse linguistic and professional back-
grounds have also been conducted. Comparing native and non-
native Spanish speakers with or without teaching experience,
Galloway (1980) found that non-native teachers tended to focus
on grammatical forms and reacted more negatively to non-
verbal behavior and slow speech, while non-teaching native
speakers seemed to place more emphasis on content and on sup-
porting students’ attempts at self-expression. Conversely, Barnwell
(1989) reported that untrained native Spanish speakers provided
more severe assessments than an ACTFL-trained Spanish rater.
This result conflicts with that of Galloway (1980), who found
that untrained native speakers were more lenient than teachers.
Barnwell suggested that both studies were small in scope, and that
it was therefore premature to draw conclusions about native speak-
ers’ responses to non-native speaking performance. Hill (1997)
further pointed out that the use of two different versions of rating
scales in Barnwell’s study, one of which was presented in English
and the other in Spanish, remains questionable.
One recent study of rater behavior focused on the effect of
country of origin and task on evaluations of students’ oral English
performance. Chalhoub-Deville and Wigglesworth (2005) inves-
tigated whether native English-speaking teachers who live in dif-
ferent English-speaking countries (i.e., Australia, Canada, the UK,
and the USA) exhibited significantly different rating behaviors in
their assessments of students’ performance on three Test of Spoken
English (TSE) tasks – 1) give and support an opinion, 2) picture-
based narration, and 3) presentation – which require different
linguistic, functional, and cognitive strategies. MANOVA results
indicated significant variability among the different groups of native
English-speaking teachers across all three tasks, with teachers resid-
ing in the UK the most severe and those in the USA the most lenient
across the board; however, the very small effect size (η² = 0.01)
suggested that little difference exists among different groups of
native English-speaking teachers.
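As a point of reference, η² under its conventional definition is the ratio of between-groups to total variance, so a value of 0.01 indicates that rater country accounted for only about 1% of the variance in scores (a worked reading assuming the conventional formula rather than any study-specific variant):

$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}} = 0.01 \;\Rightarrow\; \text{roughly 1\% of score variance attributable to rater country}$$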
Although the above studies provide some evidence that raters’
linguistic and professional backgrounds influence their evaluation
behavior, further research is needed for two reasons. First, most
extant studies are not grounded in finely tuned methodologies. In
some early studies (e.g., Fayer & Krasinski, 1987; Galloway, 1980;
II Methodology
1 Research design overview
The underlying research framework of this study is based on both
expansion and complementarity mixed methods designs, which are
most commonly used in empirical mixed methods evaluation studies
(see Greene et al., 1989 for a review of mixed methods evaluation
designs). The expansion design was considered particularly well
suited to this study because it would offer a comprehensive and
diverse illustration of rating behavior, examining both the product
that the teachers generate (i.e., the numeric scores awarded to stu-
dents) and the process that they go through (i.e., evaluative com-
ments) in their assessment of students’ oral English performance
(Greene et al., 1989). The complementarity design was included to
2 Participants
Ten Korean students were selected from a college-level language
institute in Montreal, Canada, and were informed about the research
project and the test before taking part in the study. The students were
drawn from class levels ranging from beginner to advanced, so that
the student sample would include differing levels of English profi-
ciency. The language institute sorted students into one of five class
levels according to their aggregate scores on a placement test meas-
uring four English language skills (listening, reading, speaking, and
writing): Level I for students with the lowest English proficiency,
up to Level V for students with the highest English proficiency.
Table 1 shows the distribution of the student sample across the five
class levels.
For the teacher samples, a concurrent mixed methods sampling
procedure was used in which a single sample produced data for
both the quantitative and qualitative elements of the study (Teddlie
& Yu, 2007). Twelve native English-speaking Canadian teachers
of English and 12 non-native English-speaking Korean teachers of
English constituted the NS and NNS teacher groups, respectively.
In order to ensure that the teachers were sufficiently qualified,
certain participation criteria were outlined: 1) at least one year of
prior experience teaching an English conversation course to non-
native English speakers in a college-level language institution;
2) at least one graduate degree in a field related to linguistics or
language education; and 3) high proficiency in spoken English for
Korean teachers of English. Teachers’ background information
Table 1  Distribution of the student sample across the five class levels

Class level            I    II    III    IV    V
Number of students     1     1     3     3    2
3 Instruments
A semi-direct oral English test was developed for the study. The
purpose of the test was to assess the overall oral communicative
language ability of non-native English speakers within an aca-
demic context. Throughout the test, communicative language abil-
ity would be evidenced by the effective use of language knowledge
and strategic competence (Bachman & Palmer, 1996). Initial test
development began with the identification of target language use
domain, target language tasks, and task characteristics (Bachman
& Palmer, 1996). The test tasks were selected and revised to reflect
potential test-takers’ language proficiency and topical knowledge,
as well as task difficulty and interest. An effort was also made to
select test tasks related to hypothetical situations that could occur
within an academic context. In developing the test, the guiding
principles of the Simulated Oral Proficiency Interview (SOPI)
were referenced.
The test consisted of three different task types in order to
assess the diverse oral language output of test-takers: picture-
based, situation-based, and topic-based. The picture-based task
required test-takers to describe or narrate visual information, such
as describing the layout of a library (Task 1, [T1]), explaining
the library services based on a provided informational note (Task
2, [T2]), narrating a story from six sequential pictures (Task 4,
[T4]), and describing a graph of human life expectancy (Task 7,
[T7]). The situation-based task required test-takers to perform the
appropriate pragmatic function in a hypothetical situation, such
as congratulating a friend on being admitted to school (Task 3,
[T3]). Finally, the topic-based task required test-takers to offer
4 Procedure
The test was administered individually to each of 10 Korean
students, and their speech responses were simultaneously
recorded as digital sound files. The order of the students’ test
response sets was randomized to minimize a potential ordering
effect, and then 12 of the possible test response sets were distrib-
uted to both groups of teachers. A meeting was held with each
teacher in order to explain the research project and to go over the
scoring procedure, which had two phases: 1) rating the students’
test responses according to the four-point rating scale; and 2)
justifying those ratings by providing written comments either in
English or in Korean. While the NS teachers were asked to write
comments in English, the NNS teachers were asked to write
comments in Korean (which were later translated into English).
The rationale for requiring teachers’ comments was that they
would not only reveal the evaluation criteria that the teachers drew
on to infer students’ oral proficiency, but would also help to iden-
tify the construct being measured. The teachers were allowed
to control the playing, stopping, and replaying of test responses
and to listen to them as many times as they wanted. After rat-
ing a single task response by one student according to the rating
scale, they justified their ratings by writing down their reasons or
comments. They then moved on to the next task response of that
student. The teachers rated and commented on 80 test responses
(10 students × 8 tasks).
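Purely as an illustration of how such observations could be organized for the subsequent analyses (the study does not specify a storage format, and all field names below are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Rating:
    """One observation: a teacher's judgment of one student's response to one task."""
    teacher_id: str  # e.g., 'NS1'-'NS12' or 'NNS1'-'NNS12' (hypothetical labels)
    student_id: str  # e.g., 'S01'-'S10'
    task_id: str     # e.g., 'T1'-'T8'
    score: int       # 1-4 on the four-point rating scale
    comment: str     # written justification (NNS comments translated into English)

# 24 teachers x 10 students x 8 tasks = 1,920 possible observations;
# the study reports 1,727 valid ratings.
```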
To decrease the subject expectancy effect, the teachers were told
that the purpose of the study was to investigate teachers’ rating
behavior, and the comparison of different teacher groups was not
explicitly mentioned. The two groups of teachers were therefore
unaware of each other. In addition, only minimal information about
the students (e.g., education level and current visa status) was
provided to the teachers. Meetings with the NS teachers
were held in Montreal, Canada, and meetings with the NNS teach-
ers followed in Daegu, Korea. Each meeting lasted approximately
30 minutes.
5 Data analyses
Both quantitative and qualitative data were collected. The quanti-
tative data consisted of 1,727 valid ratings, awarded by 24 teach-
ers to 80 sample responses by 10 students on eight tasks. Each
[Figure: overview of the data analyses. Quantitative strand: teachers’ internal consistency examined through fit statistics and proportions of large standard residuals. Qualitative strand: typology development of evaluation criteria from 3,295 teachers’ written comments, followed by data transformation (quantification of evaluation features) for cross-comparison.]
see Kim, 2005). The 19 evaluative criteria were compared across the
two teacher groups through a frequency analysis.
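Such a frequency analysis amounts to tallying, for each teacher group, how often each coded criterion appears in the written comments. A minimal sketch, assuming the comments have already been coded with criterion labels (the labels and data below are hypothetical):

```python
from collections import Counter

# Each coded comment: (teacher_group, criterion_label); values below are hypothetical.
coded_comments = [
    ("NS", "fluency"), ("NS", "pronunciation"), ("NNS", "overall language use"),
    ("NNS", "fluency"), ("NS", "overall task accomplishment"),
]

# Tally criterion frequencies separately for each teacher group.
freq = {"NS": Counter(), "NNS": Counter()}
for group, criterion in coded_comments:
    freq[group][criterion] += 1

# Side-by-side comparison across groups for every criterion mentioned.
for criterion in sorted(set(freq["NS"]) | set(freq["NNS"])):
    print(f"{criterion:35s} NS: {freq['NS'][criterion]:3d}  NNS: {freq['NNS'][criterion]:3d}")
```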
$$\pi = \frac{N_u}{N_t} \qquad (1)$$

where $N_u$ = the total number of large standard residuals and $N_t$ = the total number of ratings.

$$P_r = \frac{N_{ur}}{N_{tr}} \qquad (2)$$

where $N_{ur}$ = the number of large standard residuals made by rater $r$ and $N_{tr}$ = the number of ratings made by rater $r$.
An inconsistent rating will occur when the observed propor-
tion exceeds the null proportion beyond the acceptable deviation
(Myford & Wolfe, 2000). Thus, Myford and Wolfe propose that the
frequency of unexpected ratings (Zp) can be calculated using equa-
tion (3). According to them, if a Zp value for a rater is below +2, it
indicates that the unexpected ratings that he or she made are random
error; however, if the value is above +2, the rater is considered to be
exercising an inconsistent rating pattern.
$$Z_p = \frac{P_r - \pi}{\sqrt{\dfrac{\pi - \pi^2}{N_{tr}}}} \qquad (3)$$
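Putting equations (1) to (3) together, the computation can be sketched as follows; the flagging criterion of standardized residuals greater than +2 follows the study’s report below, while the function and variable names are illustrative rather than Myford and Wolfe’s own implementation:

```python
import numpy as np

def zp_indices(residuals_by_rater, threshold=2.0):
    """Myford & Wolfe's Zp index of rating consistency, one value per rater.

    residuals_by_rater: dict mapping rater id -> array of standardized residuals
    for that rater's ratings (an assumed input format).
    """
    # Equation (1): null proportion = total large residuals / total ratings.
    all_res = np.concatenate(list(residuals_by_rater.values()))
    pi = np.mean(all_res > threshold)  # 'large' = standardized residual > +2 here

    zp = {}
    for rater, res in residuals_by_rater.items():
        n_tr = len(res)                  # ratings made by rater r
        p_r = np.mean(res > threshold)   # Equation (2): rater's observed proportion
        # Equation (3): standardize the observed proportion against the null.
        zp[rater] = (p_r - pi) / np.sqrt((pi - pi**2) / n_tr)
    return zp

# A rater with Zp above +2 would be flagged as showing an inconsistent rating pattern.
```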
In this study, an unexpected observation was reported if the stand-
ardized residual was greater than +2, which was the case in 89 out
of a total of 1,727 responses. When rating consistency was exam-
ined, one NS teacher and two NNS teachers were found to exhibit
inconsistent rating patterns, a result similar to what was found in
the fit analysis. The two NNS teachers whose observed Zp values
were greater than +2 were NNS6 and NNS7, who had been flagged
as misfitting teachers by their infit indices. Interestingly, the analy-
sis of NS teachers showed that it was NS9, not NS10, who had Zp
values greater than +2. This may be because NS10 produced only
a small number of unexpected ratings which did not produce large
residuals. That small Zp value indicates that while the teacher gave a
few ratings that were somewhat unexpectedly higher (or lower) than
the model would expect, those ratings were not highly unexpected
(C. Myford, personal communication, May 31, 2005).
Myford and Wolfe (2004a, 2004b) introduced the more advanced
Many-faceted Rasch Measurement application to detect raters’
consistency based on the single rater–rest of the raters (SR/ROR)
correlation. When raters exhibit randomness, they are flagged with
significantly large infit and outfit mean square indices; however, sig-
nificantly large infit and outfit mean square indices may also indicate
other rater effects (Myford & Wolfe, 2004a, 2004b). Thus, Myford
and Wolfe suggested that it is important to examine significantly
low SR/ROR correlations as well. More specifically, they suggested
that randomness will be detected when infit and outfit mean square
indices are significantly larger than 1 and SR/ROR correlations are
significantly lower than those of other raters. Four teachers appeared
to be inconsistent: NS9, NNS6, NNS7, and NNS9 showed not only
large fit indices but also low SR/ROR correlations. When compared
relatively, NS9, NNS7, and NNS9 seemed to be on the borderline
in their consistency, whereas NNS6 was obviously signaled as an
inconsistent teacher.
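Outside a dedicated Rasch program, the SR/ROR statistic can be approximated by correlating each rater’s scores with the mean of the remaining raters’ scores on the same responses. The sketch below assumes a complete rater-by-response score matrix and a simple Pearson formulation, which is a simplification of Myford and Wolfe’s procedure:

```python
import numpy as np

def sr_ror_correlations(ratings):
    """ratings: 2-D array, rows = raters, columns = the same scored responses.

    Returns, for each rater, the correlation between that rater's scores and
    the mean score given by the rest of the raters (a simple SR/ROR proxy).
    """
    corrs = []
    for r in range(ratings.shape[0]):
        rest_mean = np.delete(ratings, r, axis=0).mean(axis=0)  # rest-of-raters mean
        corrs.append(np.corrcoef(ratings[r], rest_mean)[0, 1])
    return corrs

# Raters combining large infit/outfit mean squares with a markedly low SR/ROR
# correlation would be the ones flagged as rating inconsistently (randomly).
```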
In summary, the three different types of statistical approaches
showed converging evidence; most of the NS and NNS teachers
were consistent in their ratings, with one or two teachers from each
group showing inconsistent rating patterns. This result implies that
the two groups rarely differed in terms of internal consistency, and
that the NNS teachers were as dependable as the NS teachers in
assessing students’ oral English performance.
[Figure: task difficulty measures (logits) for Tasks T1–T8, plotted by teacher group (1. NS GROUP; 2. NNS GROUP).]
[Figure: bias analysis z-values for Tasks T1–T8.]
more lenient pattern of ratings, while NS9 showed the exact reverse
pattern; that is, NS9 rated Task 6 significantly more severely. It is
very interesting that one teacher from each group showed the same
bias patterns on Tasks 1, 4, and 7, since it implies that the ratings of
these two teachers may be interchangeable on those tasks.
In summary, the NS and NNS teachers seem to have behaved
similarly in terms of severity, and this is confirmed by both the task
difficulty measures and the two bias analyses. The overall results of
the multiple quantitative analyses also show that the NS and NNS
[Figure: bar chart comparing the frequency of comments (0–300) made by the NS and NNS teacher groups on each of the evaluation criteria.]
specific context in which this study was carried out will make the
interpretations of the study more valid. The use of other qualitative
approaches is also recommended. The only qualitative data col-
lected were written comments, which failed to offer a full account
of the teachers’ in-depth rating behavior. Those behaviors could be
further investigated using verbal protocols or in-depth interviews
for a fuller picture of what the teachers consider effective language
performance. As one of the reviewers pointed out, it might also be
interesting to investigate whether the comments made by the NS
and NNS teachers tap different constructs of underlying oral profi-
ciency and thereby result in different rating scales. Lastly, further
research is suggested to examine the extent to which the semi-
direct oral test and the rating scale employed in this study represent
the construct of underlying oral proficiency.
Acknowledgements
I would like to acknowledge that this research project was funded
by the Social Sciences and Humanities Research Council of Canada
through McGill University’s Institutional Grant. My sincere appre-
ciation goes to Carolyn Turner for her patience, insight, and guid-
ance, which inspired me to complete this research project. I am also
very grateful to Eunice Jang, Alister Cumming, and Merrill Swain
for their valuable comments and suggestions on an earlier version
of this article. Thanks are also due to three anonymous reviewers of
Language Testing for their helpful comments.
V References
Bachman, L. F. & Palmer, A. S. (1996). Language testing in practice. Oxford:
Oxford University Press.
Barnwell, D. (1989). ‘Naïve’ native speakers and judgments of oral proficiency
in Spanish. Language Testing, 6, 152–163.
Brown, A. (1995). The effect of rater variables in the development of an occu-
pation-specific language performance test. Language Testing, 12, 1–15.
Caracelli, V. J. & Greene, J. C. (1993). Data analysis strategies for mixed-
method evaluation designs. Educational Evaluation and Policy Analysis,
15, 195–207.
Caracelli, V. J. & Greene, J. C. (1997). Crafting mixed method evaluation
designs. In Greene, J. C. & Caracelli, V. J., editors, Advances in mixed-
method evaluation: The challenges and benefits of integrating diverse
Notes:
1. ‘Communication’ is defined as an examinee’s ability to both address a given task
and get a message across.
2. A score of 4 does not necessarily mean speech is comparable to that of native
English speakers.
3. No response, or a response of ‘I don’t know’ is automatically rated NR (Not
Ratable).
Appendix B: Definitions and examples of the evaluation criteria

1. Understanding the task: the degree to which a speaker understands the given task
   Examples: ‘Didn’t seem to understand the task.’; ‘Didn’t understand everything about the task.’

2. Overall task accomplishment: the degree to which a speaker accomplishes the general demands of the task
   Examples: ‘Generally accomplished the task.’; ‘Task not really well accomplished.’; ‘Successfully accomplished task.’

3. Strength of argument: the degree to which the argument of the response is robust
   Examples: ‘Good range of points raised.’; ‘Good statement of main reason presented.’; ‘Arguments quite strong’

4. Accuracy of transferred information: the degree to which a speaker transfers the given information accurately
   Examples: ‘Misinterpretation of information (e.g., graduate renewals for undergrads, $50 a day for book overdue?)’; ‘Incorrect information (e.g., ‘9pm’ instead of ‘6pm’)’

5. Topic relevance: the degree to which the content of the response is relevant to the topic
   Examples: ‘Not all points relevant’; ‘Suddenly addressing irrelevant topic (i.e., focusing on physically harmful effects of laptops rather than on harmful effects of the internet)’

6. Overall language use: the degree to which the language component of the response is of good and appropriate quality
   Examples: ‘Generally good use of language’; ‘Native-like language’; ‘Very limited language’

7. Vocabulary: the degree to which vocabulary used in the response is of good and appropriate quality
   Examples: ‘Good choice of vocabulary’; ‘Some unusual vocabulary choices (e.g., he crossed a girl.)’

8. Pronunciation: the degree to which pronunciation of the response is of good quality and clarity
   Examples: ‘Native-like pronunciation’; ‘Pronunciation difficulty (e.g., l/r, d/t, vowels, i/e)’; ‘Mispronunciation of some words (e.g., ‘circulation’)’

9. Fluency: the degree to which the response is fluent without too much hesitation
   Examples: ‘Choppy, halted’; ‘Pausing, halting, stalling – periods of silence’